scott-routledge2 commented on code in PR #43661:
URL: https://github.com/apache/arrow/pull/43661#discussion_r2465960489
##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -649,8 +718,17 @@ Result<RecordBatchGenerator> ParquetFileFormat::ScanBatchesAsync(
   ARROW_ASSIGN_OR_RAISE(auto generator,
                         reader->GetRecordBatchGenerator(reader, row_groups,
                                                         column_projection,
                                                         cpu_executor,
                                                         rows_to_readahead));
+  // We need to skip casting the dictionary columns since the dataset_schema doesn't
+  // have the dictionary-encoding information. The Parquet reader will return them
+  // with the dictionary type, which is what we eventually want.
+  const std::unordered_set<std::string>& dict_cols =
+      parquet_fragment->parquet_format_.reader_options.dict_columns;
+  // Casting before slicing is more efficient: casts on slices might require wasteful
+  // allocations and computation.
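
As context for the comment below, here is a minimal, self-contained sketch of what "skip casting the dictionary columns" could look like when reconciling a decoded batch against the dataset schema. `CastSkippingDictColumns` is a hypothetical helper written for illustration, not the PR's actual implementation:

```cpp
#include <memory>
#include <string>
#include <unordered_set>
#include <vector>

#include <arrow/api.h>
#include <arrow/compute/cast.h>

// Hypothetical helper (illustration only): cast each column of `batch` to the
// type declared in `dataset_schema`, except columns named in `dict_cols`,
// which keep the dictionary type that the Parquet reader produced.
arrow::Result<std::shared_ptr<arrow::RecordBatch>> CastSkippingDictColumns(
    const std::shared_ptr<arrow::RecordBatch>& batch,
    const std::shared_ptr<arrow::Schema>& dataset_schema,
    const std::unordered_set<std::string>& dict_cols) {
  std::vector<std::shared_ptr<arrow::Field>> fields;
  std::vector<std::shared_ptr<arrow::Array>> columns;
  for (int i = 0; i < batch->num_columns(); ++i) {
    const std::string& name = batch->schema()->field(i)->name();
    std::shared_ptr<arrow::Array> column = batch->column(i);
    std::shared_ptr<arrow::Field> target = dataset_schema->GetFieldByName(name);
    if (target != nullptr && dict_cols.count(name) == 0 &&
        !column->type()->Equals(*target->type())) {
      // Ordinary column whose physical type differs from the dataset schema:
      // cast it on the full batch, before any slicing (see the comment above).
      ARROW_ASSIGN_OR_RAISE(column,
                            arrow::compute::Cast(*column, target->type()));
    }
    // Dictionary columns (and already-matching columns) pass through as-is.
    fields.push_back(arrow::field(name, column->type()));
    columns.push_back(std::move(column));
  }
  return arrow::RecordBatch::Make(arrow::schema(fields), batch->num_rows(),
                                  std::move(columns));
}
```

The key point is that columns listed in `dict_cols` keep the reader's dictionary type rather than being cast to the plain type recorded in the dataset schema.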
Review Comment:
When casting the string offsets buffer, we allocate `offset + length + 1` elements, memset the first `offset` elements to zero, and perform the cast on only `length` elements. So with `batch_size = 1000`, a row group of 10,000 rows would require allocating offsets buffers of 1001, 2001, ..., 10,001 elements to cast all of its batches: 55,010 offset entries in total, versus 10,001 for a single cast of the whole row group.
https://github.com/apache/arrow/blob/6a2e19a852b367c72d7b12da4d104456491ed8b7/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L251
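
To make the arithmetic concrete, here is a standalone sketch (my own illustration, not code from the PR; the `Demo` function and the `utf8` -> `large_utf8` cast target are chosen just to exercise the offsets kernel linked above):

```cpp
#include <iostream>
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/compute/cast.h>

// Illustrative only: cast-after-slice pays for an offsets buffer of
// offset + length + 1 entries per batch, while cast-before-slice pays
// once for the whole array and then slices the result zero-copy.
arrow::Status Demo() {
  arrow::StringBuilder builder;
  for (int i = 0; i < 10000; ++i) {
    ARROW_RETURN_NOT_OK(builder.Append("row-" + std::to_string(i)));
  }
  std::shared_ptr<arrow::Array> array;
  ARROW_RETURN_NOT_OK(builder.Finish(&array));

  const int64_t batch_size = 1000;

  // Cast-after-slice: per the kernel cited above, each call allocates
  // offset + length + 1 offset entries: 1001, 2001, ..., 10,001.
  for (int64_t offset = 0; offset < array->length(); offset += batch_size) {
    std::shared_ptr<arrow::Array> slice = array->Slice(offset, batch_size);
    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> casted,
                          arrow::compute::Cast(*slice, arrow::large_utf8()));
    (void)casted;
  }

  // Cast-before-slice: one allocation of 10,001 offset entries, then
  // zero-copy slices of the already-cast result.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> casted_all,
                        arrow::compute::Cast(*array, arrow::large_utf8()));
  for (int64_t offset = 0; offset < casted_all->length(); offset += batch_size) {
    std::shared_ptr<arrow::Array> slice = casted_all->Slice(offset, batch_size);
    (void)slice;
  }
  return arrow::Status::OK();
}

int main() {
  arrow::Status st = Demo();
  if (!st.ok()) std::cerr << st.ToString() << std::endl;
  return st.ok() ? 0 : 1;
}
```

The first loop triggers exactly the per-batch allocations enumerated above, which is why the diff casts the full decoded batch before slicing.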