pitrou commented on code in PR #43661:
URL: https://github.com/apache/arrow/pull/43661#discussion_r2451530134
##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -649,8 +718,17 @@ Result<RecordBatchGenerator> ParquetFileFormat::ScanBatchesAsync(
   ARROW_ASSIGN_OR_RAISE(auto generator,
                         reader->GetRecordBatchGenerator(reader, row_groups,
                                                         column_projection, cpu_executor,
                                                         rows_to_readahead));
+  // We need to skip casting the dictionary columns since the dataset_schema doesn't
+  // have the dictionary-encoding information. The Parquet reader will return them with
+  // the dictionary type, which is what we eventually want.
+  const std::unordered_set<std::string>& dict_cols =
+      parquet_fragment->parquet_format_.reader_options.dict_columns;
+  // Casting before slicing is more efficient. Casts on slices might require wasteful
+  // allocations and computation.
Review Comment:
Can you explain what kind of wasteful allocations and computation would be required when casting a slice?