pitrou commented on code in PR #43661:
URL: https://github.com/apache/arrow/pull/43661#discussion_r2451530134
##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -649,8 +718,17 @@ Result<RecordBatchGenerator> ParquetFileFormat::ScanBatchesAsync(
   ARROW_ASSIGN_OR_RAISE(auto generator,
                         reader->GetRecordBatchGenerator(reader, row_groups,
                                                         column_projection, cpu_executor,
                                                         rows_to_readahead));
+  // We need to skip casting the dictionary columns since the dataset_schema doesn't
+  // have the dictionary-encoding information. The Parquet reader will return them with
+  // the dictionary type, which is what we eventually want.
+  const std::unordered_set<std::string>& dict_cols =
+      parquet_fragment->parquet_format_.reader_options.dict_columns;
+  // Casting before slicing is more efficient. Casts on slices might require wasteful
+  // allocations and computation.
Review Comment:
Can you explain what kind of wasteful allocations and computation would be required when casting a slice?