scott-routledge2 commented on code in PR #43661:
URL: https://github.com/apache/arrow/pull/43661#discussion_r2466080647
##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -627,6 +693,9 @@ Result<RecordBatchGenerator> ParquetFileFormat::ScanBatchesAsync(
       [this, options, parquet_fragment, pre_filtered,
        row_groups](const std::shared_ptr<parquet::arrow::FileReader>& reader) mutable
       -> Result<RecordBatchGenerator> {
+        // Since we already do the batching through the SlicingGenerator, we don't need the
+        // reader to batch its output.
+        reader->set_batch_size(std::numeric_limits<int64_t>::max());
Review Comment:
Based on my understanding of this example, wouldn't calling `ScanBatchesAsync` create a [`RowGroupGenerator`](https://github.com/apache/arrow/blob/6a2e19a852b367c72d7b12da4d104456491ed8b7/cpp/src/parquet/arrow/reader.cc#L1204), which would then call [`ReadOneRowGroup`](https://github.com/apache/arrow/blob/6a2e19a852b367c72d7b12da4d104456491ed8b7/cpp/src/parquet/arrow/reader.cc#L1170) and read the entire row group into an in-memory table anyway? The batch size here seems to just control the chunk size of that table.
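For reference, here's a minimal sketch of what I mean, using the synchronous `ReadRowGroup` path (the file path, helper name, and the single-chunk expectation are my assumptions, not code from this PR):
```cpp
#include <limits>
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

// Hypothetical demo: the row group is fully materialized either way;
// batch_size only changes how finely the resulting table is chunked.
arrow::Status ReadWholeRowGroup(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));
  ARROW_ASSIGN_OR_RAISE(
      auto reader, parquet::arrow::OpenFile(input, arrow::default_memory_pool()));
  reader->set_batch_size(std::numeric_limits<int64_t>::max());
  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadRowGroup(/*i=*/0, &table));
  // With the maximum batch size I'd expect each column to come back as a
  // single chunk (assumption based on my reading of ReadOneRowGroup).
  return arrow::Status::OK();
}
```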
The goal of this PR was to reduce the overhead of casting string offset buffers batch by batch (which would require O(N^2) memory/compute, where N is the length of a row group) by performing the cast on the entire table/row group at once. Since `ReadOneRowGroup` means we will have the whole row group in memory at some point anyway, I don't think this represents a regression. However, I understand wanting to keep the intended semantics of `ScanBatchesAsync` the same. Maybe instead we can perform the cast at a different layer of the stack, e.g. at the reader level?
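To sketch the shape of that alternative (the helper name `CastRowGroupOnce`, the column index, and the target type are hypothetical placeholders, not an existing API): cast the fully-read row-group table once, before any slicing into batches:
```cpp
#include <memory>

#include <arrow/api.h>
#include <arrow/compute/cast.h>

// Hypothetical helper: cast one column of a fully-read row-group table a
// single time, so the offset buffers are rewritten once per row group
// rather than once per sliced batch.
arrow::Result<std::shared_ptr<arrow::Table>> CastRowGroupOnce(
    const std::shared_ptr<arrow::Table>& row_group, int column_index,
    const std::shared_ptr<arrow::DataType>& to_type) {
  ARROW_ASSIGN_OR_RAISE(
      auto casted,
      arrow::compute::Cast(row_group->column(column_index), to_type));
  auto new_field = row_group->field(column_index)->WithType(to_type);
  return row_group->SetColumn(column_index, new_field, casted.chunked_array());
}
```
That would keep `ScanBatchesAsync` emitting batches of the configured size while the cast itself happens once on the whole row group.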
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]