sahil1105 commented on code in PR #43661:
URL: https://github.com/apache/arrow/pull/43661#discussion_r1736873205


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -617,6 +684,9 @@ Result<RecordBatchGenerator> ParquetFileFormat::ScanBatchesAsync(
       [this, options, parquet_fragment, pre_filtered,
        row_groups](const std::shared_ptr<parquet::arrow::FileReader>& reader) mutable
       -> Result<RecordBatchGenerator> {
+    // Since we already do the batching through the SlicingGenerator, we don't need the
+    // reader to batch its output.
+    reader->set_batch_size(std::numeric_limits<int64_t>::max());
     // Ensure that parquet_fragment has FileMetaData

Review Comment:
   I think we have that problem regardless of the reader's batch size. We pass this reader to `reader->GetRecordBatchGenerator` (a few lines down), which eventually creates a `RowGroupGenerator` (https://github.com/apache/arrow/blob/6a2e19a852b367c72d7b12da4d104456491ed8b7/cpp/src/parquet/arrow/reader.cc#L1204). If we look at the implementation of `FetchNext` for `RowGroupGenerator`, it essentially calls `ReadOneRowGroup`, which reads the entire row group into a table and then creates an output stream over it using `TableBatchReader` (https://github.com/apache/arrow/blob/6a2e19a852b367c72d7b12da4d104456491ed8b7/cpp/src/parquet/arrow/reader.cc#L1170). So we are already reading an entire row group into a single table; the `batch_size` just creates a reader on top of it that yields zero-copy slices. That is the same work the SlicingGenerator does, so it is redundant. In my opinion, it's better to set the batch size to the maximum `int64_t` value so that the reader returns one batch per row group, and let the SlicingGenerator perform the batching.
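
   To make the redundancy concrete, here is a minimal standalone sketch of what `ReadOneRowGroup` plus `TableBatchReader` effectively do (this is not the dataset code path itself; the file name `example.parquet` and the 1024-row chunk size are made-up illustration values): the whole row group is materialized as one `Table`, and `batch_size` only controls how that table is re-sliced into zero-copy record batches afterwards, i.e. the same slicing the SlicingGenerator already performs downstream.

```cpp
#include <iostream>
#include <memory>

#include "arrow/io/file.h"
#include "arrow/memory_pool.h"
#include "arrow/record_batch.h"
#include "arrow/result.h"
#include "arrow/status.h"
#include "arrow/table.h"
#include "parquet/arrow/reader.h"

arrow::Status Demo() {
  // Illustrative input file; any local Parquet file would do.
  ARROW_ASSIGN_OR_RAISE(auto infile,
                        arrow::io::ReadableFile::Open("example.parquet"));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

  // This mirrors what ReadOneRowGroup does internally: the whole row group is
  // materialized as a single Table ...
  std::shared_ptr<arrow::Table> row_group_table;
  ARROW_RETURN_NOT_OK(reader->ReadRowGroup(/*i=*/0, &row_group_table));

  // ... and the batch size only controls how that table is re-sliced afterwards.
  // TableBatchReader hands out zero-copy slices, the same kind of slicing the
  // SlicingGenerator already applies downstream.
  arrow::TableBatchReader batch_reader(*row_group_table);
  batch_reader.set_chunksize(1024);  // arbitrary chunk size for illustration

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader.ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    std::cout << "slice of " << batch->num_rows() << " rows\n";
  }
  return arrow::Status::OK();
}

int main() {
  arrow::Status st = Demo();
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  return 0;
}
```

   With the batch size set to the maximum `int64_t`, the table comes back as (roughly) one batch per row group, and the SlicingGenerator alone decides the output batch size.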



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
