scott-routledge2 commented on code in PR #43661:
URL: https://github.com/apache/arrow/pull/43661#discussion_r2466080647
##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -627,6 +693,9 @@ Result<RecordBatchGenerator> ParquetFileFormat::ScanBatchesAsync(
       [this, options, parquet_fragment, pre_filtered,
        row_groups](const std::shared_ptr<parquet::arrow::FileReader>& reader) mutable
       -> Result<RecordBatchGenerator> {
+        // Since we already do the batching through the SlicingGenerator, we don't need the
+        // reader to batch its output.
+        reader->set_batch_size(std::numeric_limits<int64_t>::max());
Review Comment:
Based on my understanding of this example, wouldn't calling `ScanBatchesAsync` create a [`RowGroupGenerator`](https://github.com/apache/arrow/blob/6a2e19a852b367c72d7b12da4d104456491ed8b7/cpp/src/parquet/arrow/reader.cc#L1204), which would then call [`ReadOneRowGroup`](https://github.com/apache/arrow/blob/6a2e19a852b367c72d7b12da4d104456491ed8b7/cpp/src/parquet/arrow/reader.cc#L1170) and read the entire row group into an in-memory table anyway? The batch size here seems to just control the chunk size of that table.
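For reference, here's a minimal sketch of what I mean, using the synchronous `ReadRowGroup` path (the file path, helper name, and the single-chunk expectation are my assumptions, not code from this PR):
```cpp
#include <limits>
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

// Hypothetical demo: the row group is fully materialized either way;
// batch_size only changes how finely the resulting table is chunked.
arrow::Status ReadWholeRowGroup(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));
  ARROW_ASSIGN_OR_RAISE(
      auto reader, parquet::arrow::OpenFile(input, arrow::default_memory_pool()));
  reader->set_batch_size(std::numeric_limits<int64_t>::max());
  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadRowGroup(/*i=*/0, &table));
  // With the maximum batch size I'd expect each column to come back as a
  // single chunk (assumption based on my reading of ReadOneRowGroup).
  return arrow::Status::OK();
}
```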
The goal of this PR was to reduce the overhead of casting string offset buffers batch by batch (which would require O(N^2) memory/compute, where N is the length of a row group) by performing the cast on the entire table/row group at once. Since `ReadOneRowGroup` means we will have the whole row group in memory at some point anyway, I don't think this represents a regression. However, I understand wanting to keep the intended semantics of `ScanBatchesAsync` the same. Maybe instead we can perform the cast at a different layer of the stack, e.g. at the reader level?
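To sketch the shape of that alternative (the helper name `CastRowGroupOnce`, the column index, and the target type are hypothetical placeholders, not an existing API): cast the fully-read row-group table once, before any slicing into batches:
```cpp
#include <memory>

#include <arrow/api.h>
#include <arrow/compute/cast.h>

// Hypothetical helper: cast one column of a fully-read row-group table a
// single time, so the offset buffers are rewritten once per row group
// rather than once per sliced batch.
arrow::Result<std::shared_ptr<arrow::Table>> CastRowGroupOnce(
    const std::shared_ptr<arrow::Table>& row_group, int column_index,
    const std::shared_ptr<arrow::DataType>& to_type) {
  ARROW_ASSIGN_OR_RAISE(
      auto casted,
      arrow::compute::Cast(row_group->column(column_index), to_type));
  auto new_field = row_group->field(column_index)->WithType(to_type);
  return row_group->SetColumn(column_index, new_field, casted.chunked_array());
}
```
That would keep `ScanBatchesAsync` emitting batches of the configured size while the cast itself happens once on the whole row group.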
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]