pitrou commented on code in PR #43661:
URL: https://github.com/apache/arrow/pull/43661#discussion_r2451510870
##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -627,6 +693,9 @@ Result<RecordBatchGenerator> ParquetFileFormat::ScanBatchesAsync(
       [this, options, parquet_fragment, pre_filtered,
        row_groups](const std::shared_ptr<parquet::arrow::FileReader>& reader) mutable
           -> Result<RecordBatchGenerator> {
+        // Since we already do the batching through the SlicingGenerator, we don't
+        // need the reader to batch its output.
+        reader->set_batch_size(std::numeric_limits<int64_t>::max());
Review Comment:
Well, if you have a very large row group (say 1 GB or more), what this will do is read the entire row group at once and then slice it, instead of generating batches progressively. So this sounds like a bad idea to me.

The entire purpose of `ScanBatchesAsync` is to generate batches progressively so that the consumer can process them without waiting for the entire file to be read. This both hides IO latencies and reduces the memory footprint (say you're computing the average of a column: it would be wasteful to load the entire column at once instead of keeping a running average over smaller batches).
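
For illustration, here is a minimal consumer sketch (not code from this PR, and not the scanner itself) of the running-average point: with a bounded batch size, peak memory stays proportional to one batch rather than to a whole row group. The `RunningMean` name, the 64K batch size, and the assumption that column 0 is float64 are invented for the example, and the exact `GetRecordBatchReader` overload varies between Arrow versions:

```cpp
#include <cstdint>
#include <memory>
#include <numeric>
#include <vector>

#include "arrow/api.h"
#include "parquet/arrow/reader.h"

// Hypothetical consumer: computes the mean of column 0 batch by batch, so the
// whole column never has to be materialized in memory at once.
arrow::Result<double> RunningMean(parquet::arrow::FileReader* reader) {
  // Keep a bounded batch size instead of std::numeric_limits<int64_t>::max().
  reader->set_batch_size(64 * 1024);

  // Stream every row group through a RecordBatchReader.
  std::vector<int> row_groups(reader->num_row_groups());
  std::iota(row_groups.begin(), row_groups.end(), 0);
  std::unique_ptr<arrow::RecordBatchReader> batches;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(row_groups, &batches));

  double sum = 0;
  int64_t count = 0;
  while (true) {
    std::shared_ptr<arrow::RecordBatch> batch;
    ARROW_RETURN_NOT_OK(batches->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // Assumes column 0 is float64; a real consumer would check the schema.
    auto values = std::static_pointer_cast<arrow::DoubleArray>(batch->column(0));
    for (int64_t i = 0; i < values->length(); ++i) {
      if (values->IsValid(i)) {
        sum += values->Value(i);
        ++count;
      }
    }
  }
  return count > 0 ? sum / count : 0.0;
}
```

The same consumer with `set_batch_size(std::numeric_limits<int64_t>::max())` would only see its first batch after an entire row group has been decoded, which is exactly the latency and memory behavior described above.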