pitrou commented on code in PR #43661:
URL: https://github.com/apache/arrow/pull/43661#discussion_r2451510870


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -627,6 +693,9 @@ Result<RecordBatchGenerator> ParquetFileFormat::ScanBatchesAsync(
       [this, options, parquet_fragment, pre_filtered,
        row_groups](const std::shared_ptr<parquet::arrow::FileReader>& reader) mutable
       -> Result<RecordBatchGenerator> {
+    // Since we already do the batching through the SlicingGenerator, we don't need the
+    // reader to batch its output.
+    reader->set_batch_size(std::numeric_limits<int64_t>::max());

Review Comment:
   Well, if you have a very large row group (say 1 GB or more), what this will do is read the entire file at once and then slice it, instead of generating batches progressively.
   
   So this sounds like a bad idea to me. The entire purpose of `ScanBatchesAsync` is to generate batches progressively so that the consumer can process them without waiting for the entire file to be read. This both hides IO latencies and reduces the memory footprint (say you're computing the average of a column: it would be wasteful to load the entire column at once instead of keeping a running average over smaller batches).
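   
   To make that concrete, here is a minimal sketch (not part of the PR, and assuming the common `parquet::arrow::FileReader` API, whose exact signatures vary between Arrow versions) of a consumer that streams batches with a bounded batch size and keeps a running average, so only one small batch needs to be materialized at a time:

```cpp
#include <cstdint>
#include <iostream>
#include <memory>
#include <string>

#include "arrow/api.h"
#include "arrow/io/file.h"
#include "parquet/arrow/reader.h"

// Sketch only: stream a Parquet file batch by batch and keep a running
// average of the first column (assumed here to be a float64 column).
arrow::Status RunningAverage(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(input, arrow::default_memory_pool(), &reader));

  // A bounded batch size: each iteration below only materializes this many
  // rows, instead of the whole row group / file at once.
  reader->set_batch_size(64 * 1024);

  std::unique_ptr<arrow::RecordBatchReader> batch_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(&batch_reader));

  double sum = 0.0;
  int64_t count = 0;
  while (true) {
    std::shared_ptr<arrow::RecordBatch> batch;
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    auto values =
        std::static_pointer_cast<arrow::DoubleArray>(batch->column(0));
    for (int64_t i = 0; i < values->length(); ++i) {
      if (values->IsValid(i)) {
        sum += values->Value(i);
        ++count;
      }
    }
    // `batch` goes out of scope here, so its memory can be reclaimed before
    // the next batch is decoded.
  }
  std::cout << "average = " << (count > 0 ? sum / count : 0.0) << std::endl;
  return arrow::Status::OK();
}
```

   With `set_batch_size(std::numeric_limits<int64_t>::max())`, the same loop would receive the whole row group (or file) as a single batch, losing both the memory bound and the overlap of IO with computation.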


