[GitHub] [arrow] westonpace commented on pull request #12228: ARROW-15410: [C++][Datasets] Improve memory usage of datasets API when scanning parquet

GitBox Fri, 22 Apr 2022 15:41:19 -0700


westonpace commented on PR #12228:
URL: https://github.com/apache/arrow/pull/12228#issuecomment-1106960830


   @lidavidm 
   
   Hmm... doing some more testing on this, I think it might be non-ideal in a
   few situations (S3, low number of files, smallish row groups). This is
   because we always read only 1 row group ahead, so we will issue at most
   2 reads in parallel for a single file.
   
   However, the old behavior was also unsustainable, as it would kick off
   dozens of parallel reads and run out of memory.
   
   The ideal approach would be to keep track of how many rows we have "in
   flight", issue reads until we have `batch_size * batch_readahead` rows in
   flight, and then pause. I'm going to work on this but, as we are close to
   release, I'd prefer to move forward with this
   sometimes-slower-but-usually-safer approach and put the ideal fix in a
   follow-up.
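
   For illustration only, here is a minimal sketch of that row-counting idea.
   `RowReadaheadLimiter` and `IssueReads` are hypothetical names, not existing
   Arrow classes or functions, and the sketch assumes the row count of each
   row group is known up front (e.g. from the Parquet metadata). It ignores
   threading and only shows the bookkeeping: reads are issued while fewer than
   `batch_size * batch_readahead` rows are in flight, then issuing pauses
   until rows are delivered downstream.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical helper (not part of the Arrow API): tracks how many rows are
// "in flight" and decides whether another row-group read may be started.
class RowReadaheadLimiter {
 public:
  RowReadaheadLimiter(int64_t batch_size, int64_t batch_readahead)
      : max_rows_in_flight_(batch_size * batch_readahead) {}

  // Another read may be issued while we are still below the row budget.
  bool CanIssue() const { return rows_in_flight_ < max_rows_in_flight_; }

  // Call when a read for a row group with `num_rows` rows is started.
  void OnReadIssued(int64_t num_rows) { rows_in_flight_ += num_rows; }

  // Call when `num_rows` rows have been handed to the consumer.
  void OnRowsDelivered(int64_t num_rows) { rows_in_flight_ -= num_rows; }

  int64_t rows_in_flight() const { return rows_in_flight_; }

 private:
  int64_t max_rows_in_flight_;
  int64_t rows_in_flight_ = 0;
};

// Starts reads for pending row groups (each entry is a row count) until the
// budget is reached, then pauses. Returns the number of reads started.
inline int IssueReads(RowReadaheadLimiter& limiter,
                      std::deque<int64_t>& pending_row_groups) {
  int started = 0;
  while (!pending_row_groups.empty() && limiter.CanIssue()) {
    limiter.OnReadIssued(pending_row_groups.front());
    pending_row_groups.pop_front();
    ++started;
  }
  return started;
}
```

   With smallish row groups this lets several reads run in parallel for a
   single file, while the row budget still bounds memory. Note that in this
   sketch the last read issued can overshoot the budget by up to one row
   group.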


