westonpace commented on PR #12228: URL: https://github.com/apache/arrow/pull/12228#issuecomment-1106960830
@lidavidm Hmm... after doing some more testing on this, I think it may be non-ideal in a few situations (S3, a low number of files, smallish row groups). Because we always read only one row group ahead, we will have at most two reads in flight for a single file. However, the old behavior was also unsustainable, as it would have kicked off dozens of parallel reads and run out of memory. The ideal approach would be to keep track of how many rows we have "in flight", issue reads until we have batch_size * batch_readahead rows in flight, and then pause. I'm going to work on this but, as we are close to a release, I'd prefer to move forward with this sometimes-slower-but-usually-safer approach and put the ideal fix in a follow-up.
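
To make the "rows in flight" idea concrete, here is a rough sketch of the bookkeeping. It is only an illustration, not the code in this PR or in Arrow: `RowReadaheadLimiter`, `add_row_groups`, `pump`, and `on_rows_consumed` are hypothetical names, and actual I/O is replaced by simple counters. The only parts taken from the comment are the budget (`batch_size * batch_readahead`) and the rule "issue reads until the budget is reached, then pause".

```python
from collections import deque


class RowReadaheadLimiter:
    """Hypothetical helper: caps how many rows may be in flight at once."""

    def __init__(self, batch_size, batch_readahead):
        # Row budget as described in the comment above.
        self.max_rows_in_flight = batch_size * batch_readahead
        self.rows_in_flight = 0
        # Row counts of row groups that have not been issued yet.
        self.pending = deque()

    def add_row_groups(self, row_counts):
        """Queue row groups (given by their row counts) and start what fits."""
        self.pending.extend(row_counts)
        return self.pump()

    def pump(self):
        """Issue reads while below the budget; return the row counts started.

        A read is started whenever we are below the budget, so the last
        row group issued may push the total slightly above it.
        """
        started = []
        while self.pending and self.rows_in_flight < self.max_rows_in_flight:
            rows = self.pending.popleft()
            self.rows_in_flight += rows
            started.append(rows)
        return started

    def on_rows_consumed(self, rows):
        """Release rows delivered downstream, then resume issuing reads."""
        self.rows_in_flight -= rows
        return self.pump()
```

With small row groups this lets several reads for the same file run in parallel, while large row groups still pause after one or two reads, which is the memory behavior the current one-row-group-ahead approach is protecting.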