westonpace opened a new pull request #11294:
URL: https://github.com/apache/arrow/pull/11294


   While scanning, we do our best to read ahead across multiple files, so we 
will read files 1, 2, 3, and 4 all at the same time.  This helps to maintain 
bandwidth when some files hit a snag (which sometimes happens on AWS).  However, 
when doing an ordered scan, this can cause backpressure to break down and 
memory use to explode when there is a slow consumer.
   
   The sequencer (placed at the end of the pipeline) can get into a situation 
where it pulls aggressively from files 2, 3, and 4 while waiting for the next 
chunk from file 1.  Since the sequencer is consuming the batches, the 
backpressure mechanism thinks they are being consumed.  In reality, the actual 
consumer has not taken them and the batches pile up at the sequencer.
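
   To make the failure mode concrete, here is a toy Python model (not Arrow 
code; the function name and batch numbering are made up for illustration) of a 
sequencer that sits at the end of the pipeline and buffers early arrivals until 
the next expected batch shows up:

```python
def peak_buffered_at_sequencer(arrival_order):
    """Emit batches in index order, holding early arrivals in a buffer.

    `arrival_order` is the order in which batch indices reach the sequencer.
    Returns (emitted order, largest number of batches held at once).
    """
    pending = {}      # batches that arrived before their turn
    next_index = 0    # the batch index the consumer must see next
    emitted = []
    peak = 0
    for index in arrival_order:
        pending[index] = f"batch-{index}"
        peak = max(peak, len(pending))
        # Release every batch that is now in order.
        while next_index in pending:
            emitted.append(next_index)
            del pending[next_index]
            next_index += 1
    return emitted, peak


# Hypothetical scan: 4 files x 3 batches; indices 0-2 belong to file 1.
# File 1 stalls, so every batch from files 2-4 arrives first.
arrivals = list(range(3, 12)) + [0, 1, 2]
emitted, peak = peak_buffered_at_sequencer(arrivals)
```

   In this model nine of the twelve batches are already sitting in the 
sequencer's buffer before file 1 delivers anything, and nothing upstream can 
tell, because from its point of view every batch was "consumed".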
   
   This PR introduces one possible solution (and it may be the only possible 
solution), which is to sequence the batches at merge time (early in the 
pipeline).  The sequencer won't need to pull aggressively and backpressure will 
be maintained.  This significantly reduces (but does not eliminate) the 
amount of file readahead we do in ordered scans.  We can worry about that if it 
ends up being a bottleneck at some point, but for now I think it is better that 
we do not explode RAM.
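
   As a rough sketch of the idea (again a toy Python model, not the code in 
this PR; `read_file` and `sequenced_merge` are hypothetical names, and the 
limited readahead that remains is left out), merging the files strictly in 
order means batches are only produced as fast as the consumer asks for them:

```python
from itertools import chain, islice


def read_file(file_id, num_batches, log):
    """Hypothetical lazy file reader: records each batch as it is produced."""
    for i in range(num_batches):
        log.append((file_id, i))
        yield (file_id, i)


def sequenced_merge(readers):
    """Drain the readers strictly in order, so output is already sequenced."""
    return chain.from_iterable(readers)


log = []  # every batch any reader has produced so far
readers = [read_file(file_id, 3, log) for file_id in range(1, 5)]
merged = sequenced_merge(readers)

# A slow consumer that has only asked for two batches so far.
consumed = list(islice(merged, 2))
```

   Here `log` contains only the two batches the consumer actually took, all 
from file 1.  Files 2, 3, and 4 have not been touched, which is the reduction 
in readahead mentioned above.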
   
   This builds on ARROW-13611 and will remain in draft until that PR has merged.



