westonpace opened a new pull request #11285:
URL: https://github.com/apache/arrow/pull/11285


   This PR adds backpressure back into the asynchronous scanner.  It creates an 
AsyncToggle which can be shared between the push-based sink and the pull-based 
scanner.  The sink will close the toggle when its buffer fills up and the 
scanner will pause delivering items when the toggle is closed.
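
   As a rough illustration of the toggle idea described above, here is a minimal sketch in Python's asyncio.  It is not the Arrow C++ implementation: the names `Toggle`, `Sink`, `scanner`, `consumer`, and `run_demo` are hypothetical, and the buffer capacity is an arbitrary assumption.  It only shows the shape of the interaction: the sink closes a shared toggle when its buffer is full, and the scanner waits on the toggle before delivering each item.

```python
import asyncio


class Toggle:
    """Hypothetical stand-in for a shared toggle; starts open."""

    def __init__(self):
        self._open = asyncio.Event()
        self._open.set()

    def close(self):
        self._open.clear()

    def open(self):
        self._open.set()

    async def wait_for_open(self):
        await self._open.wait()


class Sink:
    """Push-based sink with a bounded buffer (capacity is an assumption)."""

    def __init__(self, toggle, capacity):
        self.toggle = toggle
        self.capacity = capacity
        self.buffer = []
        self.max_buffered = 0

    def push(self, item):
        self.buffer.append(item)
        self.max_buffered = max(self.max_buffered, len(self.buffer))
        if len(self.buffer) >= self.capacity:
            self.toggle.close()  # buffer is full: ask the scanner to pause

    def pop(self):
        item = self.buffer.pop(0)
        if len(self.buffer) < self.capacity:
            self.toggle.open()  # room again: let the scanner resume
        return item


async def scanner(toggle, sink, n):
    """Pull-based producer that pauses while the toggle is closed."""
    for i in range(n):
        await toggle.wait_for_open()
        sink.push(i)
        await asyncio.sleep(0)  # yield to other tasks


async def consumer(sink, n):
    """Slow consumer that drains the sink one item at a time."""
    received = []
    while len(received) < n:
        if sink.buffer:
            received.append(sink.pop())
        await asyncio.sleep(0.001)
    return received


async def run_demo(n=20, capacity=4):
    toggle = Toggle()
    sink = Sink(toggle, capacity)
    producer = asyncio.create_task(scanner(toggle, sink, n))
    received = await consumer(sink, n)
    await producer
    return received, sink.max_buffered
```

   In this sketch the buffer never grows past `capacity`, because the scanner only pushes after the toggle reports open, and items still arrive in order.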
   
   This PR adds the feature in a way that bypasses the exec plan's backpressure 
mechanisms as those have not been fully fleshed out and I still am not sure 
what direction we are planning to go with that.  Instead, backpressure is 
almost completely handled outside of the compute space.
   
   I've got the same mechanism working for dataset writes, but I don't want to 
hold up this PR while I wait for the write node to merge, so I have created 
ARROW-14191 to track that work.
   
   Currently backpressure is broken for ordered scans.  It turns out this has 
always been the case for the asynchronous scanner, even before it moved to the 
exec plan.  The root cause is that the merge generator will keep reading from 
files 2-N if the read on file 1 is slow.  I have created a test case which 
demonstrates this but will defer the fix to ARROW-14192.
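
   To make the ordered-scan problem concrete, here is a toy, tick-based simulation in Python.  It is not the merge generator itself: the function name `simulate_ordered_merge` and all of its parameters are hypothetical, and it assumes unbounded readahead on every file.  File 0 is slow while the others produce one batch per tick, and because output must be in file order, everything read from the later files has to sit in memory until file 0 finishes.

```python
def simulate_ordered_merge(num_files=4, batches_per_file=5, slow_ticks=10):
    """Toy model: file 0 needs `slow_ticks` ticks per batch, others need 1."""
    ticks_per_batch = [slow_ticks] + [1] * (num_files - 1)
    produced = [0] * num_files
    queued = [[] for _ in range(num_files)]  # read but not yet emitted
    emitted = []
    current = 0  # only batches from this file may be emitted
    peak_queued = 0
    tick = 0
    while len(emitted) < num_files * batches_per_file:
        tick += 1
        # Every file keeps reading, regardless of which file is current.
        for f in range(num_files):
            if produced[f] < batches_per_file and tick % ticks_per_batch[f] == 0:
                queued[f].append((f, produced[f]))
                produced[f] += 1
        # Emit strictly in file order.
        while current < num_files:
            if queued[current]:
                emitted.extend(queued[current])
                queued[current].clear()
            if produced[current] == batches_per_file:
                current += 1
            else:
                break
        peak_queued = max(peak_queued, sum(len(q) for q in queued))
    return emitted, peak_queued
```

   With the default parameters the output is correctly ordered, but the peak number of queued batches is 15, which is every batch of files 1 through 3.  Nothing in this model pauses the readers of the later files, which is the kind of gap the paragraph above describes.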
   



