westonpace opened a new pull request #11285: URL: https://github.com/apache/arrow/pull/11285
This PR adds backpressure back into the asynchronous scanner. It creates an AsyncToggle which can be shared between the push-based sink and the pull-based scanner. The sink closes the toggle when its buffer fills up, and the scanner pauses delivering items while the toggle is closed.

This PR adds the feature in a way that bypasses the exec plan's backpressure mechanisms, as those have not been fully fleshed out and I am still not sure what direction we are planning to go with them. Instead, the backpressure is handled almost completely outside of the compute space.

I've got the same mechanism working for dataset writes, but I don't want to hold up this PR while I wait for the write node to merge, so I have created ARROW-14191 to track that work.

Currently backpressure is broken for ordered scans. It turns out this has always been the case for the asynchronous scanner, even before it moved to the exec plan. The root cause is that the merge generator will keep reading from files 2-N if the read on file 1 is slow. I have created a test case which demonstrates this but will defer the fix to ARROW-14192.
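
For readers who want a feel for the toggle idea, here is a minimal sketch in Python's asyncio. It is not the code in this PR and not Arrow's API: `ToggleSketch`, `BufferedSink`, `scan`, `consume`, and `run_demo` are hypothetical names made up for illustration, and the reopen rule (reopen as soon as the buffer drops below capacity) is an assumption. It only shows the general shape: a push-based sink closes a shared toggle when its buffer is full, and a pull-based producer waits on that toggle before delivering the next item.

```python
import asyncio
from collections import deque


class ToggleSketch:
    """Hypothetical stand-in for the toggle described above; starts open."""

    def __init__(self):
        self._open = asyncio.Event()
        self._open.set()

    def close(self):
        self._open.clear()

    def open(self):
        self._open.set()

    async def when_open(self):
        # Returns immediately if open, otherwise waits until reopened.
        await self._open.wait()


class BufferedSink:
    """Push-based sink that closes the shared toggle when its buffer is full."""

    def __init__(self, toggle, capacity):
        self.toggle = toggle
        self.capacity = capacity
        self.buffer = deque()
        self.max_buffered = 0  # high-water mark, for illustration only

    def push(self, item):
        self.buffer.append(item)
        self.max_buffered = max(self.max_buffered, len(self.buffer))
        if len(self.buffer) >= self.capacity:
            self.toggle.close()

    def pop(self):
        item = self.buffer.popleft()
        # Assumed policy: reopen as soon as there is room again.
        if len(self.buffer) < self.capacity:
            self.toggle.open()
        return item


async def scan(source, sink):
    """Producer side: pause delivery while the toggle is closed."""
    for item in source:
        await sink.toggle.when_open()
        sink.push(item)
        await asyncio.sleep(0)  # yield control so the consumer can run


async def consume(sink, n_items):
    """Deliberately slow consumer that drains the sink one item at a time."""
    received = []
    while len(received) < n_items:
        if sink.buffer:
            received.append(sink.pop())
        await asyncio.sleep(0.001)
    return received


async def run_demo(n_items=20, capacity=4):
    toggle = ToggleSketch()
    sink = BufferedSink(toggle, capacity)
    producer = asyncio.create_task(scan(range(n_items), sink))
    received = await consume(sink, n_items)
    await producer
    return received, sink.max_buffered


if __name__ == "__main__":
    received, max_buffered = asyncio.run(run_demo())
    print(len(received), "items received; buffer high-water mark:", max_buffered)
```

Because the producer checks the toggle before every push, the buffer in this sketch never grows past `capacity` even though the consumer is much slower than the producer. The ordered-scan problem described above is a different situation: there the reads on files 2-N are not gated by what happens to file 1, so a check like this one would not be enough on its own.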
