pepijnve commented on PR #16322: URL: https://github.com/apache/datafusion/pull/16322#issuecomment-2958304164
Waiting for benchmarks results here so I have some time to write up my assessment of what was happening and what has changed. This is just to assist any reviewers, not to replace review or justify the changes. #### Initial test failure The congestion test set up three streams to merge: 1. returns Ready(None) and panics if polled again. This tries to check that None is respected as terminal result 2. return Pending until stream 3 has been polled, then always returns Ready(None) 3. clears congestion for 2 and always returns Ready(None) After the initial change the congestion test started failing. The poll sequence I observed was 1. stream1.poll_next -> None 2. stream2.poll_next -> Pending 3. stream3.poll_next -> Ready(None) and then the test hung. This turned out to be caused by CongestedStream returning pending, but not holding on to the waker to wake it when the congestion clears. The initial phase of SPM was now assuming it would be woken up by any stream that returned pending and that wasn't happening. #### CongestedStream fix I don't believe it's valid for a Stream to return pending but not set up waking, so I fixed that and then got this poll sequence. 1. stream1.poll_next -> None 2. stream2.poll_next -> Pending 3. stream3.poll_next -> Ready(None) 4. stream2.poll_next -> Ready(None) and the test passed. #### Using swap_remove Based on initial review feedback I restored the usage of swap_remove which results in the following sequence. 1. stream1.poll_next -> None 2. stream3.poll_next -> Ready(None) 3. stream2.poll_next -> Ready(None) Due to swap_remove changing the poll order, stream 3 was getting polled before stream 2. As a consequence we never hit the congestion situation and the test case passes, but no longer tests what it says it does. #### Testing congestion again I adapted CongestedStream to keep track of the set of streams that have been polled. Both stream 2 and 3 are now treated as congested unless all streams have been polled. This results in the following sequence. 1. stream1.poll_next -> None 2. stream3.poll_next -> Pending 3. stream2.poll_next -> Ready(None) 4. stream3.poll_next -> Ready(None) The test case now hits the congestion point again. As a bonus the test is no longer dependent on the exact order in which the streams are polled which decouples it from the implementation details a bit more. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org