pantShrey commented on PR #21882: URL: https://github.com/apache/datafusion/pull/21882#issuecomment-4415353823
@alamb Thank you so much for the review! I scoped out the SortMergeJoin migration today, looking specifically at bitwise_stream.rs and process_key_match_with_filter, to see what it would take.

SortMergeJoin currently reads from the spill file in a synchronous for loop inside a hand-rolled poll state machine, so making the read path truly async requires a major rewrite. We can't just .await the stream; instead we would likely need to store the SendableRecordBatchStream in the execution state and manually persist variables like matched_count across Poll::Pending yields.

ParadeDB is hoping to unblock their Postgres integration next week, and I'm worried a state machine rewrite of this scale will stall them. Would you be open to merging this core abstraction first (with open_sync_reader marked as #[deprecated])? I can open a dedicated tracking issue for the SortMergeJoin async migration and tackle it in a fast follow-up PR.

I am happy to defer to your judgment if you feel the tech debt must be addressed first!
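
For reference, the shape of the change described above (keeping the stream in the execution state and carrying matched_count across Poll::Pending) can be sketched roughly as follows. This is a minimal, std-only illustration and not the actual DataFusion code: MockSpillStream, KeyMatchState, poll_matches, drive, and demo are hypothetical names, the mock stream stands in for a SendableRecordBatchStream, and the "even value" test is a placeholder for the real join filter.

```rust
use std::collections::VecDeque;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for an async spill-file stream. Each scripted entry
/// is either a "batch" of row values or `None`, which simulates the reader
/// not being ready yet.
struct MockSpillStream {
    script: VecDeque<Option<Vec<i32>>>,
}

impl MockSpillStream {
    fn poll_next(&mut self, cx: &mut Context<'_>) -> Poll<Option<Vec<i32>>> {
        match self.script.pop_front() {
            Some(Some(batch)) => Poll::Ready(Some(batch)),
            Some(None) => {
                // Ask to be polled again, then yield to the caller.
                cx.waker().wake_by_ref();
                Poll::Pending
            }
            None => Poll::Ready(None), // end of spill data
        }
    }
}

/// State that must survive across polls: the stream itself plus any
/// counters that a synchronous `for` loop would have kept as locals.
struct KeyMatchState {
    spill: MockSpillStream,
    matched_count: usize,
}

impl KeyMatchState {
    /// Resumable replacement for a synchronous loop over the spill file.
    /// On `Pending` we return early; `matched_count` lives in `self`, so
    /// the next call picks up where this one stopped.
    fn poll_matches(&mut self, cx: &mut Context<'_>) -> Poll<usize> {
        loop {
            match self.spill.poll_next(cx) {
                Poll::Ready(Some(batch)) => {
                    // Placeholder filter: count even values as "matches".
                    self.matched_count += batch.iter().filter(|v| **v % 2 == 0).count();
                }
                Poll::Ready(None) => return Poll::Ready(self.matched_count),
                Poll::Pending => return Poll::Pending,
            }
        }
    }
}

/// Waker that does nothing; enough for a busy-polling demo driver.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls the state machine to completion and returns
/// (total matches, number of times it yielded `Pending`).
fn drive(state: &mut KeyMatchState) -> (usize, usize) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut pending_polls = 0;
    loop {
        match state.poll_matches(&mut cx) {
            Poll::Ready(total) => return (total, pending_polls),
            Poll::Pending => pending_polls += 1,
        }
    }
}

/// Three batches interleaved with two simulated "not ready" points.
fn demo() -> (usize, usize) {
    let script = VecDeque::from(vec![
        Some(vec![1, 2, 4]),
        None,
        Some(vec![6, 7]),
        None,
        Some(vec![8]),
    ]);
    let mut state = KeyMatchState {
        spill: MockSpillStream { script },
        matched_count: 0,
    };
    drive(&mut state)
}
```

The point of the sketch is that every local a synchronous loop would hold has to move into the state struct, which is why the real migration touches the whole state machine rather than a single call site.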
