korowa commented on PR #12082: URL: https://github.com/apache/datafusion/pull/12082#issuecomment-2336680062
@comphead I've finally got it: in this case SMJ is trying to produce output for each (streamed, buffered) join key pair. I guess that's how SMJ state management works now: the streamed-side index won't move until all buffered-side data has been processed, since that is required to identify the current ordering.

```
- get current join key from streamed_batch.join_arrays by self.streamed_batch.idx
- find all batches in buffered_data that contain the join key from step 1
- if buffered_data.scanning_batch_idx equals the number of batches from step 2
  and this batch's range.end == num_rows, that probably means SMJ has already
  emitted all the indices from this batch and we are done for that particular key
```

I'd say that normally you don't need to compare join keys, and you should rely on `buffered_data.scanning_finished()` (or `self.current_ordering == Less`), but in your example both of these conditions are either not working or not intended to work (I'm not sure which of the two is the correct statement).

I also hope to start spending some time on SMJ due to https://github.com/apache/datafusion/issues/12359
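For illustration only, the three quoted steps can be sketched roughly as below. This is a simplified model, not the actual DataFusion code: `StreamedBatch`, `BufferedBatch`, `BufferedData` and `key_fully_emitted` are hypothetical stand-ins, the join key is a single `i32` rather than `join_arrays`, "this batch" is assumed to mean the last batch containing the key, and treating "no matching batch" as done is also an assumption.

```rust
use std::ops::Range;

/// Hypothetical, simplified stand-in for one buffered-side batch.
struct BufferedBatch {
    /// Join key of every row in the batch (single i32 key for brevity).
    keys: Vec<i32>,
    /// Rows of this batch matched against the current streamed key.
    range: Range<usize>,
}

impl BufferedBatch {
    fn num_rows(&self) -> usize {
        self.keys.len()
    }
}

/// Hypothetical stand-in for the buffered-side state.
struct BufferedData {
    batches: Vec<BufferedBatch>,
    scanning_batch_idx: usize,
}

/// Hypothetical stand-in for the streamed-side state.
struct StreamedBatch {
    keys: Vec<i32>,
    idx: usize,
}

/// Heuristic from the quoted steps: has everything for the current
/// streamed join key (probably) been emitted already?
fn key_fully_emitted(streamed: &StreamedBatch, buffered: &BufferedData) -> bool {
    // Step 1: current join key on the streamed side.
    let key = streamed.keys[streamed.idx];

    // Step 2: all buffered batches that contain this key.
    let matching: Vec<&BufferedBatch> = buffered
        .batches
        .iter()
        .filter(|b| b.keys.contains(&key))
        .collect();

    // Step 3: scanning has moved past every matching batch, and the last
    // matching batch was consumed up to its final row.
    match matching.last() {
        Some(last) => {
            buffered.scanning_batch_idx == matching.len()
                && last.range.end == last.num_rows()
        }
        // Assumption: with no matching batch there is nothing left to emit.
        None => true,
    }
}
```

With two buffered batches that both hold the key, this returns `true` only once `scanning_batch_idx` reaches 2 and the last batch's `range.end` equals its row count. The sketch compares join keys explicitly, which is exactly what relying on `buffered_data.scanning_finished()` or `self.current_ordering == Less` would normally make unnecessary.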