comphead commented on PR #12082:
URL: https://github.com/apache/datafusion/pull/12082#issuecomment-2336789738

> @comphead I've finally got it -- it's like in this case SMJ is trying to produce output for each join key pair (streamed-buffered) -- I guess it's how SMJ state management works now -- the streamed-side index won't move until all buffered-side data has been processed, since it's required to identify the current ordering.
>
> ```
> - get current join key from streamed_batch.join_arrays by self.streamed_batch.idx
> - find all batches in buffered_data that contain the join key from step 1
> - if the buffered_data.scanning_batch_idx equals the number of batches from step 2 and this batch's range.end == num_rows, that probably means SMJ has already emitted all the indices from this batch and we are done for that particular key
> ```
>
> I'd say that normally you don't need to compare join keys, and you should rely on `buffered_data.scanning_finished()` (or `self.current_ordering == Less`), but in your example both of these conditions are either not working, or not intended to work (not sure which of these two is a correct statement).
>
> I also hope to start spending some time on SMJ due to #12359

Thanks @korowa. I have been experimenting a lot with different parts of SMJ, and it showed that `buffered_data.scanning_finished()` is not working. We also cannot rely on `self.current_ordering == Less` in `freeze_streamed`, because `freeze_streamed` is called only if `self.current_ordering == Equal`. Now I'm trying to work out whether it's possible to predict that the ordering is going to change from `Equal` to `Less`. And yes, I was also trying to compare join arrays, which could potentially give us a clue that everything has been processed, but it might be very expensive.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org