comphead commented on PR #12082:
URL: https://github.com/apache/datafusion/pull/12082#issuecomment-2336789738

   > @comphead I've finally got it -- in this case SMJ is trying to 
produce output for each join key pair (streamed-buffered). I guess that's how 
SMJ state management works now: the streamed-side index won't move until all 
buffered-side data has been processed, since that's required to identify the 
current ordering.
   > 
   > ```
   > - get the current join key from streamed_batch.join_arrays at 
self.streamed_batch.idx
   > - find all batches in buffered_data that contain the join key from step 1
   > - if buffered_data.scanning_batch_idx equals the number of batches from 
step 2 and this batch's range.end == num_rows, that probably means SMJ has 
already emitted all the indices from this batch and we are done for this 
particular key
   > ```
   > 
   > I'd say that normally you don't need to compare join keys, and you should 
rely on `buffered_data.scanning_finished()` (or `self.current_ordering == 
Less`), but in your example both of these conditions are either not working, or 
not intended to work (not sure which of these two is a correct statement).
   > 
   > I also hope to start spending some time on SMJ due to #12359
   
   Thanks @korowa. I have been experimenting a lot with different parts of SMJ, 
and it showed that:
   `buffered_data.scanning_finished()` is not working; 
   `self.current_ordering == Less` cannot be relied on in `freeze_streamed`, 
as `freeze_streamed` is called only if `self.current_ordering == Equal`. Now 
I'm trying to work out whether it's possible to predict that the ordering is 
going to change from `Equal` to `Less`.
   
   And yes, I was also trying to compare join arrays, which could potentially 
give us a clue that everything has been processed, but it might be very 
expensive.
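
For illustration only, the three-step check quoted above could be sketched 
on toy data roughly like this. Every name here (`ToyBufferedBatch`, 
`ToyBufferedData`, `key_fully_emitted`) is hypothetical and is not 
DataFusion's actual SMJ state; the key is a single `i64` column, and "this 
batch" is assumed to mean the last batch that contains the key.

```rust
/// Toy stand-in for one buffered-side batch (hypothetical, not a DataFusion type).
struct ToyBufferedBatch {
    /// Join-key value of each row; a single i64 key column for simplicity.
    keys: Vec<i64>,
    /// Exclusive end of the row range already scanned in this batch.
    range_end: usize,
}

/// Toy stand-in for the buffered-side state (hypothetical).
struct ToyBufferedData {
    batches: Vec<ToyBufferedBatch>,
    /// Assumed to count the batches scanned so far for the current key.
    scanning_batch_idx: usize,
}

/// Returns true if, under the assumptions above, every buffered row for the
/// current streamed key looks like it has already been emitted.
fn key_fully_emitted(
    streamed_keys: &[i64],
    streamed_idx: usize,
    buffered: &ToyBufferedData,
) -> bool {
    // Step 1: current join key on the streamed side.
    let key = streamed_keys[streamed_idx];

    // Step 2: all buffered batches that contain this key.
    // This touches every buffered row, which is the expensive part.
    let matching: Vec<&ToyBufferedBatch> = buffered
        .batches
        .iter()
        .filter(|b| b.keys.contains(&key))
        .collect();

    // Step 3: all matching batches were scanned, and the last one was
    // scanned to its final row.
    match matching.last() {
        Some(last) => {
            buffered.scanning_batch_idx == matching.len()
                && last.range_end == last.keys.len()
        }
        // No buffered batch holds the key: nothing to emit (a choice made here).
        None => true,
    }
}
```

Step 2 walks all buffered rows on every call, which is why comparing join 
keys this way is likely to be costly on large inputs.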


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

