comphead commented on code in PR #13469:
URL: https://github.com/apache/datafusion/pull/13469#discussion_r1847120705


##########
datafusion/physical-plan/src/joins/sort_merge_join.rs:
##########
@@ -68,8 +68,43 @@ use crate::{
     RecordBatchStream, SendableRecordBatchStream, Statistics,
 };
 
-/// join execution plan executes partitions in parallel and combines them into 
a set of
-/// partitions.
+/// Join execution plan that executes equi-join predicates on multiple 
partitions using Sort-Merge
+/// join algorithm and applies an optional filter post join. Can be used to 
join arbitrarily large
+/// inputs where one or both of the inputs don't fit in the available memory.
+///  
+/// # Join Expressions
+///
+/// Equi-join predicate (e.g. `<col1> = <col2>`) expressions are represented 
by [`Self::on`].
+///
+/// Non-equality predicates, which can not be pushed down to join inputs (e.g.
+/// `<col1> != <col2>`) are known as "filter expressions" and are evaluated
+/// after the equijoin predicates. They are represented by [`Self::filter`]. 
These are optional
+/// expressions.
+///
+/// # Sorting
+///
+/// Assumes that both the left and right input to the join are pre-sorted. It 
is not the
+/// responisibility of this execution plan to sort the inputs.
+///
+/// # "Streamed" vs "Buffered"
+///
+/// Buffered input is buffered for all record batches having the same value of 
join key.
+/// If the memory limit increases beyond the specified value and spilling is 
enabled,
+/// buffered batches could be spilled to disk. If spilling is disabled, the 
execution
+/// will fail under the same conditions. Multiple record batches of buffered 
could be
+/// present in memory/disk during the exectution.
+///
+/// Only one record batch of streamed input will be present in the memory at 
all times. There is no
+/// spilling support for streamed input. The comparisons are performed from 
values of join keys in
+/// streamed input with the values of join keys in buffered input. One row in 
streamed record
+/// batch could be matched with multiple rows in buffered input batches.
+///
+/// Depending on the type of join left or right input may be selected as 
streamed or buffered
+/// respectively. For example, in a left-outer join, the left execution plan 
will be selected as
+/// streamed input.
+///
+/// Reference for the algorithm:
+/// <https://en.wikipedia.org/wiki/Sort-merge_join>

Review Comment:
   ```suggestion
   /// <https://en.wikipedia.org/wiki/Sort-merge_join>
   ///
   /// Helpful short video demonstration
   /// https://www.youtube.com/watch?v=jiWCPJtDE2c
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to