athultr1997 commented on code in PR #13469:
URL: https://github.com/apache/datafusion/pull/13469#discussion_r1847572086


##########
datafusion/physical-plan/src/joins/sort_merge_join.rs:
##########
@@ -68,8 +68,46 @@ use crate::{
     RecordBatchStream, SendableRecordBatchStream, Statistics,
 };
 
-/// join execution plan executes partitions in parallel and combines them into 
a set of
-/// partitions.
+/// Join execution plan that executes equi-join predicates on multiple 
partitions using Sort-Merge
+/// join algorithm and applies an optional filter post join. Can be used to 
join arbitrarily large
+/// inputs where one or both of the inputs don't fit in the available memory.
+///  
+/// # Join Expressions
+///
+/// Equi-join predicate (e.g. `<col1> = <col2>`) expressions are represented 
by [`Self::on`].
+///
+/// Non-equality predicates, which can not be pushed down to join inputs (e.g.
+/// `<col1> != <col2>`) are known as "filter expressions" and are evaluated
+/// after the equijoin predicates. They are represented by [`Self::filter`]. 
These are optional
+/// expressions.
+///
+/// # Sorting
+///
+/// Assumes that both the left and right input to the join are pre-sorted. It 
is not the
+/// responisibility of this execution plan to sort the inputs.
+///
+/// # "Streamed" vs "Buffered"
+///
+/// Buffered input is buffered for all record batches having the same value of 
join key.
+/// If the memory limit increases beyond the specified value and spilling is 
enabled,
+/// buffered batches could be spilled to disk. If spilling is disabled, the 
execution

Review Comment:
   One has to enable the disk manager when creating the `RuntimeEnv` during the 
creation of `TaskContext`.
   ```
   let runtime = RuntimeEnvBuilder::new()
               .with_memory_limit(100, 1.0)
               .with_disk_manager(DiskManagerConfig::NewOs)
               .build_arc()?;
   ```
   I am not sure if we should mention it here, since this is all the way over 
in `RuntimeEnv` and there are multiple strategies to enable the disk manager.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to