comphead commented on PR #16660:
URL: https://github.com/apache/datafusion/pull/16660#issuecomment-3184379239

   @jonathanc-n please correct my understanding of PMJ join, its fairly new to 
me. 
   
   > The PiecewiseMergeJoin is specifically designed for scenarios with only 
one range filter using operators like <, <=, >, and > >=. It achieves 
significant performance improvements by:
   > 
   > Buffering one side: The right side (buffered) is loaded into memory and 
must be sorted
   > Streaming the other side: The left side (streamed) is processed 
incrementally and sorted during executions
   
   On a separate note would that possible to find a formula to calculate cost ? 
Reg to https://cs186berkeley.net/resources/static/notes/n09-Joins.pdf
   for SMJ it is 
   ```
   average I/O cost is: cost to sort R + cost to sort S +
   ([R] + [S]) (though it is important to note that this is not the worst 
case!). In the worst case, if
   each record of R matches every record of S, the last term becomes |R|∗[S]. 
The worst case cost is
   then: cost to sort R + cost to sort S + ([R] + |R|∗[S]). That generally 
doesn’t happen, though).
   ```
   
   for simple case NLJ, without optimizations(no left prebuffering, lookup S on 
every row from R)
   
   ```
   The I/O cost of this would then be [R]+|R|[S],
   where [R] is the number of pages in R and |R| is the number of records in R
   ```
   
   Having a cost would give people more understanding the benefits of using PMJ


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to