comphead commented on PR #16660: URL: https://github.com/apache/datafusion/pull/16660#issuecomment-3184379239
@jonathanc-n please correct my understanding of PMJ join, its fairly new to me. > The PiecewiseMergeJoin is specifically designed for scenarios with only one range filter using operators like <, <=, >, and > >=. It achieves significant performance improvements by: > > Buffering one side: The right side (buffered) is loaded into memory and must be sorted > Streaming the other side: The left side (streamed) is processed incrementally and sorted during executions On a separate note would that possible to find a formula to calculate cost ? Reg to https://cs186berkeley.net/resources/static/notes/n09-Joins.pdf for SMJ it is ``` average I/O cost is: cost to sort R + cost to sort S + ([R] + [S]) (though it is important to note that this is not the worst case!). In the worst case, if each record of R matches every record of S, the last term becomes |R|∗[S]. The worst case cost is then: cost to sort R + cost to sort S + ([R] + |R|∗[S]). That generally doesn’t happen, though). ``` for simple case NLJ, without optimizations(no left prebuffering, lookup S on every row from R) ``` The I/O cost of this would then be [R]+|R|[S], where [R] is the number of pages in R and |R| is the number of records in R ``` Having a cost would give people more understanding the benefits of using PMJ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org