2010YOUY01 commented on PR #12754: URL: https://github.com/apache/datafusion/pull/12754#issuecomment-2426446395
Really impressive work! 1. I suggest opening another PR for benchmarks only, it can get merged easily and also help attract more attention. 2. I have a question: (just skimmed through the duckdb [blog](https://duckdb.org/2022/05/27/iejoin.html), haven't fully understood the algorithm, please correct me if I'm wrong) My understandings on how IEJoin works: For query in above benchmark's example ```sql SELECT r.id, s.id FROM 'employees.parquet' r, 'employees.parquet' s WHERE r.salary < s.salary AND r.tax > s.tax ``` Conceptually it's executing in 3 steps: 1. Sort (r union s) on `salary`, add one column for `salary_rank` 2. Sort (r union s) again on `tax`, the `salary_rank` in step 1 gets permutated into _Permutation Array_ 3. Construct a bit array (same length as (r union s)), and do nest loop on it using _Permutation Array_ information, to find matches It's still _input_size^2_ complexity, however the _N^2_ step (step 3) is just looping through a bit array, it's way more efficient than do _N^2_ join condition evaluation if it's using NLJ, to make this implementation efficient. This PR only use datafusion's existing sort executer to do step 1, this can be parallelized to run on multiple partitions, and everything else is left inside new `IEJoin` executor, including union all two input tables, step 2 sorting, and looping over bit vector. Is it possible to extract the second sorting outside `IEJoin` executor, and let the existing sort executor to do this? This step then can be more easily parallelized, or does `IEJoin` executor have a better way to parallelize its job? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org