2010YOUY01 commented on PR #12754:
URL: https://github.com/apache/datafusion/pull/12754#issuecomment-2426446395

   Really impressive work!
   
   1. I suggest opening another PR for benchmarks only, it can get merged 
easily and also help attract more attention.
   
   2. I have a question: (just skimmed through the duckdb 
[blog](https://duckdb.org/2022/05/27/iejoin.html), haven't fully understood the 
algorithm, please correct me if I'm wrong)
   
   My understandings on how IEJoin works:
   For query in above benchmark's example
   ```sql
   SELECT r.id, s.id
        FROM 'employees.parquet' r, 'employees.parquet' s
        WHERE r.salary < s.salary AND r.tax > s.tax
   ```
   Conceptually it's executing in 3 steps:
   1. Sort (r union s) on `salary`, add one column for `salary_rank`
   2. Sort (r union s) again on `tax`, the `salary_rank` in step 1 gets 
permutated into _Permutation Array_
   3. Construct a bit array (same length as (r union s)), and do nest loop on 
it using _Permutation Array_ information, to find matches
   
   It's still _input_size^2_ complexity, however the _N^2_ step (step 3) is 
just looping through a bit array, it's way more efficient than do _N^2_ join 
condition evaluation if it's using NLJ, to make this implementation efficient.
   
   This PR only use datafusion's existing sort executer to do step 1, this can 
be parallelized to run on multiple partitions, and everything else is left 
inside new `IEJoin` executor, including union all two input tables, step 2 
sorting, and looping over bit vector.
   Is it possible to extract the second sorting outside `IEJoin` executor, and 
let the existing sort executor to do this? This step then can be more easily 
parallelized, or does `IEJoin` executor have a better way to parallelize its 
job?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to