[I] [DISCUSSION] JOIN "task force" / project team [datafusion]

via GitHub Mon, 28 Apr 2025 14:52:54 -0700


alamb opened a new issue, #15885:
URL: https://github.com/apache/datafusion/issues/15885


   # What I see (what problem we are trying to solve)
   DataFusion's current join implementations are fairly basic. They are 
functional enough to run TPCH and TPC-DS, but lack other features such as 
larger-than-memory processing, ASOF joins, complete subquery support and more. 
   
   There seems to be a non-trivial desire in the community to improve this. 
   
   Some examples of issues / tickets related to enhanced join support / 
features:
   
   ## Subqueries (which are implemented as joins)
   - [ ] https://github.com/apache/datafusion/issues/5483
   - [ ] https://github.com/apache/datafusion/issues/5492
   - [ ] https://github.com/apache/datafusion/issues/14554
   
   ## Join Features
   - [ ] https://github.com/apache/datafusion/issues/12454
   - [ ] https://github.com/apache/datafusion/issues/15784
   - [ ] https://github.com/apache/datafusion/issues/14239
   - [ ] https://github.com/apache/datafusion/issues/14238
   - [ ] https://github.com/apache/datafusion/issues/13765
   - [ ] https://github.com/apache/datafusion/issues/13003
   - [ ] https://github.com/apache/datafusion/issues/12952
   - [ ] https://github.com/apache/datafusion/issues/10048
   
   ## Specialized Joins
   - [ ] https://github.com/apache/datafusion/issues/9846
   - [ ] https://github.com/apache/datafusion/issues/318
   - [ ] https://github.com/apache/datafusion/issues/13471
   - [ ] https://github.com/apache/datafusion/issues/13232
   - [ ] https://github.com/apache/datafusion/issues/13181
   - [ ] https://github.com/apache/datafusion/issues/13138
   
   ## Performance
   - [ ] https://github.com/apache/datafusion/issues/15382
   - [ ] https://github.com/apache/datafusion/issues/7955
   - [ ] https://github.com/apache/datafusion/issues/14758
   - [ ] https://github.com/apache/datafusion/issues/13620
   
   # What is blocking significant forward progress
   In my mind, the major challenge is that "improving" `JOIN`s can get 
arbitrarily complicated. There are dozens of academic papers each year on 
various aspects of join implementations, and designing / implementing join 
capabilities is a substantial engineering effort. 
   
   To give some sense of the scale: I spent 6 years of my life doing joins at 
Vertica, where they accounted for around 50% of the optimizer's complexity.
   
   I don't think the issue is that any particular feature is super complicated 
to understand. Rather, I think that defining the overall goal, designing the 
framework that will accommodate that goal, and then breaking the work down 
into implementable pieces will require both specialized knowledge and 
substantial time. 
   
   
   ## What I suggest
   
   I suggest that people with the relevant skills and time to invest gather 
together to drive this process forward:
   1. plan out a "join roadmap" (aka prioritize what join features they will 
push forward)
   2. Figure out what, if any, new structures need to be in place
   3. Start breaking it down into smaller tickets

   I can't personally lead such an effort, but I am filing this ticket to try 
and help connect the relevant people in the community who can. 
   
   Some potential people that could help (sorry if I didn't list you)
   * @duongcongtoai -- the discussion on 
https://github.com/apache/datafusion/issues/14554#issuecomment-2798943345
   * @xudong963  who has experience in this area
   * @Dandandan @comphead and @korowa  who contributed substantially to the 
existing joins
   * @mingmwang and @jackwener  who contributed significantly to the original 
subquery implementation
   * @liukun4515 who likewise helped significantly
   
   ## Related content:
   Related blogs (join ordering section in part 2):  
https://www.influxdata.com/blog/optimizing-sql-dataframes-part-two/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


