alamb opened a new issue, #17719: URL: https://github.com/apache/datafusion/issues/17719
This is part of a larger discussion for improving the APIs for Join planning in DataFusion - https://github.com/apache/datafusion/issues/17718 Specifically, many more sophisticated join ordering algorithms rely on a data structure called a "JoinGraph" (defined below) In DataFusin terms, a JoinGraph could be formed from any tree of `LogicalPlan::Join` or corresponding PhysicalPlan Some ideas: ```rust /// represents a join graph struct JoinGraph { ... } impl JoinGraph { // make a new join graph from a plan // NOTE this maybe should be in terms of physical plans / `Arc<dyn ExecutionPlan>` fn new_from_plan(logical_plan: LogicalPlan) -> Result<Self> {..} ... // create a new plan from a join grap fn into_plan(self) -> Result<LogicalPlan> } ``` ### Join Graphs Quoting a pretty good [Google AI summary](https://www.google.com/search?q=joingraph+data+structure+database+definition+%28not+graph+database): A join graph in the context of relational databases is a data structure that represents the relationships and potential join operations between tables in a relational database schema. Here's a breakdown of its definition: * [Vertices (Nodes): ](https://www.google.com/search?sca_esv=3cba3ff7c6207a53&cs=1&q=Vertices+%28Nodes%29)Each vertex in a join graph corresponds to a table (or relation) in the relational database schema. * [Edges: ](https://www.google.com/search?sca_esv=3cba3ff7c6207a53)An edge between two vertices (tables) signifies a join relationship between those tables (e.g an equality predicate). * [Edge Labels (Join Conditions): ](https://www.google.com/search?sca_esv=3cba3ff7c6207a53&cs=1&q=Edge+Labels+%28Join+Conditions%29)Edges are labeled with the specific join conditions that define how the connected tables are to be joined. These conditions specify the attributes that must match for a successful join (e.g., TableA.ID = TableB.TableA_ID). Join graphs are used to: * [Plan Query Execution: ](https://www.google.com/search?sca_esv=3cba3ff7c6207a53&cs=1&q=Plan+Query+Execution)Database query optimizers can use join graphs to identify efficient ways to execute queries involving multiple joins. For example, TPCH Q3 is expressed in SQL like this: ```sql select l_orderkey, sum(l_extendedprice * (1 - l_discount)) as revenue, o_orderdate, o_shippriority from customer, orders, lineitem where c_mktsegment = 'BUILDING' and c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate < date '1995-03-15' and l_shipdate > date '1995-03-15' group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, o_orderdate limit 10; ``` It can be represented with the following JoinGraph: ```mermaid graph TD customer["customer"] orders["orders"] lineitem["lineitem"] customer -->|c_custkey = o_custkey| orders orders -->|o_orderkey = l_orderkey| lineitem ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
