LorrensP-2158466 commented on issue #17719:
URL: https://github.com/apache/datafusion/issues/17719#issuecomment-3401503363

   No problem at all! I was actually planning to experiment with the JoinGraph 
myself in the next couple of weeks (without the actual join ordering), so your 
PR should help speed things up!
   
   I’ll look at the PR this week, excluding the actual join ordering 
implementation for now. I’ll first need to read the paper to get a better grasp 
of that part.
   
   > QueryGraph relation to original LogicalPlan
   
   I expect DataFusion itself to handle the construction of the JoinGraph. 
That’s why I think it’s better to make generating the optimized plan the *easy* 
part of the process, this would provide a better user experience for the 
proposed API.
   
   > The QueryGraph decomposes the original LogicalPlan and stores parts of it.
   
   This ties into your second point. I ran into a very similar problem while 
implementing a HyperGraph for conjunctive queries (CQs), you can see it here: 
[AcyclicJoinNode](https://github.com/LorrensP-2158466/AcyclicJoinNode).
   
   A bit of context that might be useful: a research group at my university 
recently published a paper proposing a variant of the [Yannakakis Join 
Algorithm](https://en.wikipedia.org/wiki/Yannakakis_algorithm) called 
[**Shredded Yannakakis**](https://arxiv.org/abs/2411.04042).  
   During my bachelor’s thesis, I implemented the logical optimizer for this 
algorithm (which I’ll extend for my master’s thesis/project). To do that, I had 
to construct a HyperGraph of the join (CQ) to determine whether it admitted a 
JoinTree, from which the physical plan could be built.
   
   In that project, I faced a similar challenge to what you described. 
Fortunately, I didn’t have to deal with filter nodes between joins, none of our 
plans had that structure. For projections, I was allowed to skip patterns like 
`Join <- Projection <- Join`, since those weren’t supported or necessary for 
the algorithm.
   
   However, now I’ll need to support Projections**, Filters (selections), and 
other nodes between joins. My plan is to treat them as graph nodes.  
   For example, in a pattern like `Projection <- Join`, the node would contain 
this subplan but still provide easy access to the underlying join information.  
   Having a complete JoinTree, including intermediate nodes, would let me 
construct the physical shredded operators required for the Shredded Yannakakis 
algorithm (though I’m still figuring out if this approach will work).
   
   Now, back to the problem at hand.
   
   > It feels like it is easier to create the QueryGraph from a LogicalPlan if 
the LogicalPlan is in a certain shape.  
   > I was thinking that the construction is easiest if there aren't any other 
LogicalPlan nodes between the join nodes.
   
   Completely agree, that would be the ideal case, making JoinGraph 
construction much more straightforward.
   
   > I'm not sure this is entirely possible but it would be helpful if all 
filters & projections are either pushed down below the joins or pushed up 
before any join.
   
   In my case, that wouldn’t be a big problem. I think we can achieve a similar 
setup in DataFusion as well.  
   Projections and filters are relatively simple to reason about theoretically. 
That said, I think it’s definitely possible to expose a clean API that still 
supports LogicalPlans with filters and projections between joins. The only 
thing is that Filters will affect the overall cost, but I don't think that will 
hinder this
   
   I'll have to work out a design coming weeks on how to do this in my thesis, 
so I'll report back on anything useful!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to