LorrensP-2158466 commented on issue #17719: URL: https://github.com/apache/datafusion/issues/17719#issuecomment-3401503363
No problem at all! I was actually planning to experiment with the JoinGraph myself in the next couple of weeks (without the actual join ordering), so your PR should help speed things up! I’ll look at the PR this week, excluding the actual join ordering implementation for now. I’ll first need to read the paper to get a better grasp of that part. > QueryGraph relation to original LogicalPlan I expect DataFusion itself to handle the construction of the JoinGraph. That’s why I think it’s better to make generating the optimized plan the *easy* part of the process, this would provide a better user experience for the proposed API. > The QueryGraph decomposes the original LogicalPlan and stores parts of it. This ties into your second point. I ran into a very similar problem while implementing a HyperGraph for conjunctive queries (CQs), you can see it here: [AcyclicJoinNode](https://github.com/LorrensP-2158466/AcyclicJoinNode). A bit of context that might be useful: a research group at my university recently published a paper proposing a variant of the [Yannakakis Join Algorithm](https://en.wikipedia.org/wiki/Yannakakis_algorithm) called [**Shredded Yannakakis**](https://arxiv.org/abs/2411.04042). During my bachelor’s thesis, I implemented the logical optimizer for this algorithm (which I’ll extend for my master’s thesis/project). To do that, I had to construct a HyperGraph of the join (CQ) to determine whether it admitted a JoinTree, from which the physical plan could be built. In that project, I faced a similar challenge to what you described. Fortunately, I didn’t have to deal with filter nodes between joins, none of our plans had that structure. For projections, I was allowed to skip patterns like `Join <- Projection <- Join`, since those weren’t supported or necessary for the algorithm. However, now I’ll need to support Projections**, Filters (selections), and other nodes between joins. My plan is to treat them as graph nodes. For example, in a pattern like `Projection <- Join`, the node would contain this subplan but still provide easy access to the underlying join information. Having a complete JoinTree, including intermediate nodes, would let me construct the physical shredded operators required for the Shredded Yannakakis algorithm (though I’m still figuring out if this approach will work). Now, back to the problem at hand. > It feels like it is easier to create the QueryGraph from a LogicalPlan if the LogicalPlan is in a certain shape. > I was thinking that the construction is easiest if there aren't any other LogicalPlan nodes between the join nodes. Completely agree, that would be the ideal case, making JoinGraph construction much more straightforward. > I'm not sure this is entirely possible but it would be helpful if all filters & projections are either pushed down below the joins or pushed up before any join. In my case, that wouldn’t be a big problem. I think we can achieve a similar setup in DataFusion as well. Projections and filters are relatively simple to reason about theoretically. That said, I think it’s definitely possible to expose a clean API that still supports LogicalPlans with filters and projections between joins. The only thing is that Filters will affect the overall cost, but I don't think that will hinder this I'll have to work out a design coming weeks on how to do this in my thesis, so I'll report back on anything useful! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
