Hi Tomas, Thanks a lot for taking another look, for rebasing the patches, and for the detailed high-level feedback.
> On Jun 22, 2026, at 23:33, Tomas Vondra <[email protected]> wrote: > > Hello Chengpeng, > > I finally had some time to look at this patch series again, and I wonder > what are your plans with it for the PG20 cycle. My goal for PG20 is still to make progress toward a practical GEQO replacement for large join problems. I would like to push this work toward something that can be merged, but I agree the final shape does not necessarily have to be the current GOO algorithm. If a LinDP-style approach turns out to be a better fit for postgres, I would be happy to explore that direction too. > FWIW attached is a v6 of the most patches, which is merely v5 rebased to > current master (the was a bit of bitrot). I haven't done any changes. I > have a couple minor comments, but but let me share some high level ideas > first about improvements to join search in general. Thanks for rebasing the series. One thing I am still working on is `pg_plan_advice` support for GOO. I think this is important because large join problems need a practical way to be tuned when the heuristic picks a bad plan, and GEQO has limitations in this area today. > I happened to be reading some recent papers about join order search, > because of a separate problem of replacing join_collapse_limit with > something less blunt (I'll start a separate thread about it). And there > seems to be a number of very promising approaches to join order search, > some of which seem like a better fit than the proposed GOO. > > Let me explain why I think so. > > For starters, consider this "join ordering" CMU lecture [1] from 2025. > That's Andy Pavlo's lecture, which generally are great overviews of the > particular field, including the new ides/approaches. > > At the beginning, it divided the queries by "size" (number of relations > joined, for simplicity), and it maps these cases to approaches: > > 1) small => solve optimally > > 2) medium => search space linearization > > 3) large => graceful greediness > > Our standard_join_search is clearly for the "small" case. I believe GOO > is a "graceful greediness" approach - it certainly does not do any > linearization, at least as I understand it. So it'd be good for "large" > problems (which I think is 1000+ tables). > > The way I understand this, this suggests there are approaches that can > do much better for "medium" queries with up to ~1000 tables (and that's > going to be ~99.999% of user queries). So by going to GOO right away, > we're probably failing to find much better plans. > > In particular, I find the papers about the "LinDP" family of algorithms > extremely intriguing: > > - Adaptive Optimization of Very Large Join Queries (Radke, Neumann), > 2018 [2] > > - LinDP++: Generalizing Linearized DP to Crossproducts and Non-Inner > Joins (Radke, Neumann), 2019 [3] > > - Optimizing Linearized Join Enumeration by Adapting to the Query > Structure (Birler, Stoian, Neumann), 2025 [4] > > Highly recommended papers, BTW. It was a fun weekend read. T. Neumann is > a guarantee of a great read. > > > So maybe we should try implementing one of these algorithms first? > > I don't actually think it'd be more code than the current GOO code, and > it seems to have a reasonably sound theory behind it. > > Of course, we may still need some "fallback" mode with heuristics, for > joins too large even for the LinDP algorithms. So maybe that's where GOO > would come into play? But do we have an idea how large joins it can > actually solve in practice? Maybe the LinDP algorithms are universally > better, even for large joins? I agree with this framing, and the LinDP family does look very promising. It may well be a better fit for the "medium" range than going directly from DP to a pure greedy algorithm. My reason for starting with GOO was mostly pragmatic. It is simple, direct, and small enough to keep the integration and review surface manageable. It also gives us a concrete baseline for the "graceful greediness" side of the design space, either as a possible fallback for very large joins or as a reference point for comparing more advanced algorithms. I do not yet have a strong answer for how large a join problem GOO can handle well in practice. That is one of the main things the benchmark work needs to answer. I also think getting the benchmark methodology right is important before comparing GOO with a more sophisticated algorithm. BTW, I also think constructing useful benchmark cases for this problem is itself difficult, and I am still thinking about that. I did skim through some of the papers you mentioned before, but it has been a while. I need to go back and read them more carefully before forming a stronger opinion. At first glance, my main question is not the algorithmic complexity itself, but how naturally LinDP maps onto postgres' current join-search framework. I expect much of the costing and join-legality machinery to remain useful. What I need to understand better is the enumeration side: LinDP is organized around a linearization of the join problem, while postgres's existing join search is mostly built around forming joinrels for relation subsets. Once the current GOO patch is in a more reviewable state, I would be interested in trying a LinDP-style implementation and comparing it against GOO/GEQO under the same benchmark setup. > Now, a couple comments about the v5 (or v6) patch. > > First, I generally like the code. I haven't done a deep thorough review, > but I find it quite readable. I also like how much smaller it is than > the GEQO code - it's 24kB vs. 172kB. Of course, that only matters if GOO > can do good enough job (compared to GEQO). > > While messing with the patch, I fond a relatively simple query that > fails to find a plan: > > > CREATE TABLE t1 (a int, b int); > CREATE TABLE t2 (a int, b int); > CREATE TABLE t3 (a int, b int); > CREATE TABLE t4 (a int, b int); > > SET enable_goo_join_search=on; > SET geqo_threshold = 2; > > SELECT 1 FROM > t2 B, t3 C, > LATERAL (SELECT t1.a, t1.b FROM t1 WHERE t1.b = C.a OFFSET 0) A, > LATERAL (SELECT t4.a, t4.b FROM t4 WHERE t4.b = B.a OFFSET 0) D > WHERE A.a = B.a AND C.a = D.a; > > ERROR: GOO join search failed: all strategies exhausted without a valid > join order > > > I'm not really sure what's causing this, but I see GEQO has this comment > about LATERAL joins: > > * In particular, we might clump together relations that actually > * mustn't be joined yet due to LATERAL restrictions; since there's no > * provision for un-clumping, this must lead to failure. > > Could this be causing issues for GOO? If it merges clumps it shouldn't > have merged, what happens then? Does it just fail? > > IMHO that's not a behavior we could accept. A practical join search > algorithm needs to be able to recover in some way, especially if it's > used as a fallback for joins that are "too large" for the DP case. Yes, I think this is the same class of problem. More generally, heuristic join search methods that commit clumps irreversibly can run into this kind of issue with the current infrastructure, although the exact triggering queries may differ. GOO already uses the existing per-candidate checks, so immediately illegal joins should be rejected. The missing part is that those checks do not prove that committing a locally valid clump still leaves a globally completable join problem. With `LATERAL`, a joinrel can be legally formed while its paths remain parameterized by rels outside the clump. If an irreversible heuristic commits the wrong clump, the remaining clumps may no longer have a legal completion. The possible fixes I see are: 1. Add a stronger candidate-safety check before committing a GOO clump. For `LATERAL`, that means not only asking whether the candidate itself is legal, but also whether committing it leaves the remaining clumps with a legal continuation. 2. Add limited backtracking, so that if the greedy path reaches a dead end, GOO can undo one or more committed clumps and try the next candidate. 3. Fall back to GEQO when GOO cannot complete a join order. This may be useful as a safety valve, and GEQO can produce a plan for this particular case. But I do not think it fully solves the underlying problem, because the GEQO comment points out a similar limitation of irreversible heuristic clumping. In the current patch I made this fail hard instead of silently falling back to GEQO for two reasons: first, to expose these GOO failure cases during review; and second, because if we eventually want another algorithm to replace GEQO, depending on GEQO as the long-term recovery path may not be the right maintenance model. My current inclination is toward the first option, because it tries to address the failure mode inside GOO instead of just hiding it behind a fallback. But I need to study whether such a check is feasible and how difficult it would be to implement. I am not yet sure whether a LinDP-style approach would naturally avoid this, and I need to study that more before making a stronger claim. -- Best regards, Chengpeng Yan
