Hi Henson, On Sat, May 9, 2026 at 10:52 PM Henson Choi <[email protected]> wrote: > > Hi Junwang, > > Thanks for an interesting thread, and for picking this up — it has > been a useful occasion to re-examine assumptions I had carried over > from a different graph model. > > My background is on AgensGraph, where labels form an inheritance > hierarchy (rather than Neo4j-style multi-label tag sets), so a path > pattern rewrites against a single parent table per position and the > fan-out across labels is left to the planner's existing inheritance > machinery. The rewriter never enumerates per-element combinations, so > an N^K blow-up like Satya's was simply not on the radar there.
Thanks for the input, good to know. > > SQL/PGQ instead designates graph identity through the schema itself, > so each pattern position carries a set of candidate element tables > and the rewriter has to materialize each schema-feasible join > combination — a (vertex, edge, vertex, ...) sequence of element > tables — as its own SELECT branch, with all such branches joined by > UNION ALL. The number of *feasible* join combinations is bounded by > the schema graph's connectivity and is usually small; the current > rewriter's naive shape, however, is to enumerate the full N^K > candidate space first and reject infeasible combinations only > afterwards. The patch tackles exactly that part. > > To be clear, I do not mean this as a criticism of the original > implementer. For sizeable features I tend to prefer a deliberately > naive implementation first — getting the full functionality working > end-to-end before optimizing — and that approach is even more > valuable in team settings, since a working baseline lets additional > contributors join much earlier. The current shape of > generate_queries_for_path_pattern_recurse() reads exactly like such a > baseline, and this patch is precisely the kind of focused follow-up > improvement it invites. > > When I thought about this independently, I had reached for the dual > formulation — "extend the prefix only along feasible edges" rather > than "enumerate everything and prune" — and my intuition was that the > two should produce the same surviving set, though I have not worked > that through rigorously. Either way, your "early pruning" framing is > the better one for the patch itself: it stays close to the existing > code structure and lets the change read as a small refinement of > generate_queries_for_path_pattern_recurse() rather than a > reorganization of how candidates are walked. That makes review and > incremental landing materially easier. > > The same reframing also explains, after the fact, why the discrepancy > between Ashutosh's quick runs and Satya's 81 GB case was so large: > the cost being measured is in large part infeasible enumeration, > which a forward-moved check eliminates regardless of pattern length > or schema size in realistic graphs. > > I have not gone through the patch at the code level yet, but the > shape of the problem and of your fix is clear enough from the > description. My personal take, with that caveat, is that early > pruning, together with the CHECK_FOR_INTERRUPTS that Satya already > proposed in the earlier thread [1], feels like the right pair to > land first; the explicit hard limit, in turn, is probably best > revisited only if a real case turns up where even those two combined > are not enough. That keeps the immediate change narrow and > semantics-preserving while leaving the policy question open for > evidence rather than upfront estimation. Agreed, I add a CHECK_FOR_INTERRUPTS to the patch now, see the attached. > > There is a fairly recent precedent in PostgreSQL itself that I think > supports this ordering. I have been following — and participating in > — Tatsuo Ishii's Row Pattern Recognition thread [2] since earlier > this year, and that work ran into a structurally similar concern: in > its earlier implementation the engine could undergo a combinatorial > explosion of pattern-candidate states. A memory-based limit was > floated as one mitigation, but it never actually converged into a > design — the conversation around an explicit limit simply faded out > as a layered set of pruning and early-termination rules accumulated > incrementally and brought the state space back under control. CFI > and stack-depth checks, by contrast, were treated as independently > necessary throughout and stayed in the design. > > The shape of this thread feels analogous, and the same staged > approach (structural fix + CFI first, hard limit only if a residual > case genuinely demands it) is, I suspect, what will hold up > here as well. Thanks for the insightful analysis, it's extremely helpful. > > I'll set aside time separately to do a proper line-by-line review and > follow up. > > Regards, > Henson > > [1] > https://www.postgresql.org/message-id/CAHg%2BQDe8JU%2BERqA2xwjrg_ZptvH_v0T6PS9_P_ZgyYzD5h-Grw%40mail.gmail.com > [2] > https://www.postgresql.org/message-id/flat/20230625.210509.1276733411677577841.t-ishii%40sranhm.sra.co.jp -- Regards Junwang Zhao
v1-0001-Prune-non-matching-graph-path-prefixes-during-DFS.patch
Description: Binary data
