jrgemignani commented on issue #2349: URL: https://github.com/apache/age/issues/2349#issuecomment-3999715962
@gregfelice The following is from Opus, because it can explain it better than I, as it has been a while since I wrote the VLE. ## Analysis of Issue #2349: VLE path expansion cycle detection claims ### Summary Issue #2349 claims that AGE's VLE (variable-length edge) expansion has O(n^k) complexity with "no cycle detection" and proposes adding a visited-node set. After a thorough review of the entire VLE implementation (`src/backend/utils/adt/age_vle.c`, 2692 lines), **the claims are substantially incorrect**. AGE's VLE already implements cycle prevention via edge-uniqueness enforcement — consistent with the openCypher specification. The performance concern is real but misdiagnosed. --- ### How VLE Actually Works The VLE implementation uses a **DFS (depth-first search) with per-path edge-uniqueness enforcement**, not unchecked recursive expansion. **Core data structures** (age_vle.c lines 42–97): - `edge_state_hashtable` — per-edge state tracking (100K initial hash buckets) - `edge_state_entry.used_in_path` — boolean flag marking edges currently in the active DFS path - `dfs_path_stack` — the current path being explored - `dfs_edge_stack` — working stack of candidate edges - `dfs_vertex_stack` — parent vertex tracking (used only for undirected traversal) **Cycle prevention operates at two levels:** 1. **In `add_valid_vertex_edges()`** (lines 1206–1367): Before adding an edge to the exploration stack, the code checks `ese->used_in_path`. If the edge is already in the current path, it is skipped. For small paths (<10 edges), a fast linear scan via `is_edge_in_path()` is used before falling through to the hash lookup. 2. **In `dfs_find_a_path_between()` and `dfs_find_a_path_from()`** (lines 933–1176): When an edge is popped that has `used_in_path == true`, the algorithm recognizes it is **backtracking** — it pops the edge from the path stack, resets `used_in_path = false`, and continues to the next candidate. This is textbook DFS backtracking with edge-uniqueness. 3. **Upper bound enforcement**: Both DFS functions check `get_stack_size(path_stack) < vlelctx->uidx` before expanding deeper. A pattern like `[*1..3]` is strictly bounded at depth 3. --- ### Point-by-Point Response to Issue Claims | Claim | Verdict | Explanation | |---|---|---| | "No cycle detection" | **False** | Edge-uniqueness is enforced via `used_in_path` flag + hash table. No edge can appear twice in any single path. | | "O(n^k) complexity" | **Misleading** | Bounded by C(E,k) (edges choose k), not n^k. For bounded VLE like `[*1..3]`, total work is O(E · d²) where d is average degree. | | "UNION ALL semantics internally" | **Incorrect** | VLE is a custom DFS implemented as a PG Set-Returning Function (SRF). It does not use recursive CTEs. | | "Visits same nodes exponentially" | **Partially true** | Vertices CAN be revisited via different edges — this is **correct per the openCypher specification**, which mandates edge-isomorphism (no repeated edges), not vertex-isomorphism (no repeated vertices). | | "Fix: add visited-node set" | **Would break semantics** | Tracking visited nodes would change VLE from edge-unique to vertex-unique traversal, violating openCypher edge-isomorphism rules and producing incorrect results for queries that need all edge-unique paths. | --- ### The Benchmark Comparison Is Invalid The issue compares VLE (~23ms) against a recursive CTE with `UNION` (~0.08ms), but these compute **fundamentally different things**: - **VLE** returns **all edge-unique paths** — each as a full path object with vertices, edges, and properties. A `MATCH (n)-[*1..3]->(m) RETURN count(m)` enumerates every distinct path. - **Recursive CTE with UNION** returns **only reachable node IDs** with deduplication. It collapses all paths to the same node into one row. These are not equivalent operations. The CTE answers "which nodes are reachable?" while VLE answers "what are all the paths?" The performance difference reflects the algorithmic difference in output cardinality, not a bug. --- ### The Legitimate Performance Concern While the diagnosis is wrong, the underlying concern is valid: **VLE can be expensive on dense cyclic graphs** because edge-unique path enumeration is inherently combinatorial. On a graph where vertex V has high in/out degree and participates in many cycles, VLE will enumerate all edge-unique paths through V — visiting V many times via different edges. This is correct behavior, but expensive. The test graph in AGE's own regression suite (`regress/sql/cypher_vle.sql`) demonstrates this: a 5-vertex graph with self-loops and back-edges produces **7,092 undirected paths** with `[*1..∞]`. This is the expected combinatorial output for edge-unique enumeration over a cycle-rich topology. Additionally, when the upper bound is omitted (`[*]` or `[*1..]`), VLE sets `uidx_infinite = true` and the only termination guarantee is edge-uniqueness depletion. On a graph with |E| edges, the maximum path length is |E|, making the worst case O(|E|!) — factorial. This is a genuine scalability risk. --- ### Correct Characterization AGE's VLE implements **edge-isomorphic DFS path enumeration** as required by the openCypher specification. The complexity for bounded patterns `[*min..max]` is polynomial in graph size. For unbounded patterns, it is exponential in the worst case — but this is inherent to the problem of enumerating all edge-unique paths, not a missing optimization. What the reporter actually wants is **vertex-unique reachability** (BFS), which is a different operation than what VLE provides. The appropriate solutions would be: - Implementing `shortestPath()` / `allShortestPaths()` using BFS (not yet available in AGE) - Adding an optional vertex-unique traversal mode for reachability queries where full path enumeration is not needed - Adding a configurable safety depth cap (GUC) for unbounded VLE to prevent runaway queries on dense graphs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
