jrgemignani commented on issue #2349:
URL: https://github.com/apache/age/issues/2349#issuecomment-3999715962

   @gregfelice The following is from Opus, because it can explain it better 
than I, as it has been a while since I wrote the VLE.
   
   ## Analysis of Issue #2349: VLE path expansion cycle detection claims
   
   ### Summary
   
   Issue #2349 claims that AGE's VLE (variable-length edge) expansion has 
O(n^k) complexity with "no cycle detection" and proposes adding a visited-node 
set. After a thorough review of the entire VLE implementation 
(`src/backend/utils/adt/age_vle.c`, 2692 lines), **the claims are substantially 
incorrect**. AGE's VLE already implements cycle prevention via edge-uniqueness 
enforcement — consistent with the openCypher specification. The performance 
concern is real but misdiagnosed.
   
   ---
   
   ### How VLE Actually Works
   
   The VLE implementation uses a **DFS (depth-first search) with per-path 
edge-uniqueness enforcement**, not unchecked recursive expansion.
   
   **Core data structures** (age_vle.c lines 42–97):
   - `edge_state_hashtable` — per-edge state tracking (100K initial hash 
buckets)
   - `edge_state_entry.used_in_path` — boolean flag marking edges currently in 
the active DFS path
   - `dfs_path_stack` — the current path being explored
   - `dfs_edge_stack` — working stack of candidate edges
   - `dfs_vertex_stack` — parent vertex tracking (used only for undirected 
traversal)
   
   **Cycle prevention operates at two levels:**
   
   1. **In `add_valid_vertex_edges()`** (lines 1206–1367): Before adding an 
edge to the exploration stack, the code checks `ese->used_in_path`. If the edge 
is already in the current path, it is skipped. For small paths (<10 edges), a 
fast linear scan via `is_edge_in_path()` is used before falling through to the 
hash lookup.
   
   2. **In `dfs_find_a_path_between()` and `dfs_find_a_path_from()`** (lines 
933–1176): When an edge is popped that has `used_in_path == true`, the 
algorithm recognizes it is **backtracking** — it pops the edge from the path 
stack, resets `used_in_path = false`, and continues to the next candidate. This 
is textbook DFS backtracking with edge-uniqueness.
   
   3. **Upper bound enforcement**: Both DFS functions check 
`get_stack_size(path_stack) < vlelctx->uidx` before expanding deeper. A pattern 
like `[*1..3]` is strictly bounded at depth 3.
   
   ---
   
   ### Point-by-Point Response to Issue Claims
   
   | Claim | Verdict | Explanation |
   |---|---|---|
   | "No cycle detection" | **False** | Edge-uniqueness is enforced via 
`used_in_path` flag + hash table. No edge can appear twice in any single path. |
   | "O(n^k) complexity" | **Misleading** | Bounded by C(E,k) (edges choose k), 
not n^k. For bounded VLE like `[*1..3]`, total work is O(E · d²) where d is 
average degree. |
   | "UNION ALL semantics internally" | **Incorrect** | VLE is a custom DFS 
implemented as a PG Set-Returning Function (SRF). It does not use recursive 
CTEs. |
   | "Visits same nodes exponentially" | **Partially true** | Vertices CAN be 
revisited via different edges — this is **correct per the openCypher 
specification**, which mandates edge-isomorphism (no repeated edges), not 
vertex-isomorphism (no repeated vertices). |
   | "Fix: add visited-node set" | **Would break semantics** | Tracking visited 
nodes would change VLE from edge-unique to vertex-unique traversal, violating 
openCypher edge-isomorphism rules and producing incorrect results for queries 
that need all edge-unique paths. |
   
   ---
   
   ### The Benchmark Comparison Is Invalid
   
   The issue compares VLE (~23ms) against a recursive CTE with `UNION` 
(~0.08ms), but these compute **fundamentally different things**:
   
   - **VLE** returns **all edge-unique paths** — each as a full path object 
with vertices, edges, and properties. A `MATCH (n)-[*1..3]->(m) RETURN 
count(m)` enumerates every distinct path.
   - **Recursive CTE with UNION** returns **only reachable node IDs** with 
deduplication. It collapses all paths to the same node into one row.
   
   These are not equivalent operations. The CTE answers "which nodes are 
reachable?" while VLE answers "what are all the paths?" The performance 
difference reflects the algorithmic difference in output cardinality, not a bug.
   
   ---
   
   ### The Legitimate Performance Concern
   
   While the diagnosis is wrong, the underlying concern is valid: **VLE can be 
expensive on dense cyclic graphs** because edge-unique path enumeration is 
inherently combinatorial. On a graph where vertex V has high in/out degree and 
participates in many cycles, VLE will enumerate all edge-unique paths through V 
— visiting V many times via different edges. This is correct behavior, but 
expensive.
   
   The test graph in AGE's own regression suite (`regress/sql/cypher_vle.sql`) 
demonstrates this: a 5-vertex graph with self-loops and back-edges produces 
**7,092 undirected paths** with `[*1..∞]`. This is the expected combinatorial 
output for edge-unique enumeration over a cycle-rich topology.
   
   Additionally, when the upper bound is omitted (`[*]` or `[*1..]`), VLE sets 
`uidx_infinite = true` and the only termination guarantee is edge-uniqueness 
depletion. On a graph with |E| edges, the maximum path length is |E|, making 
the worst case O(|E|!) — factorial. This is a genuine scalability risk.
   
   ---
   
   ### Correct Characterization
   
   AGE's VLE implements **edge-isomorphic DFS path enumeration** as required by 
the openCypher specification. The complexity for bounded patterns `[*min..max]` 
is polynomial in graph size. For unbounded patterns, it is exponential in the 
worst case — but this is inherent to the problem of enumerating all edge-unique 
paths, not a missing optimization.
   
   What the reporter actually wants is **vertex-unique reachability** (BFS), 
which is a different operation than what VLE provides. The appropriate 
solutions would be:
   - Implementing `shortestPath()` / `allShortestPaths()` using BFS (not yet 
available in AGE)
   - Adding an optional vertex-unique traversal mode for reachability queries 
where full path enumeration is not needed
   - Adding a configurable safety depth cap (GUC) for unbounded VLE to prevent 
runaway queries on dense graphs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to