Bruno Messias created SPARK-55348:
-------------------------------------

             Summary: async_track expressions for efficient convergent 
algorithms in recursive CTEs (avoids exponential blowup)
                 Key: SPARK-55348
                 URL: https://issues.apache.org/jira/browse/SPARK-55348
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 4.1.1
            Reporter: Bruno Messias


This proposal introduces a *family of stateful expressions* (`async_track_min`, 
`async_track_max`, `async_track_...`) that prevent exponential row explosion in 
recursive CTEs, e.g. for graph algorithms. These expressions filter redundant 
rows during iteration by tracking the "best" value seen for each key.
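To make the intended semantics concrete, here is a minimal Python sketch of what an `async_track_min`-style filter does conceptually. The class name and `accept` method are illustrative only, not the proposed Spark API:

```python
class AsyncTrackMin:
    """Illustrative stand-in for the proposed async_track_min expression:
    keep the smallest value seen per key, and accept a row only when it
    strictly improves that tracked minimum."""

    def __init__(self):
        self.best = {}  # key -> smallest value observed so far

    def accept(self, key, value):
        current = self.best.get(key)
        if current is None or value < current:
            self.best[key] = value
            return True   # row survives: it improves the tracked minimum
        return False      # redundant row: filtered out of the iteration


tracker = AsyncTrackMin()
rows = [("a", 5), ("a", 3), ("b", 7), ("a", 4), ("b", 2)]
kept = [(k, v) for k, v in rows if tracker.accept(k, v)]
# kept == [("a", 5), ("a", 3), ("b", 7), ("b", 2)] -- ("a", 4) is redundant
```

The key point is that rejected rows never feed the next recursive step, which is exactly where the row explosion would otherwise originate.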

{*}Why it works{*}: For operators like `min`, `max`, and first-seen, the 
result is the same regardless of processing order. Spark's distributed 
execution model is actually an advantage here: zero coordination overhead, 
with correctness guaranteed by the mathematical properties of these operators.
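The order-insensitivity claim can be checked directly: because `min` is commutative, associative, and idempotent, folding the rows in any order yields the same per-key state. A small brute-force check (illustrative, not Spark code):

```python
import itertools


def track_min(rows):
    # Fold rows into per-key minima; min is commutative, associative,
    # and idempotent, so the final state is order-independent.
    best = {}
    for key, value in rows:
        if key not in best or value < best[key]:
            best[key] = value
    return best


rows = [("a", 5), ("b", 2), ("a", 1), ("b", 9)]
# Try every processing order; all 24 permutations converge to one state.
states = {frozenset(track_min(p).items()) for p in itertools.permutations(rows)}
assert len(states) == 1
```

This is why no coordination between tasks is needed: whichever task processes a row first, the merged state converges to the same answer.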

{*}Complexity improvement{*}: Without filtering, recursive CTEs enumerate all 
paths (exponential in graph density). With `async_track`, each key is accepted 
at most once per iteration, reducing worst-case row growth from exponential in 
the number of paths to polynomial in the number of keys.
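A toy comparison illustrates the gap (plain Python simulating the two behaviors, not Spark code): on a complete digraph with 6 vertices, path enumeration from one source produces hundreds of rows, while min-tracking of shortest distances accepts only a handful.

```python
# Tiny complete digraph: every vertex points to every other vertex.
n = 6
edges = [(u, v) for u in range(n) for v in range(n) if u != v]


# Naive recursive CTE behavior: every simple path from vertex 0 is a row.
def count_paths(node, visited):
    total = 0
    for u, v in edges:
        if u == node and v not in visited:
            total += 1 + count_paths(v, visited | {v})
    return total


naive_rows = count_paths(0, {0})  # 325 path-rows on this 6-vertex graph

# With min-tracking: a (vertex, distance) row is kept only if it improves
# the best distance seen for that vertex, so each key is accepted O(1) times.
best = {0: 0}
frontier = [(0, 0)]
tracked_rows = 0
while frontier:
    nxt = []
    for node, dist in frontier:
        for u, v in edges:
            if u == node and (v not in best or dist + 1 < best[v]):
                best[v] = dist + 1
                nxt.append((v, dist + 1))
                tracked_rows += 1
    frontier = nxt
# tracked_rows == 5: one accepted row per reachable vertex
```

The gap (325 vs. 5 rows here) widens rapidly with graph size and density, which is the exponential blowup the proposal targets.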


h3. Motivation: Need for Efficient Recursive CTEs

The introduction of recursive CTEs in Spark was a significant step forward. I 
work in financial services, where we use recursive CTEs for exactly these 
graph-style workloads.

However, there's a hard ceiling: {*}row explosion{*}. Once graphs exceed a 
certain density or depth, queries become impractical due to exponential 
intermediate row growth. This forces teams to either:
 * Limit query scope to small subgraphs
 * Accept long runtimes and high compute costs
 * Move data to external graph systems, breaking the SQL-native workflow and 
moving outside the Databricks environment



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
