Bruno Messias created SPARK-55348:
-------------------------------------
Summary: async_track expressions for efficient convergent
algorithms in recursive CTEs (avoids exponential blowup)
Key: SPARK-55348
URL: https://issues.apache.org/jira/browse/SPARK-55348
Project: Spark
Issue Type: New Feature
Components: SQL
Affects Versions: 4.1.1
Reporter: Bruno Messias
This proposal introduces a *family of stateful expressions* (`async_track_min`,
`async_track_max`, `async_track_...`) that prevent exponential row explosion in
recursive CTEs, for example in graph algorithms. These expressions filter out
redundant rows during iteration by tracking the "best" value seen for each key.
{*}Why it works{*}: For operators like `min`, `max`, and first-seen, the result
is the same regardless of processing order. Spark's distributed execution is
actually an advantage here: zero coordination overhead, with correctness
guaranteed by the mathematical properties of these operators (commutativity,
associativity, and idempotence).
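The order-independence argument above can be illustrated with a small, Spark-free sketch. The `track_min` helper below is a hypothetical stand-in for the per-key state an `async_track_min` expression would maintain; the point is only that `min` is commutative, associative, and idempotent, so every processing order converges to the same tracked state.

```python
# Illustrative sketch (not Spark code): per-key min tracking is
# order-independent because min is commutative, associative, idempotent.
from itertools import permutations

def track_min(state, key, value):
    """Accept a (key, value) row only if it improves the tracked best.

    Returns (new_state, accepted). A hypothetical model of what an
    async_track_min expression would do per key.
    """
    best = state.get(key)
    if best is None or value < best:
        return {**state, key: value}, True
    return state, False

rows = [("a", 3), ("a", 1), ("b", 2), ("a", 1), ("b", 5)]

# Replay the same rows in every possible order; the final state is
# identical each time, so no cross-task coordination is needed.
finals = set()
for perm in permutations(rows):
    state = {}
    for k, v in perm:
        state, _ = track_min(state, k, v)
    finals.add(tuple(sorted(state.items())))

assert finals == {(("a", 1), ("b", 2))}
```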
{*}Complexity improvement{*}: Without filtering, recursive CTEs enumerate all
paths (exponential in graph density). With `async_track`, each key is accepted
at most once per iteration, so per-iteration row growth is bounded by the
number of distinct keys rather than the number of paths.
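To make the complexity claim concrete, here is a small simulation on a hypothetical complete graph (an example chosen here, not taken from the proposal). It compares per-iteration row counts for naive path enumeration, which models an unfiltered recursive CTE, against accepting each vertex key at most once:

```python
# Illustrative sketch (not Spark code): rows per iteration for naive path
# enumeration vs. key-filtered expansion on a complete graph with 6 nodes.
from collections import defaultdict

n = 6
adj = defaultdict(list)
for u in range(n):
    for v in range(n):
        if u != v:
            adj[u].append(v)

# Unfiltered recursive CTE: extend every simple path by one edge per step.
paths = [(0,)]
naive_rows = []
for _ in range(3):
    paths = [p + (v,) for p in paths for v in adj[p[-1]] if v not in p]
    naive_rows.append(len(paths))  # naive_rows == [5, 20, 60]

# async_track-style filtering: each key (vertex) accepted at most once.
seen = {0}
frontier = {0}
filtered_rows = []
for _ in range(3):
    frontier = {v for u in frontier for v in adj[u] if v not in seen}
    seen |= frontier
    filtered_rows.append(len(frontier))  # filtered_rows == [5, 0, 0]

# Path counts grow multiplicatively; filtered counts are bounded by n.
assert naive_rows == [5, 20, 60]
assert filtered_rows == [5, 0, 0]
```

The unfiltered branch multiplies by the remaining out-degree every iteration, while the filtered branch converges as soon as every key has been accepted once.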
h3. Motivation: Need for Efficient Recursive CTEs
The introduction of recursive CTEs in Spark was a significant step forward. I
work in financial services, where we use recursive CTEs for graph-shaped
workloads.
However, there's a hard ceiling: {*}row explosion{*}. Once graphs exceed a
certain density or depth, queries become impractical due to exponential
intermediate row growth. This forces teams to either:
* Limit query scope to small subgraphs
* Accept long runtimes and high compute costs
* Move data to external graph systems, breaking the SQL-native workflow and
leaving the Spark/Databricks environment
--
This message was sent by Atlassian Jira
(v8.20.10#820010)