peter-toth opened a new pull request, #29210: URL: https://github.com/apache/spark/pull/29210
### What changes were proposed in this pull request?
This PR adds the recursive query feature to Spark SQL.

A recursive query is defined using the `WITH RECURSIVE` keywords and by referring to the name of the common table expression within the query.

The implementation complies with the SQL standard and follows rules similar to those of other relational databases:
- A query is made of an anchor term followed by a recursive term.
- The anchor term doesn't contain a self-reference and is used to initialize the query.
- The recursive term contains a self-reference and is used to expand the current set of rows with new ones.
- The anchor and recursive terms must be combined with each other by the `UNION` or `UNION ALL` operators.
- New rows can only be derived from the rows newly added in the previous iteration (or from the initial set of rows of the anchor term). This limitation implies that recursive references can't be used with some joins, aggregations or subqueries.

Please see `cte-recursion.sql` and `with.sql` for some examples.

Please note that this PR focuses on the minimal working implementation, which means:
- SQL recursion is actually a loop in which the current iteration is computed from the previous iteration's result, and the loop ends when an iteration returns no rows. The final result is the union of all iteration results. Caching intermediate results could therefore speed up the process, but caching was removed from this PR to reduce complexity and can be added back in a follow-up PR.
- A common way to stop SQL recursion is to use the LIMIT operator so that no more than the required number of rows is computed. LIMIT support was removed from this PR to reduce complexity and can be added back in a follow-up PR.
- Some relational databases are more relaxed about how many anchor and recursive terms a recursion can contain. This PR supports only the simplest case: exactly one anchor term and one recursive term. A follow-up PR can relax this limitation.

### Why are the changes needed?
Recursive queries are an ANSI SQL feature that is useful for processing hierarchical data.

### Does this PR introduce _any_ user-facing change?
Yes, it adds the recursive query feature.

### How was this patch tested?
Added new UTs and tests in `cte-recursion.sql` and `with.sql`.
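
For illustration only, the loop semantics described under "What changes were proposed" can be sketched in plain Python. This is not the Spark implementation: `evaluate_recursive`, `recursive_step` and `max_iterations` are hypothetical names made up for this sketch, rows are modelled as tuples, and `UNION` is assumed to drop rows that were already produced, as in standard SQL. SQLite (via Python's standard `sqlite3` module) is used as a stand-in engine only to show the `WITH RECURSIVE` syntax and to cross-check the result.

```python
import sqlite3


def evaluate_recursive(anchor_rows, recursive_step, union_all=True, max_iterations=1000):
    """Evaluate a recursive query as a plain loop.

    anchor_rows: rows returned by the anchor term.
    recursive_step: hypothetical callable standing in for the recursive term;
        it receives only the rows added by the previous iteration.
    union_all: True keeps duplicates (UNION ALL), False drops rows that were
        already produced (UNION).
    """
    seen = set()

    def keep(rows):
        # UNION ALL keeps every row; UNION keeps only rows not produced before.
        if union_all:
            return list(rows)
        fresh = []
        for row in rows:
            if row not in seen:
                seen.add(row)
                fresh.append(row)
        return fresh

    previous = keep(anchor_rows)  # iteration 0: the anchor term
    result = list(previous)
    for _ in range(max_iterations):
        previous = keep(recursive_step(previous))
        if not previous:  # an iteration that returns no rows ends the loop
            return result
        result.extend(previous)  # the final result is the union of all iterations
    raise RuntimeError("recursion did not terminate within max_iterations")


# Count from 1 to 5: the anchor yields (1,), the recursive term adds n + 1 while n < 5.
count_to_five = evaluate_recursive(
    [(1,)], lambda rows: [(n + 1,) for (n,) in rows if n < 5]
)

# The same query written in SQL and run on SQLite (a stand-in, not Spark).
QUERY = """
WITH RECURSIVE t(n) AS (
  SELECT 1
  UNION ALL
  SELECT n + 1 FROM t WHERE n < 5
)
SELECT n FROM t
"""
conn = sqlite3.connect(":memory:")
sqlite_rows = conn.execute(QUERY).fetchall()
conn.close()

# Reachability in a cyclic graph: with UNION semantics the loop still terminates.
EDGES = {1: [2], 2: [1]}


def neighbours(rows):
    return [(dst,) for (src,) in rows for dst in EDGES.get(src, [])]


reachable = evaluate_recursive([(1,)], neighbours, union_all=False)
```

In this sketch the cyclic example terminates only because `UNION` semantics filter out rows that were already seen; with `union_all=True` the same graph would loop until the hypothetical `max_iterations` guard raises, which is the kind of runaway recursion that a stop condition such as LIMIT is commonly used to prevent.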
