Julian Hyde created CALCITE-6716: ------------------------------------ Summary: WITH REPEATABLE (deterministic CTEs) Key: CALCITE-6716 URL: https://issues.apache.org/jira/browse/CALCITE-6716 Project: Calcite Issue Type: Bug Reporter: Julian Hyde
Add a REPEATABLE keyword to the WITH clause (that defines common-table expressions, CTEs) to ensure that if the CTE is used more than once in the query it will return the same set of rows each time (modulo sorting). In SQL, most functions and relational operators are deterministic (return the same results given the same inputs/arguments). Base tables (due to isolation semantics) will return the same result if used twice in a query. Views and CTEs will return the same results if all of their constituent tables, functions and operators are deterministic. A CTE is repeatable if it is based on a deterministic query. If the query is nondeterministic, we can make the CTE deterministic by materializing it (storing the results in memory or temp disk the first it is used). In many systems CTEs are always materialized or it is an option see ([Postgres|https://dba.stackexchange.com/questions/257014/are-there-side-effects-to-postgres-12s-not-materialized-directive]). But what if you want repeatable results without paying the cost of materialization? Whether to materialize is a physical decision ('how') that can be left to the optimizer, whereas repeatability is a logical requirement ('what'). SQL works best when the syntax allows people to ask for the 'what' without specifying the 'how'. Here is an example with proposed syntax: {code} WITH topManagers AS REPEATABLE ( SELECT * FROM employees WHERE job = 'MANAGER' ORDER BY sal LIMIT 5) SELECT * FROM topManagers MINUS SELECT * FROM topManagers {code} Although topManagers is nondeterministic -- there could be two managers with the same salary -- this query will always return 0 rows. The implementation requires two tools: * Determine whether a query is deterministic * Modify a nondeterministic query so that it is deterministic (within the same statement execution) and returns one of the valid results. Operation may fail, and say 'I cannot make this deterministic'. h2. Nondeterministic features and how to fix them Here are some nondeterministic features, and ways to make them deterministic. h3. LIMIT and OFFSET A query with LIMIT or OFFSET is nondeterministic when there is ORDER BY or with the ORDER BY is not exhaustive (may have ties). Example: {code} SELECT * FROM employees ORDER BY sal LIMIT 5 {code} May be fixed by making the ORDER BY deterministic, e.g. by adding primary key to break ties: {code} SELECT * FROM employees ORDER BY sal, empno LIMIT 5 {code} (Not possible if there is no primary key.) h3. ROWS inside window aggregate ROWS inside a window aggregate has a similar problem to LIMIT and OFFSET, and similar solution: {code} SELECT avg(sal) OVER (ORDER BY hiredate ROWS 3 PRECEDING) FROM employees {code} Fixed: {code} SELECT avg(sal) OVER (ORDER BY hiredate, empno ROWS 3 PRECEDING) FROM employees {code} h3. RANK and DENSE_RANK TODO h3. RANDOM Not easily fixed. h3. GROUP BY strings with case-insensitive collation Suppose job is a VARCHAR column with case-insensitive collation. Suppose that the values 'Sales' and 'SALES' both exist, and they collate the same. If we execute the query {code} SELECT DISTINCT job FROM employees {code} the result will either contain 'Sales' or 'SALES', but which occurs depends on which value lands in the hash table first. We can fix as follows: {code} SELECT MIN(job COLLATE rawBits) FROM employees GROUP BY job {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)