Julian Hyde created CALCITE-6716:
------------------------------------

             Summary: WITH REPEATABLE (deterministic CTEs)
                 Key: CALCITE-6716
                 URL: https://issues.apache.org/jira/browse/CALCITE-6716
             Project: Calcite
          Issue Type: Bug
            Reporter: Julian Hyde


Add a REPEATABLE keyword to the WITH clause (that defines common-table 
expressions, CTEs) to ensure that if the CTE is used more than once in the 
query it will return the same set of rows each time (modulo sorting).

In SQL, most functions and relational operators are deterministic (return the 
same results given the same inputs/arguments). Base tables (due to isolation 
semantics) will return the same result if used twice in a query. Views and CTEs 
will return the same results if all of their constituent tables, functions and 
operators are deterministic.

A CTE is repeatable if it is based on a deterministic query. If the query is 
nondeterministic, we can make the CTE deterministic by materializing it 
(storing the results in memory or temp disk the first it is used). In many 
systems CTEs are always materialized or it is an option see 
([Postgres|https://dba.stackexchange.com/questions/257014/are-there-side-effects-to-postgres-12s-not-materialized-directive]).

But what if you want repeatable results without paying the cost of 
materialization? Whether to materialize is a physical decision ('how') that can 
be left to the optimizer, whereas repeatability is a logical requirement 
('what'). SQL works best when the syntax allows people to ask for the 'what' 
without specifying the 'how'.

Here is an example with proposed syntax:
{code}
WITH topManagers AS REPEATABLE (
  SELECT *
  FROM employees
  WHERE job = 'MANAGER'
  ORDER BY sal
  LIMIT 5)
SELECT *
FROM topManagers
MINUS
SELECT *
FROM topManagers
{code}
Although topManagers is nondeterministic -- there could be two managers with 
the same salary -- this query will always return 0 rows.

The implementation requires two tools:
 * Determine whether a query is deterministic
 * Modify a nondeterministic query so that it is deterministic (within the same 
statement execution) and returns one of the valid results. Operation may fail, 
and say 'I cannot make this deterministic'.

h2. Nondeterministic features and how to fix them

Here are some nondeterministic features, and ways to make them deterministic.

h3. LIMIT and OFFSET

A query with LIMIT or OFFSET is nondeterministic when there is ORDER BY or with 
the ORDER BY is not exhaustive (may have ties). Example:
{code}
SELECT * FROM employees ORDER BY sal LIMIT 5
{code}
May be fixed by making the ORDER BY deterministic, e.g. by adding primary key 
to break ties:
{code}
SELECT * FROM employees ORDER BY sal, empno LIMIT 5
{code}
(Not possible if there is no primary key.)

h3. ROWS inside window aggregate

ROWS inside a window aggregate has a similar problem to LIMIT and OFFSET, and 
similar solution:
{code}
SELECT avg(sal) OVER (ORDER BY hiredate ROWS 3 PRECEDING)
FROM employees
{code}
Fixed:
{code}
SELECT avg(sal) OVER (ORDER BY hiredate, empno ROWS 3 PRECEDING)
FROM employees
{code}

h3. RANK and DENSE_RANK

TODO

h3. RANDOM

Not easily fixed.

h3. GROUP BY strings with case-insensitive collation

Suppose job is a VARCHAR column with case-insensitive collation. Suppose that 
the values 'Sales' and 'SALES' both exist, and they collate the same. If we 
execute the query
{code}
SELECT DISTINCT job
FROM employees
{code}
the result will either contain 'Sales' or 'SALES', but which occurs depends on 
which value lands in the hash table first. We can fix as follows:
{code}
SELECT MIN(job COLLATE rawBits)
FROM employees
GROUP BY job
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to