[
https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julian Hyde updated CALCITE-6188:
---------------------------------
Description:
Devise extensions to SQL for multi-query optimization (MQO). Queries with
multiple input tables, multiple intermediate tables, and multiple output data
sets (result sets and DML operations) can be defined in one SQL statement,
optimized, and executed atomically.
There are many flavors of multi-query optimization, depending on whether each
occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or
"several". Our goal, here, is to allow them all to be expressed. We can then
devise planning strategies that work for particular flavors.
Examples of multi-queries:
* {*}Multiple DML outputs{*}. An INSERT statement that writes into a table but
also updates an index,
* {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that
represents an ETL/ELT job;
* {*}Multiple query outputs{*}. A query that produces several data sets (say a
list of invoices for
orders and a list of products that need to be restocked);
* {*}DAG query{*}. A query that uses intermediate results more than once.
See discussion in the [Multi-query optimization email
thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k].
Here are some SQL examples.
We add a new keyword {{MULTI}} that represents a statement whose output
contains multiple data sets and DML operations, each with a unique name. For
intermediate results, we use the existing {{WITH}} clause.
h3. 1. Multi DML
Read from one or more tables, write to one or more tables.
An example is inserting into a table and also an index on that table
(represented as a sorted table).
{code:sql}
WITH
empDelta AS (
SELECT * FROM emp2
EXCEPT
SELECT * FROM emp)
MULTI
insertEmp AS (
INSERT INTO emp
TABLE empDelta),
insertEmpDeptno AS (
MERGE empDeptno AS e
USING TABLE empDelta AS d
ON e.deptno = d.deptno
WHEN NOT MATCHED THEN INSERT VALUES (deptno));
{code}
h3. 2. Query that creates temporary table and uses it more than once
{code:sql}
WITH
temp AS (
SELECT *
FROM emp AS e
JOIN dept USING (deptno)
WHERE e.job = 'MANAGER'
OR d.location = 'CHICAGO')
SELECT deptno,
(SELECT AVG(sal)
FROM temp AS t
WHERE t.deptno = e.deptno) AS deptAvgSal,
(SELECT AVG(sal)
FROM temp AS t
WHERE t.job = e.job) AS jobAvgSal
FROM e
WHERE e.deptno IN (10, 20);
{code}
h3. 3. Query whose optimal plan might use a temporary table
This query produces the same result as the previous query. There is a common
relational expression, so the optimizer should consider a DAG plan with a
reified intermediate result.
{code:sql}
SELECT deptno,
(SELECT AVG(e2.sal)
FROM emp AS e2
JOIN dept AS d USING (deptno)
WHERE (e2.job = 'MANAGER'
OR d.location = 'CHICAGO')
AND e2.deptno = e.deptno) AS deptAvgSal,
(SELECT AVG(e3.sal)
FROM emp AS e3
JOIN dept AS d USING (deptno)
WHERE (e3.job = 'MANAGER'
OR d.location = 'CHICAGO')
AND e3.job = e.job) AS jobAvgSal
FROM e
WHERE e.deptno IN (10, 20);
{code}
h3. 4. Query that produces several data sets
{code:sql}
WITH
newOrders AS (
SELECT *
FROM orders
WHERE orderDate > DATE '2023-01-25')
MULTI
invoices AS (
SELECT customerName, SUM(amount)
FROM newOrders
GROUP BY customerName),
restock AS (
SELECT productId
FROM inventory
WHERE productId IN (
SELECT productId FROM newOrders)
AND itemsOnHand < 10);
{code}
h3. 5. Query with a complex DAG, multiple output data sets and one DML
{code:sql}
WITH
t0 AS (
SELECT * FROM t WHERE x > 5),
t00 AS (
SELECT * FROM t0 WHERE y < 10),
t000 AS (
SELECT x, MIN(y) FROM t00 GROUP BY x),
t1 AS (
SELECT * FROM t WHERE y > 3),
t10 AS (
SELECT * FROM t1 WHERE x < 8),
t2 AS (
SELECT * FROM t000
INTERSECT
SELECT * FROM t10),
t3 AS (
SELECT * FROM u
INTERSECT
SELECT * FROM t00)
MULTI
q0 AS (SELECT * FROM t3),
q1 AS (SELECT * FROM t2),
d0 AS (DELETE FROM v WHERE v.x IN (SELECT x FROM t1));
{code}
The data flow is the following DAG:
{noformat}
u --------------------+
+--------------> t3 (q0)
+--> t0 --> t00 --+
| +--> t000 --+
t --+ +--> t2 (q1)
| +--> t10 -----------+
+--> t1 --+
+----------------------> d0
{noformat}
was:
Devise extensions to SQL so that queries with multiple input tables, multiple
intermediate tables, and multiple output data sets (result sets and DML
operations) can be defined in one SQL statement, optimized, and executed
atomically.
There are many flavors of multi-query optimization, depending on whether each
occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or
"several". Our goal, here, is to allow them all to be expressed. We can then
devise planning strategies that work for particular flavors.
Examples of multi-queries:
* {*}Multiple DML outputs{*}. An INSERT statement that writes into a table but
also updates an index,
* {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that
represents an ETL/ELT job;
* {*}Multiple query outputs{*}. A query that produces several data sets (say a
list of invoices for
orders and a list of products that need to be restocked);
* {*}DAG query{*}. A query that uses intermediate results more than once.
See discussion in the [Multi-query optimization email
thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k].
Here are some SQL examples.
We add a new keyword {{MULTI}} that represents a statement whose output
contains multiple data sets and DML operations, each with a unique name. For
intermediate results, we use the existing {{WITH}} clause.
h3. 1. Multi DML
Read from one or more tables, write to one or more tables.
An example is inserting into a table and also an index on that table
(represented as a sorted table).
{code:sql}
WITH
empDelta AS (
SELECT * FROM emp2
EXCEPT
SELECT * FROM emp)
MULTI
insertEmp AS (
INSERT INTO emp
TABLE empDelta),
insertEmpDeptno AS (
MERGE empDeptno AS e
USING TABLE empDelta AS d
ON e.deptno = d.deptno
WHEN NOT MATCHED THEN INSERT VALUES (deptno));
{code}
h3. 2. Query that creates temporary table and uses it more than once
{code:sql}
WITH
temp AS (
SELECT *
FROM emp AS e
JOIN dept USING (deptno)
WHERE e.job = 'MANAGER'
OR d.location = 'CHICAGO')
SELECT deptno,
(SELECT AVG(sal)
FROM temp AS t
WHERE t.deptno = e.deptno) AS deptAvgSal,
(SELECT AVG(sal)
FROM temp AS t
WHERE t.job = e.job) AS jobAvgSal
FROM e
WHERE e.deptno IN (10, 20);
{code}
h3. 3. Query whose optimal plan might use a temporary table
This query produces the same result as the previous query. There is a common
relational expression, so the optimizer should consider a DAG plan with a
reified intermediate result.
{code:sql}
SELECT deptno,
(SELECT AVG(e2.sal)
FROM emp AS e2
JOIN dept AS d USING (deptno)
WHERE (e2.job = 'MANAGER'
OR d.location = 'CHICAGO')
AND e2.deptno = e.deptno) AS deptAvgSal,
(SELECT AVG(e3.sal)
FROM emp AS e3
JOIN dept AS d USING (deptno)
WHERE (e3.job = 'MANAGER'
OR d.location = 'CHICAGO')
AND e3.job = e.job) AS jobAvgSal
FROM e
WHERE e.deptno IN (10, 20);
{code}
h3. 4. Query that produces several data sets
{code:sql}
WITH
newOrders AS (
SELECT *
FROM orders
WHERE orderDate > DATE '2023-01-25')
MULTI
invoices AS (
SELECT customerName, SUM(amount)
FROM newOrders
GROUP BY customerName),
restock AS (
SELECT productId
FROM inventory
WHERE productId IN (
SELECT productId FROM newOrders)
AND itemsOnHand < 10);
{code}
h3. 5. Query with a complex DAG, multiple output data sets and one DML
{code:sql}
WITH
t0 AS (
SELECT * FROM t WHERE x > 5),
t00 AS (
SELECT * FROM t0 WHERE y < 10),
t000 AS (
SELECT x, MIN(y) FROM t00 GROUP BY x),
t1 AS (
SELECT * FROM t WHERE y > 3),
t10 AS (
SELECT * FROM t1 WHERE x < 8),
t2 AS (
SELECT * FROM t000
INTERSECT
SELECT * FROM t10),
t3 AS (
SELECT * FROM u
INTERSECT
SELECT * FROM t00)
MULTI
q0 AS (SELECT * FROM t3),
q1 AS (SELECT * FROM t2),
d0 AS (DELETE FROM v WHERE v.x IN (SELECT x FROM t1));
{code}
The data flow is the following DAG:
{noformat}
u --------------------+
+--------------> t3 (q0)
+--> t0 --> t00 --+
| +--> t000 --+
t --+ +--> t2 (q1)
| +--> t10 -----------+
+--> t1 --+
+----------------------> d0
{noformat}
> Multi-query optimization
> ------------------------
>
> Key: CALCITE-6188
> URL: https://issues.apache.org/jira/browse/CALCITE-6188
> Project: Calcite
> Issue Type: Improvement
> Reporter: Julian Hyde
> Priority: Major
>
> Devise extensions to SQL for multi-query optimization (MQO). Queries with
> multiple input tables, multiple intermediate tables, and multiple output data
> sets (result sets and DML operations) can be defined in one SQL statement,
> optimized, and executed atomically.
> There are many flavors of multi-query optimization, depending on whether each
> occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or
> "several". Our goal, here, is to allow them all to be expressed. We can then
> devise planning strategies that work for particular flavors.
> Examples of multi-queries:
> * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table
> but also updates an index,
> * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that
> represents an ETL/ELT job;
> * {*}Multiple query outputs{*}. A query that produces several data sets (say
> a list of invoices for
> orders and a list of products that need to be restocked);
> * {*}DAG query{*}. A query that uses intermediate results more than once.
> See discussion in the [Multi-query optimization email
> thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k].
> Here are some SQL examples.
> We add a new keyword {{MULTI}} that represents a statement whose output
> contains multiple data sets and DML operations, each with a unique name. For
> intermediate results, we use the existing {{WITH}} clause.
> h3. 1. Multi DML
> Read from one or more tables, write to one or more tables.
> An example is inserting into a table and also an index on that table
> (represented as a sorted table).
> {code:sql}
> WITH
> empDelta AS (
> SELECT * FROM emp2
> EXCEPT
> SELECT * FROM emp)
> MULTI
> insertEmp AS (
> INSERT INTO emp
> TABLE empDelta),
> insertEmpDeptno AS (
> MERGE empDeptno AS e
> USING TABLE empDelta AS d
> ON e.deptno = d.deptno
> WHEN NOT MATCHED THEN INSERT VALUES (deptno));
> {code}
> h3. 2. Query that creates temporary table and uses it more than once
> {code:sql}
> WITH
> temp AS (
> SELECT *
> FROM emp AS e
> JOIN dept USING (deptno)
> WHERE e.job = 'MANAGER'
> OR d.location = 'CHICAGO')
> SELECT deptno,
> (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.deptno = e.deptno) AS deptAvgSal,
> (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 3. Query whose optimal plan might use a temporary table
> This query produces the same result as the previous query. There is a common
> relational expression, so the optimizer should consider a DAG plan with a
> reified intermediate result.
> {code:sql}
> SELECT deptno,
> (SELECT AVG(e2.sal)
> FROM emp AS e2
> JOIN dept AS d USING (deptno)
> WHERE (e2.job = 'MANAGER'
> OR d.location = 'CHICAGO')
> AND e2.deptno = e.deptno) AS deptAvgSal,
> (SELECT AVG(e3.sal)
> FROM emp AS e3
> JOIN dept AS d USING (deptno)
> WHERE (e3.job = 'MANAGER'
> OR d.location = 'CHICAGO')
> AND e3.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 4. Query that produces several data sets
> {code:sql}
> WITH
> newOrders AS (
> SELECT *
> FROM orders
> WHERE orderDate > DATE '2023-01-25')
> MULTI
> invoices AS (
> SELECT customerName, SUM(amount)
> FROM newOrders
> GROUP BY customerName),
> restock AS (
> SELECT productId
> FROM inventory
> WHERE productId IN (
> SELECT productId FROM newOrders)
> AND itemsOnHand < 10);
> {code}
> h3. 5. Query with a complex DAG, multiple output data sets and one DML
> {code:sql}
> WITH
> t0 AS (
> SELECT * FROM t WHERE x > 5),
> t00 AS (
> SELECT * FROM t0 WHERE y < 10),
> t000 AS (
> SELECT x, MIN(y) FROM t00 GROUP BY x),
> t1 AS (
> SELECT * FROM t WHERE y > 3),
> t10 AS (
> SELECT * FROM t1 WHERE x < 8),
> t2 AS (
> SELECT * FROM t000
> INTERSECT
> SELECT * FROM t10),
> t3 AS (
> SELECT * FROM u
> INTERSECT
> SELECT * FROM t00)
> MULTI
> q0 AS (SELECT * FROM t3),
> q1 AS (SELECT * FROM t2),
> d0 AS (DELETE FROM v WHERE v.x IN (SELECT x FROM t1));
> {code}
> The data flow is the following DAG:
> {noformat}
> u --------------------+
> +--------------> t3 (q0)
> +--> t0 --> t00 --+
> | +--> t000 --+
> t --+ +--> t2 (q1)
> | +--> t10 -----------+
> +--> t1 --+
> +----------------------> d0
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)