[jira] [Commented] (CALCITE-6188) Multi-query optimization

Stamatis Zampetakis (Jira) Mon, 22 Jan 2024 06:37:29 -0800


    [ 
https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809478#comment-17809478
 ]


Stamatis Zampetakis commented on CALCITE-6188:
----------------------------------------------

Regarding the Multi DML use case, some systems (e.g., Hive, Snowflake) provide 
a SQL syntax for inserting data into multiple tables, usually know as MULTI 
TABLE INSERT statements.

The Hive syntax can be found 
[here|https://cwiki.apache.org/confluence/display/hive/languagemanual+dml#LanguageManualDML-InsertingdataintoHiveTablesfromqueries]
 and a basic example is outlined below. 
{code:sql}
FROM (SELECT * FROM emp2 EXCEPT SELECT * FROM emp) empDelta
INSERT INTO TABLE tbl1 SELECT * WHERE empno > 100
INSERT INTO TABLE tbl2 SELECT * WHERE empno < 50;
{code}
The Snowflake syntax along with examples can be found 
[here|https://docs.snowflake.com/en/sql-reference/sql/insert-multi-table].

> Multi-query optimization
> ------------------------
>
>                 Key: CALCITE-6188
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6188
>             Project: Calcite
>          Issue Type: Improvement
>            Reporter: Julian Hyde
>            Priority: Major
>
> Devise extensions to SQL for multi-query optimization (MQO). Queries with 
> multiple input tables, multiple intermediate tables, and multiple output data 
> sets (result sets and DML operations) can be defined in one SQL statement, 
> optimized, and executed atomically.
> There are many flavors of multi-query optimization, depending on whether each 
> occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or 
> "several". Our goal, here, is to allow them all to be expressed. We can then 
> devise planning strategies that work for particular flavors.
> Examples of multi-queries:
>  * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table 
> but also updates an index,
>  * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that 
> represents an ETL/ELT job;
>  * {*}Multiple query outputs{*}. A query that produces several data sets (say 
> a list of invoices for
> orders and a list of products that need to be restocked);
>  * {*}DAG query{*}. A query that uses intermediate results more than once.
> See discussion in the [Multi-query optimization email 
> thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k].
> Here are some SQL examples.
> We add a new keyword {{MULTI}} that represents a statement whose output 
> contains multiple data sets and DML operations, each with a unique name. For 
> intermediate results, we use the existing {{WITH}} clause.
> h3. 1. Multi DML
> Read from one or more tables, write to one or more tables.
> An example is inserting into a table and also an index on that table 
> (represented as a sorted table).
> {code:sql}
> WITH
>   empDelta AS (
>       SELECT * FROM emp2
>       EXCEPT
>       SELECT * FROM emp)
> MULTI
>   insertEmp AS (
>     INSERT INTO emp
>     TABLE empDelta),
>   insertEmpDeptno AS (
>     MERGE empDeptno AS e
>     USING TABLE empDelta AS d
>     ON e.deptno = d.deptno
>     WHEN NOT MATCHED THEN INSERT VALUES (deptno));
> {code}
> h3. 2. Query that creates temporary table and uses it more than once
> {code:sql}
> WITH
>   temp AS (
>     SELECT *
>     FROM emp AS e
>     JOIN dept USING (deptno)
>     WHERE e.job = 'MANAGER'
>     OR d.location = 'CHICAGO')
> SELECT deptno,
>   (SELECT AVG(sal)
>     FROM temp AS t
>     WHERE t.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(sal)
>     FROM temp AS t
>     WHERE t.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 3. Query whose optimal plan might use a temporary table
> This query produces the same result as the previous query. There is a common 
> relational expression, so the optimizer should consider a DAG plan with a 
> reified intermediate result.
> {code:sql}
> SELECT deptno,
>   (SELECT AVG(e2.sal)
>     FROM emp AS e2
>     JOIN dept AS d USING (deptno)
>     WHERE (e2.job = 'MANAGER'
>       OR d.location = 'CHICAGO')
>     AND e2.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(e3.sal)
>     FROM emp AS e3
>     JOIN dept AS d USING (deptno)
>     WHERE (e3.job = 'MANAGER'
>       OR d.location = 'CHICAGO')
>     AND e3.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 4. Query that produces several data sets
> {code:sql}
> WITH
>   newOrders AS (
>     SELECT *
>     FROM orders
>     WHERE orderDate > DATE '2023-01-25')
> MULTI
>   invoices AS (
>     SELECT customerName, SUM(amount)
>     FROM newOrders
>     GROUP BY customerName),
>   restock AS (
>     SELECT productId
>     FROM inventory
>     WHERE productId IN (
>       SELECT productId FROM newOrders)
>     AND itemsOnHand < 10);
> {code}
> h3. 5. Query with a complex DAG, multiple output data sets and one DML
> {code:sql}
> WITH
>   t0 AS (
>     SELECT * FROM t WHERE x > 5),
>   t00 AS (
>     SELECT * FROM t0 WHERE y < 10),
>   t000 AS (
>     SELECT x, MIN(y) FROM t00 GROUP BY x),
>   t1 AS (
>     SELECT * FROM t WHERE y > 3),
>   t10 AS (
>     SELECT * FROM t1 WHERE x < 8),
>   t2 AS (
>     SELECT * FROM t000
>     INTERSECT
>     SELECT * FROM t10),
>   t3 AS (
>     SELECT * FROM u
>     INTERSECT
>     SELECT * FROM t00)
> MULTI
>   q0 AS (SELECT * FROM t3),
>   q1 AS (SELECT * FROM t2),
>   d0 AS (DELETE FROM v WHERE v.x IN (SELECT x FROM t1));
> {code}
> The data flow is the following DAG:
> {noformat}
> u --------------------+
>                       +--------------> t3 (q0)
>     +--> t0 --> t00 --+
>     |                 +--> t000 --+
> t --+                             +--> t2 (q1)
>     |         +--> t10 -----------+
>     +--> t1 --+
>               +----------------------> d0
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (CALCITE-6188) Multi-query optimization

Reply via email to