[jira] [Updated] (CALCITE-4707) Optimize incremental maintenance of materialized views

Julian Hyde (Jira) Wed, 28 Jul 2021 15:37:04 -0700


     [ 
https://issues.apache.org/jira/browse/CALCITE-4707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Julian Hyde updated CALCITE-4707:
---------------------------------
    Description: 
Optimize incremental maintenance of materialized views, when we know what DML 
has occurred since the last build.

The goal here is to develop an algebraic approach (i.e. new relational 
operators and transformation rules) that will generate an optimal query for 
maintaining the view.

Consider a materialized view
{code}
CREATE TABLE EmpsByDeptno AS
  SELECT deptno, COUNT(*) AS c, SUM(sal) AS ss
  FROM Emps
  GROUP BY deptno;
{code}

We built it at time T1, when the value of {{Emps}} was {{Emps1}}, and it had 
the value {{EmpsByDeptno1}}.

It is now T2, and since then, new employees have been created in the 
{{NewEmps}} table. Thus {{Emps2 = Emps1 UNION ALL NewEmps}}. We could build a 
new MV,
{code}
CREATE TABLE EmpsByDeptno2 AS
  SELECT deptno, COUNT(*) AS c, SUM(sal) AS ss
  FROM (SELECT * FROM Emps1
      UNION ALL
      SELECT * FROM NewEmps)
  GROUP BY deptno;
{code}but we would prefer to generate a DML command to modify {{EmpsByDeptno}} 
in place:{code}
MERGE INTO EmpsByDeptno AS e
  USING (SELECT deptno, COUNT(*) AS c, SUM(sal) AS ss
      FROM NewEmps) AS de
  ON de.deptno = e.deptno
  WHEN MATCHED THEN
    UPDATE SET c = c + de.c, ss = ss + de.ss
  WHEN NOT MATCHED THEN
    INSERT (deptno, c, ss)
{code}

We propose two new relational operators:
 * {{Diff(S, R)}} computes the difference between two relations. It has a same 
schema as R and S but with an additional column {{delta}}. Rows in S but not in 
R are called insertions, and have delta=1. Rows that are in R but not in S are 
called deletions, and have delta=-1.
 * {{Patch(R, P)}} applies a difference {{P}} to a relation. {{P}} has the same 
schema as {{R}} but with an additional delta column. The output is the rows of 
R, minus the rows in {{P}} that have delta=-1, union the rows in {{P}} that 
have delta=1.

The relational operators are named by analogy with the UNIX utilities {{diff}} 
and {{patch}}. (But {{Diff}}'s arguments are reversed compared to {{diff}}.) 
Consider:
{code}
grep r /usr/share/dict/words  > r.txt
grep s /usr/share/dict/words  > s.txt
diff r.txt s.txt > p.patch
patch -p1 r.txt < p.patch
cmp r.txt s.txt.  # r and s are now identical
{code}

The relation {code}R = Patch(S, Diff(R, S)){code} always holds. {{Diff}} 
computes the difference between R and S, and when {{Patch}} applies that 
difference to {{S}} you get {{R}}.

{{Patch}} and {{Diff}} are intended by be generalizations of set operators. If 
there are only additions, i.e. {{R}} is a superset of {{S}}, then you can 
substitute {{Minus}} for {{Diff}}, and {{Union}} for {{Patch}}:
{code}
R = Union(S, Minus(R, S))
{code}

So, how do these relational operators solve our original problem?

An {{INSERT}} statement is assignment to a relation of itself union some new 
rows. ({{INSERT INTO R SELECT * FROM P}} is {{R &larr; R UNION P}}).

A {{MERGE}} statement is similar to {{INSERT}} but allows updates.

The {{MERGE}} query above is represented as {code}EmpsByDeptno &larr; 
Patch(EmpsByDeptno, Diff(NewEmpsByDeptno, EmpsByDeptno)){code}

Lastly we need relational transformation rules to optimize the expressions into 
per-key updates.

  was:
Optimize incremental maintenance of materialized views, when we know what DML 
has occurred since the last build.

The goal here is to develop an algebraic approach (i.e. new relational 
operators and transformation rules) that will generate an optimal query for 
maintaining the view.

Consider a materialized view
{code}
CREATE TABLE EmpsByDeptno AS
  SELECT deptno, COUNT(*) AS c, SUM(sal) AS ss
  FROM Emps
  GROUP BY deptno;
{code}

We built it at time T1, when the value of {{Emps}} was {{Emps1}}, and it had 
the value {{EmpsByDeptno1}}.

It is now T2, and since then, new employees have been created in the 
{{NewEmps}} table. Thus {{Emps2 = Emps1 UNION ALL NewEmps}}. We could build a 
new MV,
{code}
CREATE TABLE EmpsByDeptno2 AS
  SELECT deptno, COUNT(*) AS c, SUM(sal) AS ss
  FROM (SELECT * FROM Emps1
      UNION ALL
      SELECT * FROM NewEmps)
  GROUP BY deptno;
{code}but we would prefer to generate a DML command to modify {{EmpsByDeptno}} 
in place:{code}
MERGE INTO EmpsByDeptno AS e
  USING (SELECT deptno, COUNT(*) AS c, SUM(sal) AS ss
      FROM NewEmps) AS de
  ON de.deptno = e.deptno
  WHEN MATCHED THEN
    UPDATE SET c = c + de.c, ss = ss + de.ss
  WHEN NOT MATCHED THEN
    INSERT (deptno, c, ss)
{code}

We propose two new relational operators:
 * {{Diff(S, R)}} computes the difference between two relations. It has a same 
schema as R and S but with an additional column {{delta}}. Rows in S but not in 
R are called insertions, and have delta=1. Rows that are in R but not in S are 
called deletions, and have delta=-1.
 * {{Patch(R, P)}} applies 

The relational operators are named by analogy with the UNIX utilities {{diff}} 
and {{patch}}. (But {{Diff}}'s arguments are reversed compared to {{diff}}.) 
Consider:
{code}
grep r /usr/share/dict/words  > r.txt
grep s /usr/share/dict/words  > s.txt
diff r.txt s.txt > p.patch
patch -p1 r.txt < p.patch
cmp r.txt s.txt.  # r and s are now identical
{code}

The relation {code}R = Patch(S, Diff(R, S)){code} always holds. {{Diff}} 
computes the difference between R and S, and when {{Patch}} applies that 
difference to {{S}} you get {{R}}.

{{Patch}} and {{Diff}} are intended by be generalizations of set operators. If 
there are only additions, i.e. {{R}} is a superset of {{S}}, then you can 
substitute {{Minus}} for {{Diff}}, and {{Union}} for {{Patch}}:
{code}
R = Union(S, Minus(R, S))
{code}

So, how do these relational operators solve our original problem?

An {{INSERT}} statement is assignment to a relation of itself union some new 
rows. ({{INSERT INTO R SELECT * FROM P}} is {{R &larr; R UNION P}}).

A {{MERGE}} statement is similar to {{INSERT}} but allows updates.

The {{MERGE}} query above is represented as {code}EmpsByDeptno &larr; 
Patch(EmpsByDeptno, Diff(NewEmpsByDeptno, EmpsByDeptno)){code}

Lastly we need relational transformation rules to optimize the expressions into 
per-key updates.


> Optimize incremental maintenance of materialized views
> ------------------------------------------------------
>
>                 Key: CALCITE-4707
>                 URL: https://issues.apache.org/jira/browse/CALCITE-4707
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Julian Hyde
>            Priority: Major
>
> Optimize incremental maintenance of materialized views, when we know what DML 
> has occurred since the last build.
> The goal here is to develop an algebraic approach (i.e. new relational 
> operators and transformation rules) that will generate an optimal query for 
> maintaining the view.
> Consider a materialized view
> {code}
> CREATE TABLE EmpsByDeptno AS
>   SELECT deptno, COUNT(*) AS c, SUM(sal) AS ss
>   FROM Emps
>   GROUP BY deptno;
> {code}
> We built it at time T1, when the value of {{Emps}} was {{Emps1}}, and it had 
> the value {{EmpsByDeptno1}}.
> It is now T2, and since then, new employees have been created in the 
> {{NewEmps}} table. Thus {{Emps2 = Emps1 UNION ALL NewEmps}}. We could build a 
> new MV,
> {code}
> CREATE TABLE EmpsByDeptno2 AS
>   SELECT deptno, COUNT(*) AS c, SUM(sal) AS ss
>   FROM (SELECT * FROM Emps1
>       UNION ALL
>       SELECT * FROM NewEmps)
>   GROUP BY deptno;
> {code}but we would prefer to generate a DML command to modify 
> {{EmpsByDeptno}} in place:{code}
> MERGE INTO EmpsByDeptno AS e
>   USING (SELECT deptno, COUNT(*) AS c, SUM(sal) AS ss
>       FROM NewEmps) AS de
>   ON de.deptno = e.deptno
>   WHEN MATCHED THEN
>     UPDATE SET c = c + de.c, ss = ss + de.ss
>   WHEN NOT MATCHED THEN
>     INSERT (deptno, c, ss)
> {code}
> We propose two new relational operators:
>  * {{Diff(S, R)}} computes the difference between two relations. It has a 
> same schema as R and S but with an additional column {{delta}}. Rows in S but 
> not in R are called insertions, and have delta=1. Rows that are in R but not 
> in S are called deletions, and have delta=-1.
>  * {{Patch(R, P)}} applies a difference {{P}} to a relation. {{P}} has the 
> same schema as {{R}} but with an additional delta column. The output is the 
> rows of R, minus the rows in {{P}} that have delta=-1, union the rows in 
> {{P}} that have delta=1.
> The relational operators are named by analogy with the UNIX utilities 
> {{diff}} and {{patch}}. (But {{Diff}}'s arguments are reversed compared to 
> {{diff}}.) Consider:
> {code}
> grep r /usr/share/dict/words  > r.txt
> grep s /usr/share/dict/words  > s.txt
> diff r.txt s.txt > p.patch
> patch -p1 r.txt < p.patch
> cmp r.txt s.txt.  # r and s are now identical
> {code}
> The relation {code}R = Patch(S, Diff(R, S)){code} always holds. {{Diff}} 
> computes the difference between R and S, and when {{Patch}} applies that 
> difference to {{S}} you get {{R}}.
> {{Patch}} and {{Diff}} are intended by be generalizations of set operators. 
> If there are only additions, i.e. {{R}} is a superset of {{S}}, then you can 
> substitute {{Minus}} for {{Diff}}, and {{Union}} for {{Patch}}:
> {code}
> R = Union(S, Minus(R, S))
> {code}
> So, how do these relational operators solve our original problem?
> An {{INSERT}} statement is assignment to a relation of itself union some new 
> rows. ({{INSERT INTO R SELECT * FROM P}} is {{R &larr; R UNION P}}).
> A {{MERGE}} statement is similar to {{INSERT}} but allows updates.
> The {{MERGE}} query above is represented as {code}EmpsByDeptno &larr; 
> Patch(EmpsByDeptno, Diff(NewEmpsByDeptno, EmpsByDeptno)){code}
> Lastly we need relational transformation rules to optimize the expressions 
> into per-key updates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (CALCITE-4707) Optimize incremental maintenance of materialized views

Reply via email to