[ 
https://issues.apache.org/jira/browse/CALCITE-4707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17393390#comment-17393390
 ] 

Julian Hyde edited comment on CALCITE-4707 at 8/4/21, 6:42 PM:
---------------------------------------------------------------

Thanks [~jcamachorodriguez]! It's helpful to know how people have solved the 
same problem. Maybe I'll be able to use it - or it will be a resource to others 
solving this problem.

By the way, delta records - the output from the {{Diff}} operator - are 
directly applicable to Hive, since a Hive table is a union of partitions, some 
of which are regular and some of which are deltas. You might find it 
interesting to look into relational algebra as extended with {{Diff}} and 
{{Patch}} operators.


was (Author: julianhyde):
Thanks [~jcamachorodriguez]! It's helpful to know how people have solved the 
same problem. Maybe I'll be able to use it - or it will be a resource to others 
solving this problem.

> Optimize incremental maintenance of materialized views
> ------------------------------------------------------
>
>                 Key: CALCITE-4707
>                 URL: https://issues.apache.org/jira/browse/CALCITE-4707
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Julian Hyde
>            Priority: Major
>
> Optimize incremental maintenance of materialized views, when we know what DML 
> has occurred since the last build.
> The goal here is to develop an algebraic approach (i.e. new relational 
> operators and transformation rules) that will generate an optimal query for 
> maintaining the view.
> Consider a materialized view
> {code}
> CREATE TABLE EmpsByDeptno AS
>   SELECT deptno, COUNT(*) AS c, SUM(sal) AS ss
>   FROM Emps
>   GROUP BY deptno;
> {code}
> We built it at time T1, when the value of {{Emps}} was {{Emps1}}, and it had 
> the value {{EmpsByDeptno1}}.
> It is now T2, and since then, new employees have been created in the 
> {{NewEmps}} table. Thus {{Emps2 = Emps1 UNION ALL NewEmps}}. We could build a 
> new MV,
> {code}
> CREATE TABLE EmpsByDeptno2 AS
>   SELECT deptno, COUNT(*) AS c, SUM(sal) AS ss
>   FROM (SELECT * FROM Emps1
>       UNION ALL
>       SELECT * FROM NewEmps)
>   GROUP BY deptno;
> {code}but we would prefer to generate a DML command to modify 
> {{EmpsByDeptno}} in place:{code}
> MERGE INTO EmpsByDeptno AS e
>   USING (SELECT deptno, COUNT(*) AS c, SUM(sal) AS ss
>       FROM NewEmps) AS de
>   ON de.deptno = e.deptno
>   WHEN MATCHED THEN
>     UPDATE SET c = c + de.c, ss = ss + de.ss
>   WHEN NOT MATCHED THEN
>     INSERT (deptno, c, ss)
> {code}
> We propose two new relational operators:
>  * {{Diff(S, R)}} computes the difference between two relations. It has a 
> same schema as R and S but with an additional column {{delta}}. Rows in S but 
> not in R are called insertions, and have delta=1. Rows that are in R but not 
> in S are called deletions, and have delta=-1.
>  * {{Patch(R, P)}} applies a difference {{P}} to a relation. {{P}} has the 
> same schema as {{R}} but with an additional delta column. The output is the 
> rows of R, minus the rows in {{P}} that have delta=-1, union the rows in 
> {{P}} that have delta=1.
> The relational operators are named by analogy with the UNIX utilities 
> {{diff}} and {{patch}}. (But {{Diff}}'s arguments are reversed compared to 
> {{diff}}.) Consider:
> {code}
> grep r /usr/share/dict/words  > r.txt
> grep s /usr/share/dict/words  > s.txt
> diff r.txt s.txt > p.patch
> patch -p1 r.txt < p.patch
> cmp r.txt s.txt.  # r and s are now identical
> {code}
> The relation {code}R = Patch(S, Diff(R, S)){code} always holds. {{Diff}} 
> computes the difference between R and S, and when {{Patch}} applies that 
> difference to {{S}} you get {{R}}.
> {{Patch}} and {{Diff}} are intended by be generalizations of set operators. 
> If there are only additions, i.e. {{R}} is a superset of {{S}}, then you can 
> substitute {{Minus}} for {{Diff}}, and {{Union}} for {{Patch}}:
> {code}
> R = Union(S, Minus(R, S))
> {code}
> So, how do these relational operators solve our original problem?
> An {{INSERT}} statement is assignment to a relation of itself union some new 
> rows. ({{INSERT INTO R SELECT * FROM P}} is {{R &larr; R UNION P}}).
> A {{MERGE}} statement is similar to {{INSERT}} but allows updates.
> The {{MERGE}} query above is represented as {code}EmpsByDeptno &larr; 
> Patch(EmpsByDeptno, Diff(NewEmpsByDeptno, EmpsByDeptno)){code}
> Lastly we need relational transformation rules to optimize the expressions 
> into per-key updates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to