[jira] [Commented] (CALCITE-4707) Optimize incremental maintenance of materialized views

Jesus Camacho Rodriguez (Jira) Tue, 03 Aug 2021 20:27:06 -0700


    [ 
https://issues.apache.org/jira/browse/CALCITE-4707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17392688#comment-17392688
 ]


Jesus Camacho Rodriguez commented on CALCITE-4707:
--------------------------------------------------

Thanks [~julianhyde]. Fwiw a rather similar approach to what is described above 
is implemented in Hive using Calcite: It relies on the snapshot information of 
the tables referenced by the materialized view to generate the diffs. It 
proceeds in two steps: 1) Creating the union rewriting referenced above by 
relying on Calcite's MV rewriting algorithm, and 2) Trigger a Calcite rule to 
transform the union rewriting into an insert/merge statement. Slides 19-24 
[here|https://www.slideshare.net/Hadoop_Summit/accelerating-query-processing-with-materialized-views-in-apache-hive]
 describe it with an example. Unfortunately, the rules for step 2 are quite 
specific to Hive (see .\*incremental.\* rules 
[here|https://github.com/apache/hive/tree/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/views])
 since they generate the internal execution plan for a MERGE statement, i.e., 
what MERGE translates into in Hive (outer join + multi-insert statement) rather 
than a MERGE operator. Maybe some of the code in Hive could be reused here 
rather than reimplemented (it would definitely be interesting if Hive could 
adopt the Calcite implementation). Cc [~kkasa], since you have been working in 
this space in Hive.

> Optimize incremental maintenance of materialized views
> ------------------------------------------------------
>
>                 Key: CALCITE-4707
>                 URL: https://issues.apache.org/jira/browse/CALCITE-4707
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Julian Hyde
>            Priority: Major
>
> Optimize incremental maintenance of materialized views, when we know what DML 
> has occurred since the last build.
> The goal here is to develop an algebraic approach (i.e. new relational 
> operators and transformation rules) that will generate an optimal query for 
> maintaining the view.
> Consider a materialized view
> {code}
> CREATE TABLE EmpsByDeptno AS
>   SELECT deptno, COUNT(*) AS c, SUM(sal) AS ss
>   FROM Emps
>   GROUP BY deptno;
> {code}
> We built it at time T1, when the value of {{Emps}} was {{Emps1}}, and it had 
> the value {{EmpsByDeptno1}}.
> It is now T2, and since then, new employees have been created in the 
> {{NewEmps}} table. Thus {{Emps2 = Emps1 UNION ALL NewEmps}}. We could build a 
> new MV,
> {code}
> CREATE TABLE EmpsByDeptno2 AS
>   SELECT deptno, COUNT(*) AS c, SUM(sal) AS ss
>   FROM (SELECT * FROM Emps1
>       UNION ALL
>       SELECT * FROM NewEmps)
>   GROUP BY deptno;
> {code}but we would prefer to generate a DML command to modify 
> {{EmpsByDeptno}} in place:{code}
> MERGE INTO EmpsByDeptno AS e
>   USING (SELECT deptno, COUNT(*) AS c, SUM(sal) AS ss
>       FROM NewEmps) AS de
>   ON de.deptno = e.deptno
>   WHEN MATCHED THEN
>     UPDATE SET c = c + de.c, ss = ss + de.ss
>   WHEN NOT MATCHED THEN
>     INSERT (deptno, c, ss)
> {code}
> We propose two new relational operators:
>  * {{Diff(S, R)}} computes the difference between two relations. It has a 
> same schema as R and S but with an additional column {{delta}}. Rows in S but 
> not in R are called insertions, and have delta=1. Rows that are in R but not 
> in S are called deletions, and have delta=-1.
>  * {{Patch(R, P)}} applies a difference {{P}} to a relation. {{P}} has the 
> same schema as {{R}} but with an additional delta column. The output is the 
> rows of R, minus the rows in {{P}} that have delta=-1, union the rows in 
> {{P}} that have delta=1.
> The relational operators are named by analogy with the UNIX utilities 
> {{diff}} and {{patch}}. (But {{Diff}}'s arguments are reversed compared to 
> {{diff}}.) Consider:
> {code}
> grep r /usr/share/dict/words  > r.txt
> grep s /usr/share/dict/words  > s.txt
> diff r.txt s.txt > p.patch
> patch -p1 r.txt < p.patch
> cmp r.txt s.txt.  # r and s are now identical
> {code}
> The relation {code}R = Patch(S, Diff(R, S)){code} always holds. {{Diff}} 
> computes the difference between R and S, and when {{Patch}} applies that 
> difference to {{S}} you get {{R}}.
> {{Patch}} and {{Diff}} are intended by be generalizations of set operators. 
> If there are only additions, i.e. {{R}} is a superset of {{S}}, then you can 
> substitute {{Minus}} for {{Diff}}, and {{Union}} for {{Patch}}:
> {code}
> R = Union(S, Minus(R, S))
> {code}
> So, how do these relational operators solve our original problem?
> An {{INSERT}} statement is assignment to a relation of itself union some new 
> rows. ({{INSERT INTO R SELECT * FROM P}} is {{R &larr; R UNION P}}).
> A {{MERGE}} statement is similar to {{INSERT}} but allows updates.
> The {{MERGE}} query above is represented as {code}EmpsByDeptno &larr; 
> Patch(EmpsByDeptno, Diff(NewEmpsByDeptno, EmpsByDeptno)){code}
> Lastly we need relational transformation rules to optimize the expressions 
> into per-key updates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (CALCITE-4707) Optimize incremental maintenance of materialized views

Reply via email to