[
https://issues.apache.org/jira/browse/CALCITE-4707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17392688#comment-17392688
]
Jesus Camacho Rodriguez commented on CALCITE-4707:
--------------------------------------------------
Thanks [~julianhyde]. Fwiw a rather similar approach to what is described above
is implemented in Hive using Calcite: It relies on the snapshot information of
the tables referenced by the materialized view to generate the diffs. It
proceeds in two steps: 1) Creating the union rewriting referenced above by
relying on Calcite's MV rewriting algorithm, and 2) Trigger a Calcite rule to
transform the union rewriting into an insert/merge statement. Slides 19-24
[here|https://www.slideshare.net/Hadoop_Summit/accelerating-query-processing-with-materialized-views-in-apache-hive]
describe it with an example. Unfortunately, the rules for step 2 are quite
specific to Hive (see .\*incremental.\* rules
[here|https://github.com/apache/hive/tree/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/views])
since they generate the internal execution plan for a MERGE statement, i.e.,
what MERGE translates into in Hive (outer join + multi-insert statement) rather
than a MERGE operator. Maybe some of the code in Hive could be reused here
rather than reimplemented (it would definitely be interesting if Hive could
adopt the Calcite implementation). Cc [~kkasa], since you have been working in
this space in Hive.
> Optimize incremental maintenance of materialized views
> ------------------------------------------------------
>
> Key: CALCITE-4707
> URL: https://issues.apache.org/jira/browse/CALCITE-4707
> Project: Calcite
> Issue Type: Bug
> Reporter: Julian Hyde
> Priority: Major
>
> Optimize incremental maintenance of materialized views, when we know what DML
> has occurred since the last build.
> The goal here is to develop an algebraic approach (i.e. new relational
> operators and transformation rules) that will generate an optimal query for
> maintaining the view.
> Consider a materialized view
> {code}
> CREATE TABLE EmpsByDeptno AS
> SELECT deptno, COUNT(*) AS c, SUM(sal) AS ss
> FROM Emps
> GROUP BY deptno;
> {code}
> We built it at time T1, when the value of {{Emps}} was {{Emps1}}, and it had
> the value {{EmpsByDeptno1}}.
> It is now T2, and since then, new employees have been created in the
> {{NewEmps}} table. Thus {{Emps2 = Emps1 UNION ALL NewEmps}}. We could build a
> new MV,
> {code}
> CREATE TABLE EmpsByDeptno2 AS
> SELECT deptno, COUNT(*) AS c, SUM(sal) AS ss
> FROM (SELECT * FROM Emps1
> UNION ALL
> SELECT * FROM NewEmps)
> GROUP BY deptno;
> {code}but we would prefer to generate a DML command to modify
> {{EmpsByDeptno}} in place:{code}
> MERGE INTO EmpsByDeptno AS e
> USING (SELECT deptno, COUNT(*) AS c, SUM(sal) AS ss
> FROM NewEmps) AS de
> ON de.deptno = e.deptno
> WHEN MATCHED THEN
> UPDATE SET c = c + de.c, ss = ss + de.ss
> WHEN NOT MATCHED THEN
> INSERT (deptno, c, ss)
> {code}
> We propose two new relational operators:
> * {{Diff(S, R)}} computes the difference between two relations. It has a
> same schema as R and S but with an additional column {{delta}}. Rows in S but
> not in R are called insertions, and have delta=1. Rows that are in R but not
> in S are called deletions, and have delta=-1.
> * {{Patch(R, P)}} applies a difference {{P}} to a relation. {{P}} has the
> same schema as {{R}} but with an additional delta column. The output is the
> rows of R, minus the rows in {{P}} that have delta=-1, union the rows in
> {{P}} that have delta=1.
> The relational operators are named by analogy with the UNIX utilities
> {{diff}} and {{patch}}. (But {{Diff}}'s arguments are reversed compared to
> {{diff}}.) Consider:
> {code}
> grep r /usr/share/dict/words > r.txt
> grep s /usr/share/dict/words > s.txt
> diff r.txt s.txt > p.patch
> patch -p1 r.txt < p.patch
> cmp r.txt s.txt. # r and s are now identical
> {code}
> The relation {code}R = Patch(S, Diff(R, S)){code} always holds. {{Diff}}
> computes the difference between R and S, and when {{Patch}} applies that
> difference to {{S}} you get {{R}}.
> {{Patch}} and {{Diff}} are intended by be generalizations of set operators.
> If there are only additions, i.e. {{R}} is a superset of {{S}}, then you can
> substitute {{Minus}} for {{Diff}}, and {{Union}} for {{Patch}}:
> {code}
> R = Union(S, Minus(R, S))
> {code}
> So, how do these relational operators solve our original problem?
> An {{INSERT}} statement is assignment to a relation of itself union some new
> rows. ({{INSERT INTO R SELECT * FROM P}} is {{R ← R UNION P}}).
> A {{MERGE}} statement is similar to {{INSERT}} but allows updates.
> The {{MERGE}} query above is represented as {code}EmpsByDeptno ←
> Patch(EmpsByDeptno, Diff(NewEmpsByDeptno, EmpsByDeptno)){code}
> Lastly we need relational transformation rules to optimize the expressions
> into per-key updates.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)