Re: Review Request 68261: HIVE-20332

Jesús Camacho Rodríguez Sat, 11 Aug 2018 21:00:20 -0700


> On Aug. 11, 2018, 7:45 p.m., Ashutosh Chauhan wrote:
> > Isn't incremental rebuild always cheaper for Project-Filter-Join MVs since 
> > they are always insert only? If so, we don't need cost based decision 
> > there. 
> > Also can you remind an  example for a MV containing aggregate where 
> > incremental rebuild via merge can be costlier?
> 
> Jesús Camacho Rodríguez wrote:
>     bq. Isn't incremental rebuild always cheaper for Project-Filter-Join MVs 
> since they are always insert only?
>     Yes, it will always be cheaper.
>     
>     bq. If so, we don't need cost based decision there. 
>     I just thought we preferred to make rewriting decisions cost-based 
> instead of using Hep.
>     
>     bq.Also can you remind an  example for a MV containing aggregate where 
> incremental rebuild via merge can be costlier?
>     When there are many new rows and NDV for grouping columns is high: GBy 
> does not reduce the number of rows and MERGE may end up doing a lot of work 
> with OUTER JOIN + INSERT/UPDATE.
>     
>     
>     We can use HepPlanner for incremental rebuild (it needs a minor extension 
> in Calcite and it should mostly work). Then if a rewriting is produced, 1) 
> for Project-Filter-Join MVs we always use it, and 2) for 
> Project-Filter-Join-Aggregate MVs make use of the heuristic.
>     However, note that we will still need to introduce a parameter to be able 
> to tune the heuristic, right?
>     If that is the case, we may introduce Hep for Project-Filter-Join MVs in 
> a follow-up?
> 
> Ashutosh Chauhan wrote:
>     From changes in q.out looks like before this patch rewriting wasn't 
> trigerred even for PFJ cases. Why would that be the case? In those cases 
> there are 2 candidate plans: one for full rebuild + onverwrite  and another 
> for full build with additional predicate on writeId + insert into. This 
> Second plan should be cheaper because of additional predicates. Why didn't we 
> pick that before this patch?
> 
> Jesús Camacho Rodríguez wrote:
>     The incremental rebuild works in two steps: 1) produce the partial 
> rewriting using the MV, and 2) transform rewriting into INSERT/MERGE 
> depending on whether the MV constains Aggregate or not. The costing is done 
> over the partial rewriting. That is Union(MV contents, PFJ of new data), and 
> in the case of containing Aggregate it is Agg(Union(MV contents, PFJA of new 
> data)).
>     
>     The cost of the union input using the MV is already reduced using 
> heuristics (we favour plans containing materialized views). However, the 
> other input to the union is cost as usual. In both cases (with and without 
> Aggregate), we may end up overestimating number of rows coming through that 
> input. If we estimate Filter condition over ROWID almost did not reduce input 
> number of rows, then it is easy to estimate that the Union rewriting will be 
> more expensive as new operators in the tree (e.g. additional Project to 
> remove that ROWID column or separate Filter operator for ROWID) will add to 
> the total cost because they need to process those rows.
>     
>     Without this patch, here are the two plans for the simple mv that you 
> mentioned (ignore cpu cost as that is only taken into account in case of draw 
> for the cardinality):
>     - FULL REBUILD: 
>     HiveProject(key=[$0], value=[$1])
>       HiveFilter(subset=[rel#2044:Subset#1.HIVE.[]], 
> condition=[AND(>(CAST($0):DOUBLE, 200), <(CAST($0):DOUBLE, 250))])
>         HiveTableScan(subset=[rel#2042:Subset#0.HIVE.[]], table=[[default, 
> src_txn]], table:alias=[src_txn])
>     Total cost: {751.5 rows, 1253.5 cpu, 0.0 io}
>     
>     - PARTIAL REWRITING (INC REBUILD):
>     HiveUnion(all=[true])
>       HiveProject(subset=[rel#2071:Subset#6.HIVE.[]], key=[$0], value=[$1])
>         HiveFilter(subset=[rel#2069:Subset#5.HIVE.[]], 
> condition=[AND(>(CAST($0):DOUBLE, 200), <(CAST($0):DOUBLE, 250))])
>           HiveFilter(subset=[rel#2067:Subset#4.HIVE.[]], condition=[<(1, 
> $4.writeid)])
>             HiveTableScan(subset=[rel#2042:Subset#0.HIVE.[]], 
> table=[[default, src_txn]], table:alias=[src_txn])
>       HiveProject(subset=[rel#2074:Subset#8.HIVE.[]], key=[$1], value=[$0])
>         HiveTableScan(subset=[rel#2072:Subset#7.HIVE.[]], table=[[default, 
> partition_mv_1]], table:alias=[default.partition_mv_1])
>     Total cost: {876.7522762499999 rows, 1378.75283625 cpu, 0.0 io}
>     
>     (Btw, I can enable FilterMerge rule in the same loop as the MV rewriting, 
> but that will still not change outcome in many cases -Project for ROWID will 
> still add overhead- and will add to the optimization time).


The second one (it was reformatted):

HiveUnion(all=[true])
  HiveProject(subset=[rel#2071:Subset#6.HIVE.[]], key=[$0], value=[$1])
    HiveFilter(subset=[rel#2069:Subset#5.HIVE.[]], 
condition=[AND(>(CAST($0):DOUBLE, 200), <(CAST($0):DOUBLE, 250))])
      HiveFilter(subset=[rel#2067:Subset#4.HIVE.[]], condition=[<(1, 
$4.writeid)])
        HiveTableScan(subset=[rel#2042:Subset#0.HIVE.[]], table=[[default, 
src_txn]], table:alias=[src_txn])
  HiveProject(subset=[rel#2074:Subset#8.HIVE.[]], key=[$1], value=[$0])
    HiveTableScan(subset=[rel#2072:Subset#7.HIVE.[]], table=[[default, 
partition_mv_1]], table:alias=[default.partition_mv_1])


- Jesús


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68261/#review207113
-----------------------------------------------------------


On Aug. 8, 2018, 3:39 p.m., Jesús Camacho Rodríguez wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68261/
> -----------------------------------------------------------
> 
> (Updated Aug. 8, 2018, 3:39 p.m.)
> 
> 
> Review request for hive and Ashutosh Chauhan.
> 
> 
> Bugs: HIVE-20332
>     https://issues.apache.org/jira/browse/HIVE-20332
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-20332
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 
> 5bdcac88d0015d2410da050524e6697a22d83eb9 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveDefaultRelMetadataProvider.java
>  635d27e723dc1d260574723296f3484c26106a9c 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveMaterializedViewsRelMetadataProvider.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
>  43f8508ffbf4ba3cc46016e1d300d6ca9c2e8ccb 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdCumulativeCost.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdDistinctRowCount.java
>  80b939a9f65142baa149b79460b753ddf469aacf 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdSelectivity.java
>  575902d78de2a7f95585c23a3c2fc03b9ce89478 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdSize.java
>  97097381d9619e67bcab8a268d571d2a392485b3 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdUniqueKeys.java
>  3bf62c535cec1e7a3eac43f0ce40879dbfc89799 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java 
> 361f150193a155d45eb64266f88eb88f0a881ad3 
>   ql/src/test/results/clientpositive/llap/materialized_view_partitioned.q.out 
> b12df11a98e55c00c8b77e8292666373f3509364 
>   ql/src/test/results/clientpositive/llap/materialized_view_rebuild.q.out 
> 4d37d82b6e1f3d4ab8b76c391fa94176356093c2 
> 
> 
> Diff: https://reviews.apache.org/r/68261/diff/2/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Jesús Camacho Rodríguez
> 
>

Re: Review Request 68261: HIVE-20332

Reply via email to