> On Aug. 11, 2018, 7:45 p.m., Ashutosh Chauhan wrote: > > Isn't incremental rebuild always cheaper for Project-Filter-Join MVs since > > they are always insert only? If so, we don't need cost based decision > > there. > > Also can you remind an example for a MV containing aggregate where > > incremental rebuild via merge can be costlier? > > Jesús Camacho Rodríguez wrote: > bq. Isn't incremental rebuild always cheaper for Project-Filter-Join MVs > since they are always insert only? > Yes, it will always be cheaper. > > bq. If so, we don't need cost based decision there. > I just thought we preferred to make rewriting decisions cost-based > instead of using Hep. > > bq.Also can you remind an example for a MV containing aggregate where > incremental rebuild via merge can be costlier? > When there are many new rows and NDV for grouping columns is high: GBy > does not reduce the number of rows and MERGE may end up doing a lot of work > with OUTER JOIN + INSERT/UPDATE. > > > We can use HepPlanner for incremental rebuild (it needs a minor extension > in Calcite and it should mostly work). Then if a rewriting is produced, 1) > for Project-Filter-Join MVs we always use it, and 2) for > Project-Filter-Join-Aggregate MVs make use of the heuristic. > However, note that we will still need to introduce a parameter to be able > to tune the heuristic, right? > If that is the case, we may introduce Hep for Project-Filter-Join MVs in > a follow-up? > > Ashutosh Chauhan wrote: > From changes in q.out looks like before this patch rewriting wasn't > trigerred even for PFJ cases. Why would that be the case? In those cases > there are 2 candidate plans: one for full rebuild + onverwrite and another > for full build with additional predicate on writeId + insert into. This > Second plan should be cheaper because of additional predicates. Why didn't we > pick that before this patch? > > Jesús Camacho Rodríguez wrote: > The incremental rebuild works in two steps: 1) produce the partial > rewriting using the MV, and 2) transform rewriting into INSERT/MERGE > depending on whether the MV constains Aggregate or not. The costing is done > over the partial rewriting. That is Union(MV contents, PFJ of new data), and > in the case of containing Aggregate it is Agg(Union(MV contents, PFJA of new > data)). > > The cost of the union input using the MV is already reduced using > heuristics (we favour plans containing materialized views). However, the > other input to the union is cost as usual. In both cases (with and without > Aggregate), we may end up overestimating number of rows coming through that > input. If we estimate Filter condition over ROWID almost did not reduce input > number of rows, then it is easy to estimate that the Union rewriting will be > more expensive as new operators in the tree (e.g. additional Project to > remove that ROWID column or separate Filter operator for ROWID) will add to > the total cost because they need to process those rows. > > Without this patch, here are the two plans for the simple mv that you > mentioned (ignore cpu cost as that is only taken into account in case of draw > for the cardinality): > - FULL REBUILD: > HiveProject(key=[$0], value=[$1]) > HiveFilter(subset=[rel#2044:Subset#1.HIVE.[]], > condition=[AND(>(CAST($0):DOUBLE, 200), <(CAST($0):DOUBLE, 250))]) > HiveTableScan(subset=[rel#2042:Subset#0.HIVE.[]], table=[[default, > src_txn]], table:alias=[src_txn]) > Total cost: {751.5 rows, 1253.5 cpu, 0.0 io} > > - PARTIAL REWRITING (INC REBUILD): > HiveUnion(all=[true]) > HiveProject(subset=[rel#2071:Subset#6.HIVE.[]], key=[$0], value=[$1]) > HiveFilter(subset=[rel#2069:Subset#5.HIVE.[]], > condition=[AND(>(CAST($0):DOUBLE, 200), <(CAST($0):DOUBLE, 250))]) > HiveFilter(subset=[rel#2067:Subset#4.HIVE.[]], condition=[<(1, > $4.writeid)]) > HiveTableScan(subset=[rel#2042:Subset#0.HIVE.[]], > table=[[default, src_txn]], table:alias=[src_txn]) > HiveProject(subset=[rel#2074:Subset#8.HIVE.[]], key=[$1], value=[$0]) > HiveTableScan(subset=[rel#2072:Subset#7.HIVE.[]], table=[[default, > partition_mv_1]], table:alias=[default.partition_mv_1]) > Total cost: {876.7522762499999 rows, 1378.75283625 cpu, 0.0 io} > > (Btw, I can enable FilterMerge rule in the same loop as the MV rewriting, > but that will still not change outcome in many cases -Project for ROWID will > still add overhead- and will add to the optimization time).
The second one (it was reformatted): HiveUnion(all=[true]) HiveProject(subset=[rel#2071:Subset#6.HIVE.[]], key=[$0], value=[$1]) HiveFilter(subset=[rel#2069:Subset#5.HIVE.[]], condition=[AND(>(CAST($0):DOUBLE, 200), <(CAST($0):DOUBLE, 250))]) HiveFilter(subset=[rel#2067:Subset#4.HIVE.[]], condition=[<(1, $4.writeid)]) HiveTableScan(subset=[rel#2042:Subset#0.HIVE.[]], table=[[default, src_txn]], table:alias=[src_txn]) HiveProject(subset=[rel#2074:Subset#8.HIVE.[]], key=[$1], value=[$0]) HiveTableScan(subset=[rel#2072:Subset#7.HIVE.[]], table=[[default, partition_mv_1]], table:alias=[default.partition_mv_1]) - Jesús ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68261/#review207113 ----------------------------------------------------------- On Aug. 8, 2018, 3:39 p.m., Jesús Camacho Rodríguez wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/68261/ > ----------------------------------------------------------- > > (Updated Aug. 8, 2018, 3:39 p.m.) > > > Review request for hive and Ashutosh Chauhan. > > > Bugs: HIVE-20332 > https://issues.apache.org/jira/browse/HIVE-20332 > > > Repository: hive-git > > > Description > ------- > > HIVE-20332 > > > Diffs > ----- > > common/src/java/org/apache/hadoop/hive/conf/HiveConf.java > 5bdcac88d0015d2410da050524e6697a22d83eb9 > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveDefaultRelMetadataProvider.java > 635d27e723dc1d260574723296f3484c26106a9c > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveMaterializedViewsRelMetadataProvider.java > PRE-CREATION > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java > 43f8508ffbf4ba3cc46016e1d300d6ca9c2e8ccb > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdCumulativeCost.java > PRE-CREATION > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdDistinctRowCount.java > 80b939a9f65142baa149b79460b753ddf469aacf > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdSelectivity.java > 575902d78de2a7f95585c23a3c2fc03b9ce89478 > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdSize.java > 97097381d9619e67bcab8a268d571d2a392485b3 > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdUniqueKeys.java > 3bf62c535cec1e7a3eac43f0ce40879dbfc89799 > ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java > 361f150193a155d45eb64266f88eb88f0a881ad3 > ql/src/test/results/clientpositive/llap/materialized_view_partitioned.q.out > b12df11a98e55c00c8b77e8292666373f3509364 > ql/src/test/results/clientpositive/llap/materialized_view_rebuild.q.out > 4d37d82b6e1f3d4ab8b76c391fa94176356093c2 > > > Diff: https://reviews.apache.org/r/68261/diff/2/ > > > Testing > ------- > > > Thanks, > > Jesús Camacho Rodríguez > >