kgyrtkirk commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r500976202



##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, 
NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the 
right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right 
branch.
+   * The join has one-to-many relationship since UDTF can generate multiple 
rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) 
and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule 
implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = 
lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but 
actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = 
parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+      final Statistics udtfStats = 
parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+      final double factor = (double) udtfStats.getNumRows() / (double) 
selectStats.getNumRows();

Review comment:
       I know `selectStats.getNumRows()` should not be zero - but just in 
case... could you also add the resulting logic as `StatsUtils` or something 
like that? 

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, 
NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the 
right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right 
branch.
+   * The join has one-to-many relationship since UDTF can generate multiple 
rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) 
and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule 
implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = 
lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but 
actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = 
parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+      final Statistics udtfStats = 
parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+      final double factor = (double) udtfStats.getNumRows() / (double) 
selectStats.getNumRows();
+      final long selectDataSize = 
StatsUtils.safeMult(selectStats.getDataSize(), factor);
+      final long dataSize = StatsUtils.safeAdd(selectDataSize, 
udtfStats.getDataSize());
+      Statistics joinedStats = new Statistics(udtfStats.getNumRows(), 
dataSize, 0, 0);
+
+      if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) {
+        final Map<String, ExprNodeDesc> columnExprMap = lop.getColumnExprMap();
+        final RowSchema schema = lop.getSchema();
+
+        joinedStats.updateColumnStatsState(selectStats.getColumnStatsState());
+        final List<ColStatistics> selectColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, selectStats, columnExprMap, 
schema);
+        joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor));
+
+        joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState());
+        final List<ColStatistics> udtfColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, 
schema);
+        joinedStats.addToColumnStats(udtfColStats);
+
+        joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), 
joinedStats, lop);
+        lop.setStatistics(joinedStats);
+
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("[0] STATS-" + lop.toString() + ": " + 
joinedStats.extendedToString());
+        }

Review comment:
       this seems to be a common expression in both branches of the `if` - 
could you move it outside?

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, 
NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *          /     \
+   *    [Select]  [Select]
+   *        |        |
+   *        |     [UDTF]
+   *        \       /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the 
right branch processes UDTF.
+   * And then LVJ joins a row from the left branch with rows from the right 
branch.
+   * The join has one-to-many relationship since UDTF can generate multiple 
rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) 
and sums up the both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule 
implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+                          Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = 
lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but 
actually has "
+                + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = 
parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics();
+      final Statistics udtfStats = 
parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics();
+
+      final double factor = (double) udtfStats.getNumRows() / (double) 
selectStats.getNumRows();
+      final long selectDataSize = 
StatsUtils.safeMult(selectStats.getDataSize(), factor);
+      final long dataSize = StatsUtils.safeAdd(selectDataSize, 
udtfStats.getDataSize());
+      Statistics joinedStats = new Statistics(udtfStats.getNumRows(), 
dataSize, 0, 0);
+
+      if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) {
+        final Map<String, ExprNodeDesc> columnExprMap = lop.getColumnExprMap();
+        final RowSchema schema = lop.getSchema();
+
+        joinedStats.updateColumnStatsState(selectStats.getColumnStatsState());
+        final List<ColStatistics> selectColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, selectStats, columnExprMap, 
schema);
+        joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor));
+
+        joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState());
+        final List<ColStatistics> udtfColStats = StatsUtils
+                .getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, 
schema);
+        joinedStats.addToColumnStats(udtfColStats);
+
+        joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), 
joinedStats, lop);
+        lop.setStatistics(joinedStats);
+
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("[0] STATS-" + lop.toString() + ": " + 
joinedStats.extendedToString());
+        }
+      } else {
+        joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), 
joinedStats, lop);
+        lop.setStatistics(joinedStats);
+
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("[1] STATS-" + lop.toString() + ": " + 
joinedStats.extendedToString());
+        }
+      }
+      return null;
+    }
+
+    private List<ColStatistics> multiplyColStats(List<ColStatistics> 
colStatistics, double factor) {
+      for (ColStatistics colStats : colStatistics) {
+        colStats.setNumFalses(StatsUtils.safeMult(colStats.getNumFalses(), 
factor));
+        colStats.setNumTrues(StatsUtils.safeMult(colStats.getNumTrues(), 
factor));
+        colStats.setNumNulls(StatsUtils.safeMult(colStats.getNumNulls(), 
factor));
+        // When factor > 1, the same records are duplicated and countDistinct 
never changes.
+        if (factor < 1.0) {
+          
colStats.setCountDistint(StatsUtils.safeMult(colStats.getCountDistint(), 
factor));

Review comment:
       I think we should make sure that NDV is at least 1 in case numrows is >0

##########
File path: 
ql/src/test/results/clientpositive/llap/annotate_stats_lateral_view_join.q.out
##########
@@ -503,14 +503,14 @@ STAGE PLANS:
                             Statistics: Num rows: 1 Data size: 376 Basic 
stats: COMPLETE Column stats: COMPLETE
                             Lateral View Join Operator
                               outputColumnNames: _col0, _col1, _col5, _col6
-                              Statistics: Num rows: 0 Data size: 24 Basic 
stats: PARTIAL Column stats: NONE
+                              Statistics: Num rows: 0 Data size: 24 Basic 
stats: PARTIAL Column stats: COMPLETE

Review comment:
       definetly - I don't think it will be `0` in reality!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org

Reply via email to