[
https://issues.apache.org/jira/browse/HIVE-23684?focusedWorklogId=525252&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-525252
]
ASF GitHub Bot logged work on HIVE-23684:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 16/Dec/20 20:38
Start Date: 16/Dec/20 20:38
Worklog Time Spent: 10m
Work Description: jcamachor commented on a change in pull request #1786:
URL: https://github.com/apache/hive/pull/1786#discussion_r544603006
##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2686,6 +2686,8 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Estimate statistics in absence of statistics."),
     HIVE_STATS_NDV_ESTIMATE_PERC("hive.stats.ndv.estimate.percent", (float)20,
         "This many percentage of rows will be estimated as count distinct in absence of statistics."),
+    HIVE_STATS_JOIN_NDV_READJUSTMENT("hive.stats.join.ndv.readjustment", false,
+        "Setting this to true will make Hive use Calcite to adjust estimatation for ndv after join."),
Review comment:
`Setting this to true will make Hive use Calcite`
Instead of 'Calcite' logic, could you maybe mention the kind of estimation it does? A one-liner is enough, since this is a config description and does not need to be long.
##########
File path:
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##########
@@ -2546,19 +2547,25 @@ private void updateColStats(HiveConf conf, Statistics stats, long leftUnmatchedR
     for (ColStatistics cs : colStats) {
       colNameStatsAvailable.add(cs.getColumnName());
       int pos = jop.getConf().getReversedExprs().get(cs.getColumnName());
-      long oldRowCount = rowCountParents.get(pos);
-      double ratio = (double) newNumRows / (double) oldRowCount;
       long oldDV = cs.getCountDistint();
+
+      boolean useCalciteForNdvReadjustment = HiveConf.getBoolVar(conf, ConfVars.HIVE_STATS_JOIN_NDV_READJUSTMENT);
       long newDV = oldDV;
+      if (useCalciteForNdvReadjustment) {
+        newDV = RelMdUtil.numDistinctVals(oldDV * 1.0, newNumRows * 1.0).longValue();
Review comment:
Can `RelMdUtil.numDistinctVals` return null? Just making sure we do not
need a null check.
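If a null return is possible, one null-safe pattern is to fall back to the old NDV. The sketch below is illustrative only: `readjustNdv` and its `BiFunction` parameter are hypothetical names standing in for a call to an estimator with boxed `Double` arguments and result (as `RelMdUtil.numDistinctVals` has); it is not Hive or Calcite source.

```java
import java.util.function.BiFunction;

public class NullSafeNdv {
    // Hypothetical helper: applies a boxed-Double NDV estimator (a stand-in
    // for a call like RelMdUtil.numDistinctVals) and keeps the old NDV
    // whenever the estimator returns null.
    static long readjustNdv(BiFunction<Double, Double, Double> estimator,
                            long oldDv, long newNumRows) {
        Double estimate = estimator.apply(oldDv * 1.0, newNumRows * 1.0);
        return estimate != null ? estimate.longValue() : oldDv;
    }

    public static void main(String[] args) {
        // Simulate a null return from the estimator: the old NDV survives.
        System.out.println(readjustNdv((d, n) -> null, 420000L, 24948000L));
        // A non-null estimate is truncated to long, as longValue() would do.
        System.out.println(readjustNdv((d, n) -> 123.9, 420000L, 24948000L));
    }
}
```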
##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2686,6 +2686,8 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Estimate statistics in absence of statistics."),
     HIVE_STATS_NDV_ESTIMATE_PERC("hive.stats.ndv.estimate.percent", (float)20,
         "This many percentage of rows will be estimated as count distinct in absence of statistics."),
+    HIVE_STATS_JOIN_NDV_READJUSTMENT("hive.stats.join.ndv.readjustment", false,
+        "Setting this to true will make Hive use Calcite to adjust estimatation for ndv after join."),
Review comment:
typo: `estimatation`
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 525252)
Time Spent: 20m (was: 10m)
> Large underestimation in NDV stats when input and join cardinality ratio is big
> --------------------------------------------------------------------------------
>
> Key: HIVE-23684
> URL: https://issues.apache.org/jira/browse/HIVE-23684
> Project: Hive
> Issue Type: Bug
> Reporter: Stamatis Zampetakis
> Assignee: Stamatis Zampetakis
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Large underestimations of NDV values may occur after a join operation since
> the current logic will decrease the original NDV values proportionally.
> The [code|https://github.com/apache/hive/blob/1271d08a3c51c021fa710449f8748b8cdb12b70f/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L2558] compares the number of rows of each relation before the join with the number of rows after the join and extracts a ratio for each side. Based on this ratio it adapts (reduces) the NDV accordingly.
> Consider for instance the following query:
> {code:sql}
> select inv_warehouse_sk
> , inv_item_sk
> , stddev_samp(inv_quantity_on_hand) stdev
> , avg(inv_quantity_on_hand) mean
> from inventory
> , date_dim
> where inv_date_sk = d_date_sk
> and d_year = 1999
> and d_moy = 2
> group by inv_warehouse_sk, inv_item_sk;
> {code}
> For the sake of the discussion, I outline below some relevant stats (from
> TPCDS30tb):
> T(inventory) = 1627857000
> T(date_dim) = 73049
> T(inventory JOIN date_dim[d_year=1999 AND d_moy=2]) = 24948000
> V(inventory, inv_date_sk) = 261
> V(inventory, inv_item_sk) = 420000
> V(inventory, inv_warehouse_sk) = 27
> V(date_dim, d_date_sk) = 73049
> For instance, in this query the join between inventory and date_dim has ~24M
> rows while inventory has ~1.6B, so the NDV of the columns coming from
> inventory is reduced by a factor of ~65 and we end up with
> V(JOIN, inv_item_sk) = ~6K while the real one is 231000.
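The two estimates can be reproduced with the numbers above. This is a hedged, self-contained sketch, not Hive or Calcite source: `numDistinctVals` here implements the standard "expected distinct values" formula d * (1 - (1 - 1/d)^n), which is the estimation commonly attributed to Calcite's `RelMdUtil.numDistinctVals` (an assumption); the proportional line mirrors the ratio logic quoted in the diff.

```java
public class NdvEstimateDemo {
    // Expected number of distinct values when drawing numSelected rows from a
    // domain of domainSize values: d * (1 - (1 - 1/d)^n), computed via
    // exp/log1p for numerical stability. Assumed to match the kind of
    // estimation Calcite's RelMdUtil.numDistinctVals performs.
    static double numDistinctVals(double domainSize, double numSelected) {
        if (domainSize <= 0) {
            return 0;
        }
        return domainSize
            * (1.0 - Math.exp(numSelected * Math.log1p(-1.0 / domainSize)));
    }

    public static void main(String[] args) {
        double oldRows = 1627857000d; // T(inventory)
        double newRows = 24948000d;   // T(inventory JOIN date_dim[...])
        double oldNdv  = 420000d;     // V(inventory, inv_item_sk)

        // Current proportional logic: scale the NDV by the row-count ratio.
        long proportional = (long) (oldNdv * (newRows / oldRows));
        // Formula-based estimate: with rows >> NDV, almost the whole domain
        // is expected to survive the join.
        long formulaBased = (long) numDistinctVals(oldNdv, newRows);

        System.out.println(proportional); // 6436 — the ~6K underestimate
        System.out.println(formulaBased); // 420000 (real NDV is 231000)
    }
}
```

Neither estimate hits the real value of 231000, but the proportional one is off by a factor of ~36 while the formula-based one is off by less than 2x.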
--
This message was sent by Atlassian Jira
(v8.3.4#803005)