dengzhhu653 commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3581136612
I think we can make `aggrStatsUseDB` simpler:
1. Run the SQL below, batched by part names:
```java
String query = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
    + "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), "
    + "min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
    + "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), "
    + "max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
    + "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
    + "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), "
    + "sum(\"NUM_TRUES\"), sum(\"NUM_FALSES\"), "
    // The following columns are used to estimate a partitioned table's NDV
    // from the partitions' NDVs when useDensityFunctionForNDVEstimation = true.
    // Global NDVs cannot be derived exactly from partition NDVs, because the
    // domains of a column's values in two partitions can overlap. If there is
    // no overlap, the global NDV is the sum of the partition NDVs (the upper
    // bound). If there is overlap, the global NDV can be anywhere between the
    // sum of the partition NDVs (no overlap) and the largest single partition
    // NDV (the domain of the column's values in every other partition is a
    // subset of that partition's domain -- the lower bound). Under a
    // uniform-distribution assumption, we can roughly estimate the global NDV
    // from the min/max values, and we keep the estimate sensible by clamping
    // it between the upper bound (computed by "sum(\"NUM_DISTINCTS\")")
    // and the lower bound (computed by "max(\"NUM_DISTINCTS\")").
    + "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal)),"
    + "sum((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
    + "sum((cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
    + "count(" + PARTITIONS + ".\"PART_ID\"),"
    + "sum(\"NUM_DISTINCTS\")"
    + " from " + PART_COL_STATS
    + " inner join " + PARTITIONS + " on " + PART_COL_STATS + ".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
    + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
    + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + ".\"DB_ID\""
    + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ?"
    + " and \"COLUMN_NAME\" in (%1$s) and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
    + " and \"ENGINE\" = ?"
    + " group by \"COLUMN_NAME\", \"COLUMN_TYPE\"";
```
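The clamped, density-based NDV estimate described in the comment above can be sketched as follows. This is a simplified illustration of the idea, not the actual Hive implementation; the class and method names are hypothetical:

```java
// Sketch of the density-based global NDV estimate described above
// (hypothetical helper, not the actual Hive code).
public class NdvEstimator {

  /**
   * Estimate the global NDV of a long column from per-partition stats,
   * clamping the density-based estimate between the lower bound
   * (max of partition NDVs) and the upper bound (sum of partition NDVs).
   *
   * @param globalLow  min(LONG_LOW_VALUE) over all partitions
   * @param globalHigh max(LONG_HIGH_VALUE) over all partitions
   * @param densitySum sum((high - low) / ndv) over all partitions
   * @param partCount  number of partitions with stats for this column
   * @param ndvSum     sum(NUM_DISTINCTS), the upper bound
   * @param ndvMax     max(NUM_DISTINCTS), the lower bound
   */
  public static long estimateNdv(long globalLow, long globalHigh,
      double densitySum, long partCount, long ndvSum, long ndvMax) {
    if (densitySum <= 0 || partCount <= 0) {
      return ndvSum; // no usable density info: fall back to the upper bound
    }
    // Average "range per distinct value" across partitions.
    double avgDensity = densitySum / partCount;
    long estimate = (long) ((globalHigh - globalLow) / avgDensity);
    // Keep the estimate within the provable bounds.
    return Math.max(ndvMax, Math.min(estimate, ndvSum));
  }
}
```

For two disjoint partitions covering [0, 50] and [50, 100] with 50 distinct values each, the estimate approaches the sum of the partition NDVs; for two fully overlapping partitions it collapses toward the max.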
2. For each batch, merge the returned rows as we do in this PR:
```java
try {
  List<Object[]> list = Batchable.runBatched(batchSize, partNames, b);
  Map<String, List<Object[]>> colSubList = columnWiseSubList(list);
  for (Map.Entry<String, List<Object[]>> entry : colSubList.entrySet()) {
    colStats.add(columnStatisticsObjWithAdjustedNDV(entry.getValue(), 0,
        useDensityFunctionForNDVEstimation, ndvTuner));
    Deadline.checkTimeout();
  }
} finally {
  b.closeAllQueries();
}
```
3. For each `ColumnStatisticsObj` obtained in step 2, replace each `sum` with `sum * partNames.size() / count(PARTITIONS + ".\"PART_ID\")` for that column.
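The adjustment in step 3 amounts to scaling a sum aggregated over the partitions that actually have stats for the column up to the full partition list. A minimal sketch of that arithmetic, assuming the covered partitions are representative (the class and method names are hypothetical):

```java
// Sketch of the step-3 extrapolation (hypothetical helper, not Hive code).
public class SumExtrapolator {

  /**
   * Scale a sum computed over the partitions that have column stats
   * up to the full partition list.
   *
   * @param sum          e.g. sum(NUM_NULLS) over the covered partitions
   * @param totalParts   partNames.size()
   * @param coveredParts count(PARTITIONS."PART_ID") returned for this column
   */
  public static long extrapolateSum(long sum, int totalParts, long coveredParts) {
    if (coveredParts <= 0) {
      return sum; // nothing to extrapolate from
    }
    return sum * totalParts / coveredParts;
  }
}
```

For example, if only 5 of 10 partitions have stats and their null count sums to 30, the extrapolated global null count is 60.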
This keeps `aggrStatsUseDB` simple and clear, and avoids sending unnecessary queries back and forth to the backing DB.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]