dengzhhu653 commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3581136612
I think we can make `aggrStatsUseDB` simpler:
1. Run the SQL below, batched by part names:
```java
String query = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
    + "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), "
    + "min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
    + "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), "
    + "max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
    + "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
    + "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), "
    + "sum(\"NUM_TRUES\"), sum(\"NUM_FALSES\"), "
    // The following columns are used to estimate a partitioned table's NDV
    // from the partitions' NDVs when useDensityFunctionForNDVEstimation = true.
    // Global NDVs cannot be derived exactly from partition NDVs, because the
    // domains of a column's values in two partitions can overlap. If there is
    // no overlap, the global NDV is the sum of the partition NDVs (the upper
    // bound). If there is overlap, the global NDV can be anywhere between the
    // sum of the partition NDVs (no overlap) and the largest single partition
    // NDV (the domain of the column's values in every other partition is a
    // subset of that partition's domain -- the lower bound). Under a
    // uniform-distribution assumption, we can roughly estimate the global NDV
    // from the min/max values, and we keep the estimate sensible by clamping
    // it between the upper bound (computed by "sum(\"NUM_DISTINCTS\")")
    // and the lower bound (computed by "max(\"NUM_DISTINCTS\")").
    + "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal)),"
    + "sum((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
    + "sum((cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
    + "count(" + PARTITIONS + ".\"PART_ID\"),"
    + "sum(\"NUM_DISTINCTS\")"
    + " from " + PART_COL_STATS
    + " inner join " + PARTITIONS + " on " + PART_COL_STATS + ".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
    + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
    + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + ".\"DB_ID\""
    + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ?"
    + " and \"COLUMN_NAME\" in (%1$s) and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
    + " and \"ENGINE\" = ?"
    + " group by \"COLUMN_NAME\", \"COLUMN_TYPE\"";
```
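The clamped, density-based NDV estimate described in the comment above can be sketched as follows. This is a simplified illustration of the idea, not the actual Hive implementation; the class and method names are hypothetical:

```java
// Sketch of the density-based global NDV estimate described above
// (hypothetical helper, not the actual Hive code).
public class NdvEstimator {

  /**
   * Estimate the global NDV of a long column from per-partition stats,
   * clamping the density-based estimate between the lower bound
   * (max of partition NDVs) and the upper bound (sum of partition NDVs).
   *
   * @param globalLow  min(LONG_LOW_VALUE) over all partitions
   * @param globalHigh max(LONG_HIGH_VALUE) over all partitions
   * @param densitySum sum((high - low) / ndv) over all partitions
   * @param partCount  number of partitions with stats for this column
   * @param ndvSum     sum(NUM_DISTINCTS), the upper bound
   * @param ndvMax     max(NUM_DISTINCTS), the lower bound
   */
  public static long estimateNdv(long globalLow, long globalHigh,
      double densitySum, long partCount, long ndvSum, long ndvMax) {
    if (densitySum <= 0 || partCount <= 0) {
      return ndvSum; // no usable density info: fall back to the upper bound
    }
    // Average "range per distinct value" across partitions.
    double avgDensity = densitySum / partCount;
    long estimate = (long) ((globalHigh - globalLow) / avgDensity);
    // Keep the estimate within the provable bounds.
    return Math.max(ndvMax, Math.min(estimate, ndvSum));
  }
}
```

For two disjoint partitions covering [0, 50] and [50, 100] with 50 distinct values each, the estimate approaches the sum of the partition NDVs; for two fully overlapping partitions it collapses toward the max.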
2. For each batch, merge the returned rows as we do in this PR:
```java
try {
  List<Object[]> list = Batchable.runBatched(batchSize, partNames, b);
  Map<String, List<Object[]>> colSubList = columnWiseSubList(list);
  for (Map.Entry<String, List<Object[]>> entry : colSubList.entrySet()) {
    colStats.add(columnStatisticsObjWithAdjustedNDV(entry.getValue(), 0,
        useDensityFunctionForNDVEstimation, ndvTuner));
    Deadline.checkTimeout();
  }
} finally {
  b.closeAllQueries();
}
```
3. For each `ColumnStatisticsObj` obtained in step 2, replace each `sum` with `sum * partNames.size() / count(PARTITIONS + ".\"PART_ID\")` for that column.
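The adjustment in step 3 amounts to scaling a sum aggregated over the partitions that actually have stats for the column up to the full partition list. A minimal sketch of that arithmetic, assuming the covered partitions are representative (the class and method names are hypothetical):

```java
// Sketch of the step-3 extrapolation (hypothetical helper, not Hive code).
public class SumExtrapolator {

  /**
   * Scale a sum computed over the partitions that have column stats
   * up to the full partition list.
   *
   * @param sum          e.g. sum(NUM_NULLS) over the covered partitions
   * @param totalParts   partNames.size()
   * @param coveredParts count(PARTITIONS."PART_ID") returned for this column
   */
  public static long extrapolateSum(long sum, int totalParts, long coveredParts) {
    if (coveredParts <= 0) {
      return sum; // nothing to extrapolate from
    }
    return sum * totalParts / coveredParts;
  }
}
```

For example, if only 5 of 10 partitions have stats and their null count sums to 30, the extrapolated global null count is 60.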
This keeps `aggrStatsUseDB` simple and clear, and avoids sending unnecessary queries back and forth to the backing DB.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]