deniskuzZ commented on code in PR #6525:
URL: https://github.com/apache/hive/pull/6525#discussion_r3403707839
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -1993,6 +1982,15 @@ public static boolean checkCanProvideColumnStats(Table table) {
return !table.isNonNative() || table.getStorageHandler().canProvideColStatistics(table);
}
+ /**
+ * Whether a table's statistics may be used for plan-shape optimizations such as semijoin
+ * reduction or map-join conversion, where relying on stale stats only affects performance,
+ * never correctness.
+ */
+ public static boolean checkCanProvideStatsForOpt(Table table) {
+ return checkCanProvideStats(table) || StatsSetupConst.areBasicStatsUptoDate(table.getParameters());
Review Comment:
This change only affects a planning-time optimization and does not affect
query correctness. If statistics on an external table become stale because a
third-party tool added data directly to the table location, the worst-case
outcome is that the optimizer makes a suboptimal decision. The query will still
return correct results.

External tables are also a very common deployment model in Hive. Many
customers routinely run ANALYZE TABLE ... COMPUTE STATISTICS on external tables
specifically to improve query performance. As with any other statistics-based
optimization, users who want the optimizer to make good decisions are expected
to keep statistics reasonably up to date.

So I'm not sure what additional risk this change introduces beyond the
existing and well-understood behavior of statistics-based optimization in Hive.
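
To make the argument concrete, here is a minimal, self-contained sketch of the gating logic in the diff. It does not use Hive classes: `TableInfo`, `canProvideStats`, `basicStatsUpToDate`, and the `basicStatsAccurate` parameter key are hypothetical stand-ins, and the assumption that the existing check simply rejects external tables is a simplification for illustration, not a description of what `checkCanProvideStats` actually does.

```java
import java.util.Map;

public class StatsGateSketch {

  /** Hypothetical stand-in for a table: only the two facts the check needs. */
  public static final class TableInfo {
    final boolean external;
    final Map<String, String> parameters;

    public TableInfo(boolean external, Map<String, String> parameters) {
      this.external = external;
      this.parameters = parameters;
    }
  }

  /** Assumption: simplified stand-in for the existing check; trusts only non-external tables. */
  public static boolean canProvideStats(TableInfo table) {
    return !table.external;
  }

  /** Assumption: simplified stand-in for the "basic stats are up to date" marker, using a made-up key. */
  public static boolean basicStatsUpToDate(Map<String, String> parameters) {
    return "true".equalsIgnoreCase(parameters.get("basicStatsAccurate"));
  }

  /** Mirrors the shape of the new method: either condition allows stats-based plan-shape optimizations. */
  public static boolean canProvideStatsForOpt(TableInfo table) {
    return canProvideStats(table) || basicStatsUpToDate(table.parameters);
  }

  public static void main(String[] args) {
    TableInfo managed = new TableInfo(false, Map.of());
    TableInfo externalAnalyzed = new TableInfo(true, Map.of("basicStatsAccurate", "true"));
    TableInfo externalUnknown = new TableInfo(true, Map.of());

    System.out.println("managed: " + canProvideStatsForOpt(managed));
    System.out.println("external, stats marked accurate: " + canProvideStatsForOpt(externalAnalyzed));
    System.out.println("external, no marker: " + canProvideStatsForOpt(externalUnknown));
  }
}
```

Under these assumptions, an external table becomes eligible for the optimization only once its statistics are marked accurate. An external table with no such marker is treated exactly as before.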
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]