zhangbutao commented on code in PR #5400: URL: https://github.com/apache/hive/pull/5400#discussion_r1730863897
########## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java: ########## @@ -613,7 +616,7 @@ public boolean canComputeQueryUsingStats(org.apache.hadoop.hive.ql.metadata.Tabl } } } - return false; + return true; Review Comment: In case of delete files, if Iceberg statsSource can not compute query( eg. count(*)) using stats, i think HMS stats can't either. They both come from the same place -- `SnapshotSummary`. We should not give a wrong impression that HMS can give the accurate stats. ########## ql/src/java/org/apache/hadoop/hive/ql/optimizer/StatsOptimizer.java: ########## @@ -945,7 +945,10 @@ private Long getRowCnt( if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(tbl, tbl.getParameters())) { return null; } - rowCnt = Long.valueOf(tbl.getProperty(StatsSetupConst.ROW_COUNT)); + Map<String, String> basicStats = MetaStoreUtils.isNonNativeTable(tbl.getTTable()) ? + tbl.getStorageHandler().getBasicStatistics(tbl) : tbl.getParameters(); Review Comment: > I am not very certain how accurate is the TOTAL_RECORDS_PROP from the snapshot summary especially when there are deletes. Iceberg table with equal delets should not be optimized by the `count `query optimization, so we can skip to get the stats or return null in case of existing deletes. > yes, i think iceberg.hive.keep.stats should be enabled when stats source is not iceberg `iceberg.hive.keep.stats` should be always enabled(true) when stats source is iceberg. `iceberg.hive.keep.stats` true means that the stats is accurate. HMS stats for iceberg can be not accurate if some other engines(Spark、Trino) write the table but not update the HMS stats. But if the statsSource is iceberg, the stats is retrieved from iceberg `SnapshotSummary` which is real time and accurate. > could you please check the comments in #5215 and maybe incorporate changes from this PR into yours #5215 I want to optimize the `count `query with no care the values of `iceberg.hive.keep.stats`. But i am not do the limit like your change that `not use HMS stats when statsSource is Iceberg`. I am ok to incorporate this thange as well as some other supplements to #5215, we can continue to disscuss there. ########## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java: ########## @@ -600,7 +600,10 @@ private ColumnStatistics readColStats(Table table, Path statsPath) { @Override public boolean canComputeQueryUsingStats(org.apache.hadoop.hive.ql.metadata.Table hmsTable) { - if (getStatsSource().equals(HiveMetaHook.ICEBERG) && hmsTable.getMetaTable() == null) { + if (hmsTable.getMetaTable() != null) { Review Comment: Query against Iceberg Branch/Tag can also benefit from the stats. We can optimize this later. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org For additional commands, e-mail: gitbox-h...@hive.apache.org