Re: [PR] Iceberg: Do not use HMS stats when statsSource is Iceberg [hive]

via GitHub Mon, 26 Aug 2024 01:43:40 -0700


zhangbutao commented on code in PR #5400:
URL: https://github.com/apache/hive/pull/5400#discussion_r1730863897



##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java:
##########
@@ -613,7 +616,7 @@ public boolean 
canComputeQueryUsingStats(org.apache.hadoop.hive.ql.metadata.Tabl
         }
       }
     }
-    return false;
+    return true;

Review Comment:
   In case of delete files, if Iceberg statsSource can not compute query( eg. 
count(*)) using stats, i think HMS stats can't either. They both come from the 
same place -- `SnapshotSummary`.
   
   We should not give a wrong impression that HMS can give the accurate stats.



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/StatsOptimizer.java:
##########
@@ -945,7 +945,10 @@ private Long getRowCnt(
         if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(tbl, 
tbl.getParameters())) {
           return null;
         }
-        rowCnt = Long.valueOf(tbl.getProperty(StatsSetupConst.ROW_COUNT));
+        Map<String, String> basicStats = 
MetaStoreUtils.isNonNativeTable(tbl.getTTable()) ?
+          tbl.getStorageHandler().getBasicStatistics(tbl) : 
tbl.getParameters();

Review Comment:
   > I am not very certain how accurate is the TOTAL_RECORDS_PROP from the 
snapshot summary especially when there are deletes.
   
   Iceberg table with equal delets should not be optimized by the `count `query 
optimization, so we can skip to get the stats or return null in case of 
existing deletes.
   
   > yes, i think iceberg.hive.keep.stats should be enabled when stats source 
is not iceberg
   
   `iceberg.hive.keep.stats` should be always enabled(true) when stats source 
is iceberg.  `iceberg.hive.keep.stats` true means that the stats is accurate. 
   HMS stats for iceberg can be not accurate if some other engines(Spark、Trino) 
write the table but not update the HMS stats.
   But if the statsSource is iceberg, the stats is retrieved from iceberg 
`SnapshotSummary` which is real time and accurate.
   
   > could you please check the comments in #5215 and maybe incorporate changes 
from this PR into yours
   
   #5215 I want to optimize the `count `query with no care the values of  
`iceberg.hive.keep.stats`. But i am not do the limit like your change that `not 
use HMS stats when statsSource is Iceberg`.  I am ok to incorporate this thange 
as well as some other supplements to #5215, we can continue to disscuss there.



##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java:
##########
@@ -600,7 +600,10 @@ private ColumnStatistics readColStats(Table table, Path 
statsPath) {
 
   @Override
   public boolean 
canComputeQueryUsingStats(org.apache.hadoop.hive.ql.metadata.Table hmsTable) {
-    if (getStatsSource().equals(HiveMetaHook.ICEBERG) && 
hmsTable.getMetaTable() == null) {
+    if (hmsTable.getMetaTable() != null) {

Review Comment:
   Query against Iceberg Branch/Tag can also benefit from the stats. We can 
optimize this later.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Iceberg: Do not use HMS stats when statsSource is Iceberg [hive]

Reply via email to