Re: [PR] Iceberg: Do not use HMS stats when statsSource is Iceberg [hive]

via GitHub Thu, 29 Aug 2024 00:04:19 -0700


zhangbutao commented on code in PR #5400:
URL: https://github.com/apache/hive/pull/5400#discussion_r1735677285



##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java:
##########
@@ -613,7 +616,7 @@ public boolean canComputeQueryUsingStats(org.apache.hadoop.hive.ql.metadata.Tabl
         }
       }
     }
-    return false;
+    return true;

Review Comment:
   Good catch!
   When there are delete files, the `analyze table compute stats` job can still 
get accurate stats, because it launches a Tez task to compute them.
   
   After `analyze table compute stats` finishes, the HMS stats are updated and 
accurate, and `iceberg.hive.keep.stats` will be true, so we can use the HMS 
stats to optimize the `count` query.
   
   But if the statsSource is Iceberg and there are delete files, then even 
after running `analyze table compute stats` we won't update the Iceberg 
`SnapshotSummary`, so we cannot optimize the `count` query.
   
   This would look a little weird: users run `analyze table compute stats` to 
update the stats, but the `count` query still cannot be optimized when the 
statsSource is Iceberg and there are delete files.
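
   To make the scenarios above concrete, here is a rough sketch of the 
decision as I understand it. It is not the code in 
`HiveIcebergStorageHandler`: the class, enum, method, and parameter names are 
all hypothetical, and the "no delete files" branch for Iceberg is my 
assumption rather than something stated above.

```java
// Hypothetical sketch only; none of these names exist in Hive or Iceberg.
final class CountStatsSketch {

  // Where the optimizer is configured to read table stats from.
  enum StatsSource { HMS, ICEBERG }

  /**
   * Returns true if a count query could be answered from stats alone.
   *
   * @param source           configured stats source
   * @param hasDeleteFiles   whether the current snapshot has delete files
   * @param hmsStatsAccurate whether HMS stats are up to date, e.g. after
   *                         `analyze table compute stats` has run
   */
  static boolean canAnswerCountFromStats(
      StatsSource source, boolean hasDeleteFiles, boolean hmsStatsAccurate) {
    if (source == StatsSource.HMS) {
      // The analyze job computes stats with a Tez task, so the HMS numbers
      // are usable once it has run, with or without delete files.
      return hmsStatsAccurate;
    }
    // ICEBERG: the analyze job does not update the SnapshotSummary, so with
    // delete files the count cannot be answered from it.
    // Assumption: without delete files the summary can be trusted.
    return !hasDeleteFiles;
  }

  public static void main(String[] args) {
    // Delete files present, analyze already run, HMS as the source.
    System.out.println(
        canAnswerCountFromStats(StatsSource.HMS, true, true));
    // Same table state, but Iceberg as the source: the weird case.
    System.out.println(
        canAnswerCountFromStats(StatsSource.ICEBERG, true, true));
  }
}
```

   The second call is the case I find odd: the user has run the analyze job, 
yet the `count` query still falls back to a full computation.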



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


