[ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=579836&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579836
 ]

ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 09/Apr/21 07:45
            Start Date: 09/Apr/21 07:45
    Worklog Time Spent: 10m 
      Work Description: pvary commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r610417708



##########
File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -153,6 +156,37 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
     return predicate;
   }
 
+  @Override
+  public boolean canProvideBasicStatistics() {
+    return true;
+  }
+
+  @Override
+  public Map<String, String> getBasicStatistics(TableDesc tableDesc) {
+    Table table = Catalogs.loadTable(conf, tableDesc.getProperties());
+    Map<String, String> stats = new HashMap<>();
+    if (table.currentSnapshot() != null) {
+      Map<String, String> summary = table.currentSnapshot().summary();
+      if (summary != null) {
+        if (summary.containsKey(SnapshotSummary.TOTAL_DATA_FILES_PROP)) {
+          stats.put(StatsSetupConst.NUM_FILES, summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP));
+        }
+        if (summary.containsKey(SnapshotSummary.TOTAL_RECORDS_PROP)) {
+          stats.put(StatsSetupConst.ROW_COUNT, summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
+        }
+        // TODO: add TOTAL_SIZE when iceberg 0.12 is released
+        if (summary.containsKey("total-files-size")) {
+          stats.put(StatsSetupConst.TOTAL_SIZE, summary.get("total-files-size"));
+        }
+      }
+    } else {
+      stats.put(StatsSetupConst.NUM_FILES, "0");

Review comment:
       Is this for an empty table, or for when we do not have statistics at hand?
   We might want to handle the case where statistics have not been calculated
   yet, or where the table info is incomplete.
   
   On the Iceberg dev list I have seen this conversation:
   
https://mail-archives.apache.org/mod_mbox/iceberg-dev/202104.mbox/%3c9a11adb4-27d8-40f1-8141-531287c03...@gmail.com%3e
   
   > So the tldr, Missing is OK, but inaccurate is not
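
   To illustrate the "missing is OK" option, here is a minimal, self-contained
   sketch that copies a statistic only when the snapshot summary actually
   contains it and never substitutes a default. The class name, the method
   names and the local string constants are assumptions made for this example
   (stand-ins for the StatsSetupConst and SnapshotSummary constants used in the
   diff); this is not the code from the pull request. Whether a table without
   a snapshot should report zeros instead of an empty map is exactly the open
   question above, so the sketch does not settle it.

```java
import java.util.HashMap;
import java.util.Map;

public class SnapshotStatsSketch {
  // Local stand-ins with assumed values for StatsSetupConst.* (Hive side).
  public static final String NUM_FILES = "numFiles";
  public static final String ROW_COUNT = "numRows";
  public static final String TOTAL_SIZE = "totalSize";
  // Local stand-ins with assumed values for SnapshotSummary.* (Iceberg side).
  public static final String TOTAL_DATA_FILES = "total-data-files";
  public static final String TOTAL_RECORDS = "total-records";
  public static final String TOTAL_FILES_SIZE = "total-files-size";

  /**
   * Copies only the statistics that the summary contains.
   * A null summary yields an empty map: the values are reported as missing,
   * not as zero.
   */
  public static Map<String, String> fromSummary(Map<String, String> summary) {
    Map<String, String> stats = new HashMap<>();
    if (summary == null) {
      return stats;
    }
    copyIfPresent(summary, TOTAL_DATA_FILES, stats, NUM_FILES);
    copyIfPresent(summary, TOTAL_RECORDS, stats, ROW_COUNT);
    copyIfPresent(summary, TOTAL_FILES_SIZE, stats, TOTAL_SIZE);
    return stats;
  }

  // Copies one entry under a new key, skipping it when the source has no value.
  private static void copyIfPresent(Map<String, String> from, String fromKey,
                                    Map<String, String> to, String toKey) {
    String value = from.get(fromKey);
    if (value != null) {
      to.put(toKey, value);
    }
  }

  public static void main(String[] args) {
    Map<String, String> summary = new HashMap<>();
    summary.put(TOTAL_RECORDS, "42");
    // Only the row count is present, so only the row count is reported.
    System.out.println(fromSummary(summary));
    // No summary at all: nothing is reported.
    System.out.println(fromSummary(null));
  }
}
```

   With this shape the caller can tell "no value known" (key absent) apart
   from "value is zero" (key present with "0").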



Issue Time Tracking
-------------------

    Worklog Id:     (was: 579836)
    Time Spent: 5.5h  (was: 5h 20m)

> In case of non-native tables use basic statistics from HiveStorageHandler
> -------------------------------------------------------------------------
>
>                 Key: HIVE-24928
>                 URL: https://issues.apache.org/jira/browse/HIVE-24928
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 4.0.0
>            Reporter: László Pintér
>            Assignee: László Pintér
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> When we are running `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE 
> ... COMPUTE STATISTICS FOR COLUMNS` all the basic statistics are collected by 
> the BasicStatsTask class. This class tries to estimate the statistics by 
> scanning the directory of the table. 
> In the case of non-native tables (iceberg, hbase), the table directory might 
> contain metadata files as well, which would be counted by the BasicStatsTask 
> when calculating basic stats. 
> Instead of having this logic, the HiveStorageHandler implementation should 
> provide basic statistics.
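
The proposal in the description can be sketched as a simple dispatch: ask the
storage handler first, and fall back to the directory scan only when the
handler does not offer statistics. The interface below is a simplified,
hypothetical version of the two methods shown in the diff
(canProvideBasicStatistics and getBasicStatistics, here without the TableDesc
argument); the class name and the resolve method are assumptions for
illustration and do not reflect how BasicStatsTask is actually structured.

```java
import java.util.Map;

public class BasicStatsDispatchSketch {
  /** Simplified, hypothetical stand-in for the storage handler methods in the diff. */
  public interface StatsProvider {
    boolean canProvideBasicStatistics();

    Map<String, String> getBasicStatistics();
  }

  /**
   * Returns the handler's statistics when it offers them; otherwise returns
   * the statistics obtained from scanning the table directory.
   */
  public static Map<String, String> resolve(StatsProvider handler,
                                            Map<String, String> directoryScanStats) {
    if (handler != null && handler.canProvideBasicStatistics()) {
      return handler.getBasicStatistics();
    }
    return directoryScanStats;
  }
}
```

For a non-native table the handler path avoids counting metadata files,
because the numbers come from the table format itself and not from the
directory listing.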



