[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

ASF GitHub Bot (Jira) Mon, 29 Mar 2021 01:51:04 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573368&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573368
 ]


ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 29/Mar/21 08:50
            Start Date: 29/Mar/21 08:50
    Worklog Time Spent: 10m 
      Work Description: kgyrtkirk commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r603114825



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/TaskCompiler.java
##########
@@ -433,6 +434,12 @@ private String extractTableFullName(StatsTask tsk) throws 
SemanticException {
       return TaskFactory.get(columnStatsWork);
     } else {
       BasicStatsWork statsWork = new 
BasicStatsWork(tableScan.getConf().getTableMetadata().getTableSpec());
+      for (MapWork mapWork :  (Collection<MapWork>) currentTask.getMapWork()) {
+        if (mapWork.getAliasToPartnInfo() != null && 
mapWork.getAliasToPartnInfo().containsKey(table.getTableName())) {

Review comment:
       we have a full table object passed to the `StatsWork` constructor - 
what's wrong with that?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java
##########
@@ -258,7 +268,7 @@ private int aggregateStats(Hive db) {
         Partish p;
         partishes.add(p = new Partish.PTable(table));
 
-        BasicStatsProcessor basicStatsProcessor = new BasicStatsProcessor(p, 
work, conf, followedColStats);
+        BasicStatsProcessor basicStatsProcessor = new 
BasicStatsProcessor(table, p, work, followedColStats);

Review comment:
       I don't think we need these table callse - you may simply use 
`Partish#getTable` to get access to the table object later

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java
##########
@@ -118,11 +118,15 @@ public String getName() {
     private boolean isMissingAcidState = false;
     private BasicStatsWork work;
     private boolean followedColStats1;
+    private boolean isBasicStatProvided;
+    private Map<String, String> providedBasicStats;
 
-    public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf 
conf, boolean followedColStats2) {
+    public BasicStatsProcessor(Table table, Partish partish, BasicStatsWork 
work, boolean followedColStats2) {

Review comment:
       yes; you should use the partish: `Partish.buildFor(table)`
   
   adding a `Table` here will cause confusion...

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java
##########
@@ -118,11 +118,15 @@ public String getName() {
     private boolean isMissingAcidState = false;
     private BasicStatsWork work;
     private boolean followedColStats1;
+    private boolean isBasicStatProvided;
+    private Map<String, String> providedBasicStats;
 
-    public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf 
conf, boolean followedColStats2) {
+    public BasicStatsProcessor(Table table, Partish partish, BasicStatsWork 
work, boolean followedColStats2) {

Review comment:
       you seem to have added a totally independent conditional  logic to this 
class; wouldn't it be easier to simply introduce a new `Processor` class for 
the purpose your are targeting?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java
##########
@@ -118,11 +118,15 @@ public String getName() {
     private boolean isMissingAcidState = false;
     private BasicStatsWork work;
     private boolean followedColStats1;
+    private boolean isBasicStatProvided;
+    private Map<String, String> providedBasicStats;
 
-    public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf 
conf, boolean followedColStats2) {
+    public BasicStatsProcessor(Table table, Partish partish, BasicStatsWork 
work, boolean followedColStats2) {
       this.partish = partish;
       this.work = work;
       followedColStats1 = followedColStats2;
+      providedBasicStats = 
table.getStorageHandler().getBasicStatistics(work.getTableDesc());

Review comment:
       I believe you don't need the `tableDesc` for this method -> you will not 
need to add it to `StatsWork` 

##########
File path: 
iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##########
@@ -153,6 +156,24 @@ public DecomposedPredicate decomposePredicate(JobConf 
jobConf, Deserializer dese
     return predicate;
   }
 
+  @Override
+  public Map<String, String> getBasicStatistics(TableDesc tableDesc) {

Review comment:
       this method seem to be using a `TableDesc` to be able to identify the 
underlying Iceberg table - wouldn't it be possible to do the same from a  
`Table` object?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 573368)
    Time Spent: 1h 10m  (was: 1h)

> In case of non-native tables use basic statistics from HiveStorageHandler
> -------------------------------------------------------------------------
>
>                 Key: HIVE-24928
>                 URL: https://issues.apache.org/jira/browse/HIVE-24928
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 4.0.0
>            Reporter: László Pintér
>            Assignee: László Pintér
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When we are running `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE 
> ... COMPUTE STATISTICS FOR COLUMNS` all the basic statistics are collected by 
> the BasicStatsTask class. This class tries to estimate the statistics by 
> scanning the directory of the table. 
> In the case of non-native tables (iceberg, hbase), the table directory might 
> contain metadata files as well, which would be counted by the BasicStatsTask 
> when calculating basic stats. 
> Instead of having this logic, the HiveStorageHandler implementation should 
> provide basic statistics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

Reply via email to