[jira] [Work logged] (HIVE-23791) Optimize ACID stats generation

ASF GitHub Bot (Jira) Wed, 01 Jul 2020 10:57:22 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-23791?focusedWorklogId=453565&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-453565
 ]


ASF GitHub Bot logged work on HIVE-23791:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 01/Jul/20 17:56
            Start Date: 01/Jul/20 17:56
    Worklog Time Spent: 10m 
      Work Description: pvary commented on a change in pull request #1196:
URL: https://github.com/apache/hive/pull/1196#discussion_r448527274



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java
##########
@@ -2614,28 +2633,25 @@ public static Path getVersionFilePath(Path deltaOrBase) {
           + " from " + jc.get(ValidTxnWriteIdList.VALID_TABLES_WRITEIDS_KEY));
       return null;
     }
-    Directory acidInfo = AcidUtils.getAcidState(fs, dir, jc, idList, null, false);
+    if (fs == null) {
+      fs = dir.getFileSystem(jc);
+    }
+    // Collect the all of the files/dirs
+    Map<Path, HdfsDirSnapshot> hdfsDirSnapshots = AcidUtils.getHdfsDirSnapshots(fs, dir);

Review comment:
       Ohh.. I think I get it now.
   * You are right that this does work which is not really needed in this 
case: it creates objects which are not needed here 
(dirSnapshot.metaDataFile/dirSnapshot.acidFormatFile), and we might also list 
and create objects which are not needed in this snapshot. On the other hand, 
the costly part on S3 (and on HDFS as well) is the number of remote calls, 
which is reduced to a single listing instead of listing every directory one 
by one.
   * It is not possible for it to skip a location which is needed. If that 
happens, then it is a bug in AcidUtils.getAcidState, as it has to return 
every directory which is readable.
   
   What I do not understand in your comment is "this method would return a 
something (it could still be a map) which could fill in stuff from hdfs if its 
not cached already". The main thing we would like to avoid here is reading 
HDFS again and again. The only way to realize that something is missing is to 
read the directory again... or am I missing something? :)
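
   To make the trade-off above concrete, here is a minimal, self-contained 
sketch. It does not use the real Hive or Hadoop APIs: `FakeStore`, 
`listRecursive`, `listDir` and `snapshot` are hypothetical names, and the 
paths are only illustrative. It assumes the store can return every file under 
a root in one call, and it counts listing calls to show why one recursive 
listing grouped by parent directory can be cheaper than listing each 
directory separately, even if it returns some entries the caller does not use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ListingSketch {

  /** Hypothetical in-memory stand-in for a remote store; counts listing calls. */
  static final class FakeStore {
    private final List<String> files = new ArrayList<>();
    int remoteCalls = 0;

    FakeStore add(String file) {
      files.add(file);
      return this;
    }

    /** One simulated remote call: every file under root, at any depth. */
    List<String> listRecursive(String root) {
      remoteCalls++;
      List<String> result = new ArrayList<>();
      for (String f : files) {
        if (f.startsWith(root + "/")) {
          result.add(f);
        }
      }
      return result;
    }

    /** One simulated remote call: only the files directly inside dir. */
    List<String> listDir(String dir) {
      remoteCalls++;
      List<String> result = new ArrayList<>();
      for (String f : files) {
        if (parentOf(f).equals(dir)) {
          result.add(f);
        }
      }
      return result;
    }
  }

  /** Returns the path with its last component removed. */
  static String parentOf(String path) {
    return path.substring(0, path.lastIndexOf('/'));
  }

  /** Builds a directory -> files map from a single recursive listing. */
  static Map<String, List<String>> snapshot(FakeStore store, String root) {
    Map<String, List<String>> byDir = new TreeMap<>();
    for (String f : store.listRecursive(root)) {
      byDir.computeIfAbsent(parentOf(f), d -> new ArrayList<>()).add(f);
    }
    return byDir;
  }

  public static void main(String[] args) {
    FakeStore store = new FakeStore()
        .add("/t/base_5/bucket_0")
        .add("/t/delta_6_6/bucket_0")
        .add("/t/delta_7_7/bucket_0")
        .add("/t/delta_7_7/_orc_acid_version");

    // Snapshot approach: one listing call, grouped in memory afterwards.
    Map<String, List<String>> snap = snapshot(store, "/t");
    System.out.println(snap.keySet() + " snapshot calls=" + store.remoteCalls);

    // Per-directory approach: one listing call for each directory.
    int before = store.remoteCalls;
    for (String dir : snap.keySet()) {
      store.listDir(dir);
    }
    System.out.println("per-directory calls=" + (store.remoteCalls - before));
  }
}
```

   In this sketch the snapshot costs one simulated call regardless of how 
many directories exist, while the per-directory loop costs one call per 
directory. Detecting a directory that is missing from the map would still 
require another listing call, which is the concern raised above.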
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



Issue Time Tracking
-------------------

    Worklog Id:     (was: 453565)
    Time Spent: 40m  (was: 0.5h)

> Optimize ACID stats generation
> ------------------------------
>
>                 Key: HIVE-23791
>                 URL: https://issues.apache.org/jira/browse/HIVE-23791
>             Project: Hive
>          Issue Type: Improvement
>          Components: Statistics, Transactions
>            Reporter: Peter Vary
>            Assignee: Peter Vary
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently basic stats generation uses file listing for getting statistics, 
> and also uses a file listing for getting the acid state. We should optimize 
> this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
