[
https://issues.apache.org/jira/browse/HIVE-23791?focusedWorklogId=453565&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-453565
]
ASF GitHub Bot logged work on HIVE-23791:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 01/Jul/20 17:56
Start Date: 01/Jul/20 17:56
Worklog Time Spent: 10m
Work Description: pvary commented on a change in pull request #1196:
URL: https://github.com/apache/hive/pull/1196#discussion_r448527274
##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java
##########
@@ -2614,28 +2633,25 @@ public static Path getVersionFilePath(Path deltaOrBase)
{
+ " from " + jc.get(ValidTxnWriteIdList.VALID_TABLES_WRITEIDS_KEY));
return null;
}
- Directory acidInfo = AcidUtils.getAcidState(fs, dir, jc, idList, null, false);
+ if (fs == null) {
+ fs = dir.getFileSystem(jc);
+ }
+ // Collect the all of the files/dirs
+ Map<Path, HdfsDirSnapshot> hdfsDirSnapshots = AcidUtils.getHdfsDirSnapshots(fs, dir);
Review comment:
Ohh... I think I get it now.
* You are right that this will do work which is not really needed in this
case: it creates objects which are not needed here
(dirSnapshot.metaDataFile/dirSnapshot.acidFormatFile), and we might also list
and create objects which are not needed in this snapshot. On the other hand,
the costly part on S3 (and on HDFS as well) is the number of remote calls,
which is reduced to a single listing instead of listing every directory
one by one.
* It should not be possible for it to skip a location which is needed. If
this happens, then it is a bug in AcidUtils.getAcidState, as it has to return
every directory which is readable.
What I do not understand in your comment is "this method would return a
something (it could still be a map) which could fill in stuff from hdfs if its
not cached already". The main thing we would like to avoid here is reading
HDFS again and again. The only way to realize that something is missing is to
read the directory again... or am I missing something? :)
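To make the remote-call argument concrete, here is a rough sketch. It is not Hive's code: `CountingFs`, `snapshotInOneCall`, `snapshotPerDirectory` and the sample paths are hypothetical names invented for illustration, and the "file system" is an in-memory set that only counts how many listing calls are made. It assumes the table directory contains only base/delta directories, each holding plain files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class ListingSketch {

    /** Hypothetical in-memory stand-in for a remote file system; counts listing calls. */
    static final class CountingFs {
        private final TreeSet<String> files = new TreeSet<>();
        int remoteCalls = 0;

        void addFile(String path) {
            files.add(path);
        }

        /** Direct children (files or directories) of dir; one simulated remote call. */
        Set<String> listChildren(String dir) {
            remoteCalls++;
            String prefix = dir + "/";
            Set<String> children = new TreeSet<>();
            for (String f : files) {
                if (f.startsWith(prefix)) {
                    int slash = f.indexOf('/', prefix.length());
                    children.add(slash < 0 ? f : f.substring(0, slash));
                }
            }
            return children;
        }

        /** Every file below dir, at any depth; one simulated remote call. */
        List<String> listRecursive(String dir) {
            remoteCalls++;
            String prefix = dir + "/";
            List<String> out = new ArrayList<>();
            for (String f : files) {
                if (f.startsWith(prefix)) {
                    out.add(f);
                }
            }
            return out;
        }
    }

    /** One recursive listing, grouped by parent directory afterwards. */
    static Map<String, List<String>> snapshotInOneCall(CountingFs fs, String root) {
        Map<String, List<String>> byDir = new TreeMap<>();
        for (String f : fs.listRecursive(root)) {
            String parent = f.substring(0, f.lastIndexOf('/'));
            byDir.computeIfAbsent(parent, k -> new ArrayList<>()).add(f);
        }
        return byDir;
    }

    /** Lists the root, then lists each child directory with its own call. */
    static Map<String, List<String>> snapshotPerDirectory(CountingFs fs, String root) {
        Map<String, List<String>> byDir = new TreeMap<>();
        for (String dir : fs.listChildren(root)) {
            byDir.put(dir, new ArrayList<>(fs.listChildren(dir)));
        }
        return byDir;
    }

    /** Made-up table layout: one base and two deltas, one bucket file each. */
    static CountingFs sampleFs() {
        CountingFs fs = new CountingFs();
        fs.addFile("/t/base_5/bucket_0");
        fs.addFile("/t/delta_6_6/bucket_0");
        fs.addFile("/t/delta_7_7/bucket_0");
        return fs;
    }

    public static void main(String[] args) {
        CountingFs a = sampleFs();
        Map<String, List<String>> one = snapshotInOneCall(a, "/t");
        CountingFs b = sampleFs();
        Map<String, List<String>> many = snapshotPerDirectory(b, "/t");
        System.out.println("single listing calls: " + a.remoteCalls);
        System.out.println("per-directory calls:  " + b.remoteCalls);
        System.out.println("same result: " + one.equals(many));
    }
}
```

With this sample layout both approaches build the same directory-to-files map, but the per-directory variant needs one call for the root plus one per base/delta directory, so its call count grows with the number of deltas while the single recursive listing stays at one.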
----------------------------------------------------------------
Issue Time Tracking
-------------------
Worklog Id: (was: 453565)
Time Spent: 40m (was: 0.5h)
> Optimize ACID stats generation
> ------------------------------
>
> Key: HIVE-23791
> URL: https://issues.apache.org/jira/browse/HIVE-23791
> Project: Hive
> Issue Type: Improvement
> Components: Statistics, Transactions
> Reporter: Peter Vary
> Assignee: Peter Vary
> Priority: Major
> Labels: pull-request-available
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Currently basic stats generation uses file listing for getting statistics,
> and also uses a file listing for getting the acid state. We should optimize
> this.