[ https://issues.apache.org/jira/browse/HIVE-25277?focusedWorklogId=646433&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-646433 ]
ASF GitHub Bot logged work on HIVE-25277:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 03/Sep/21 18:39
Start Date: 03/Sep/21 18:39
Worklog Time Spent: 10m
Work Description: coufon commented on a change in pull request #2421:
URL: https://github.com/apache/hive/pull/2421#discussion_r702097544
##########
File path:
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##########
@@ -5240,16 +5259,38 @@ public DropPartitionsResult drop_partitions_req(
for (Path path : archToDelete) {
wh.deleteDir(path, true, mustPurge, needsCm);
}
+
+ // Uses a priority queue to delete the parents of deleted directories if empty.
+ // The parent with the largest size is always processed first. It guarantees that
+ // the emptiness of a parent won't be changed once it has been processed. So duplicated
+ // processing can be avoided.
+ PriorityQueue<PathAndPartValSize> parentsToDelete = new PriorityQueue<>();
for (PathAndPartValSize p : dirsToDelete) {
wh.deleteDir(p.path, true, mustPurge, needsCm);
+ addParentForDel(parentsToDelete, p);
+ }
+
+ HashSet<PathAndPartValSize> processed = new HashSet<>();
+ while (!parentsToDelete.isEmpty()) {
try {
- deleteParentRecursive(p.path.getParent(), p.partValSize - 1, mustPurge, needsCm);
+ PathAndPartValSize p = parentsToDelete.poll();
+ if (processed.contains(p)) {
+ continue;
+ }
+ processed.add(p);
+
+ Path path = p.path;
+ if (wh.isWritable(path) && wh.isDir(path) && wh.isEmptyDir(path)) {
Review comment:
wh.isEmptyDir uses listStatus, which doesn't distinguish between a file and a
dir (at least in the GCS fs implementation:
https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/7825ab50c839aea43f1ff587b0e2803047af99bc/gcsio/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageFileSystem.java#L997).
But I agree that isEmptyDir is enough regardless of whether the path is a file or a dir.
Issue Time Tracking
-------------------
Worklog Id: (was: 646433)
Time Spent: 3.5h (was: 3h 20m)
> Slow Hive partition deletion for Cloud object stores with expensive ListFiles
> -----------------------------------------------------------------------------
>
> Key: HIVE-25277
> URL: https://issues.apache.org/jira/browse/HIVE-25277
> Project: Hive
> Issue Type: Improvement
> Components: Standalone Metastore
> Affects Versions: All Versions
> Reporter: Zhou Fang
> Assignee: Zhou Fang
> Priority: Major
> Labels: pull-request-available
> Time Spent: 3.5h
> Remaining Estimate: 0h
>
> Deleting a Hive partition is slow when the warehouse is a Cloud object store
> for which ListFiles is expensive. One root cause is that the recursive
> parent dir deletion is very inefficient: it makes many duplicated calls to
> isEmpty (which end up calling ListFiles). This fix sorts the parents to
> delete by path size and always processes the longest one first
> (e.g., a/b/c is always processed before a/b). As a result, each parent path
> needs to be checked only once.
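
The deepest-first ordering described above can be sketched in isolation. This
is only an illustrative model, not the code from the pull request: the class
ParentCleanupSketch, its helper methods, the in-memory set of paths standing in
for the file system, and the "table root" stop condition are all assumptions
made for this example. It shows why polling the longest path first lets each
parent be checked for emptiness at most once.

```java
import java.util.*;

public class ParentCleanupSketch {
    /** Number of path components, e.g. depth("a/b/c") is 3. */
    static int depth(String path) {
        return path.split("/").length;
    }

    /** Parent of a path, or null for a single-component path. */
    static String parent(String path) {
        int i = path.lastIndexOf('/');
        return i < 0 ? null : path.substring(0, i);
    }

    /** A directory is empty when no remaining entry lives underneath it. */
    static boolean isEmpty(Set<String> entries, String dir) {
        String prefix = dir + "/";
        for (String e : entries) {
            if (e.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    /** Queues the parent of path unless it is the (hypothetical) table root or above. */
    static void enqueueParent(PriorityQueue<String> queue, String path, String root) {
        String p = parent(path);
        if (p != null && !p.equals(root)) {
            queue.add(p);
        }
    }

    /**
     * Removes the given directories from entries, then removes parents that
     * became empty, deepest first. Returns the directories whose emptiness
     * was checked, in the order they were checked.
     */
    static List<String> deleteWithEmptyParents(Set<String> entries, List<String> deleted, String root) {
        // Deepest path first, so a parent is polled only after all of its
        // queued descendants have been handled.
        PriorityQueue<String> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(depth(b), depth(a)));
        for (String d : deleted) {
            entries.removeIf(e -> e.equals(d) || e.startsWith(d + "/"));
            enqueueParent(queue, d, root);
        }
        Set<String> processed = new HashSet<>();
        List<String> checked = new ArrayList<>();
        while (!queue.isEmpty()) {
            String dir = queue.poll();
            if (!processed.add(dir)) {
                continue; // the same parent was queued by several children
            }
            checked.add(dir);
            if (entries.contains(dir) && isEmpty(entries, dir)) {
                entries.remove(dir);
                enqueueParent(queue, dir, root);
            }
        }
        return checked;
    }

    public static void main(String[] args) {
        Set<String> entries = new HashSet<>(Arrays.asList(
            "t", "t/a=1", "t/a=1/b=1", "t/a=1/b=1/c=1", "t/a=1/b=1/c=2",
            "t/a=1/b=2", "t/a=1/b=2/c=1", "t/a=2", "t/a=2/b=1", "t/a=2/b=1/c=1"));
        List<String> checked = deleteWithEmptyParents(entries,
            Arrays.asList("t/a=1/b=1/c=1", "t/a=1/b=1/c=2", "t/a=1/b=2/c=1"), "t");
        System.out.println("checked: " + checked);
        System.out.println("remaining: " + new TreeSet<>(entries));
    }
}
```

In this model, dropping three leaf partitions triggers three emptiness checks
in total (one per distinct parent), even though the parents are queued five
times. With an object store, each avoided check is an avoided listing call.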
--
This message was sent by Atlassian Jira
(v8.3.4#803005)