[ https://issues.apache.org/jira/browse/HIVE-22548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983416#comment-16983416 ]
Steve Loughran commented on HIVE-22548:
---------------------------------------
Also, at L1644 it calls path.exists() before the listFiles. Has anyone noticed that
it is marked as deprecated? There's a reason we warn people about it: this recurrent
code path of exists + operation duplicates the expensive check for whether files or
directories exist.
*Just call listStatus and treat a FileNotFoundException as a sign that the path doesn't exist.*
That is exactly what exists() does, after all.
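A minimal sketch of that pattern, assuming the usual FileSystem/Path pair (the helper name listIfPresent is mine, not Hive's):
{code:java}
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class ListingSketch {
  // One round trip instead of exists() + listStatus(): a missing path simply
  // yields an empty listing.
  static FileStatus[] listIfPresent(FileSystem fs, Path path) throws IOException {
    try {
      return fs.listStatus(path);
    } catch (FileNotFoundException e) {
      // path doesn't exist: nothing to process
      return new FileStatus[0];
    }
  }
}
{code}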
While I'm looking at that class:
h3. removeEmptyDpDirectory
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1601]
This contains a needless listFiles just to see if the directory is empty.
If you use delete(path, false) (i.e. the non-recursive one), it does the check
for children internally *and rejects the call*. Just swallow any exception it
raises telling you off about that fact (see the sketch below).
* We have a test for this for every single file system; it is the same as "rm
dir" on the command line. You do not need to worry about it being implemented
wrong.
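A sketch of that shape of removeEmptyDpDirectory, keeping the existing FileSystem/Path arguments; exactly how a non-empty directory is rejected (a false return or an IOException such as PathIsNotEmptyDirectoryException) varies by store, so the sketch treats both as "keep the directory":
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class EmptyDirSketch {
  // The non-recursive delete does the "is it empty?" check itself; no listing needed.
  static void removeEmptyDpDirectory(FileSystem fs, Path path) {
    try {
      // Only an empty directory is deleted; one with children is rejected, either
      // by a false return or by an IOException, depending on the filesystem.
      fs.delete(path, false);
    } catch (IOException e) {
      // The directory wasn't empty; that is exactly the case where we keep it.
    }
  }
}
{code}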
h3. removeTempOrDuplicateFiles
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1757]
delete() returns false in only two conditions:
# you've tried to delete root
# the file wasn't actually there
You shouldn't need to check the return value; if there is any chance that some
other process deletes the temp file first, checking would turn a no-op into a failure.
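A sketch of a deletion that doesn't fail when the file is already gone (deleteQuietly is a hypothetical helper name):
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class DeleteSketch {
  // Delete a temp/duplicate file without turning "already gone" into a failure.
  static void deleteQuietly(FileSystem fs, Path duplicate) throws IOException {
    // A false return only means the path was not there (or was the root). If some
    // other process cleaned the temp file up first, that is not an error, so the
    // return value is deliberately ignored; genuine failures still raise IOException.
    fs.delete(duplicate, true);
  }
}
{code}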
h3. getFileSizeRecursively()
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1840]
getFileSizeRecursively() is potentially really expensive too.
[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1853]
This swallows all the exception details. Please include the message and the nested
exception; everyone who fields support calls will appreciate it.
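A hypothetical shape for that catch block (the helper name and the RuntimeException wrapper are mine; use whichever exception type the method already declares):
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class SizeSketch {
  // Wrap the failure with the path, the original message and the nested cause,
  // so whoever reads the stack trace can see what actually went wrong.
  static long sizeOf(FileSystem fs, Path path) {
    try {
      ContentSummary summary = fs.getContentSummary(path);
      return summary.getLength();
    } catch (IOException e) {
      throw new RuntimeException(
          "Failed to get size of " + path + ": " + e.getMessage(), e);
    }
  }
}
{code}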
> Optimise Utilities.removeTempOrDuplicateFiles when moving files to final
> location
> ---------------------------------------------------------------------------------
>
> Key: HIVE-22548
> URL: https://issues.apache.org/jira/browse/HIVE-22548
> Project: Hive
> Issue Type: Improvement
> Components: Hive
> Affects Versions: 3.1.2
> Reporter: Rajesh Balamohan
> Priority: Major
>
> {{Utilities.removeTempOrDuplicateFiles}}
> is very slow with cloud storage, as it executes {{listStatus}} twice and also
> runs in single threaded mode.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1629