[
https://issues.apache.org/jira/browse/HIVE-15215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15668735#comment-15668735
]
Sahil Takiar commented on HIVE-15215:
-------------------------------------
Here is the code that trigger the file by file delete (inside the {{Hive.java}}
class):
{code}
replaceFiles(...) {
  ...
  FileSystem fs2 = oldPath.getFileSystem(conf);
  if (fs2.exists(oldPath)) {
    // Do not delete oldPath if:
    //  - destf is subdir of oldPath
    //if ( !(fs2.equals(destf.getFileSystem(conf)) && FileUtils.isSubDir(oldPath, destf, fs2)))
    isOldPathUnderDestf = FileUtils.isSubDir(oldPath, destf, fs2);
    if (isOldPathUnderDestf) {
      // if oldPath is destf or its subdir, its should definitely be deleted, otherwise its
      // existing content might result in incorrect (extra) data.
      // But not sure why we changed not to delete the oldPath in HIVE-8750 if it is
      // not the destf or its subdir?
      oldPathDeleted = FileUtils.trashFilesUnderDir(fs2, oldPath, conf);
    }
  }
  ...
}
{code}
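{{FileUtils.trashFilesUnderDir}} is where the slowdown shows up: it lists {{oldPath}} once and then removes each child with a separate call against the Hadoop {{FileSystem}} API. A simplified sketch of that per-file pattern (illustrative only, not the exact Hive source; the real utility routes each child through Hadoop's trash handling rather than deleting it directly):
{code}
// Simplified sketch of the per-file delete pattern in FileUtils.trashFilesUnderDir
// (illustrative, not the exact Hive source -- the real method moves each child to
// the trash instead of deleting it directly).
static boolean deleteChildrenOneByOne(FileSystem fs, Path dir) throws IOException {
  boolean allDeleted = true;
  // one listStatus() call, then a separate delete per child; on S3A every iteration
  // is at least one round trip to the object store, so large tables get very slow
  for (FileStatus status : fs.listStatus(dir)) {
    allDeleted &= fs.delete(status.getPath(), true);
  }
  return allDeleted;
}
{code}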
> Files on S3 are deleted one by one in INSERT OVERWRITE queries
> --------------------------------------------------------------
>
> Key: HIVE-15215
> URL: https://issues.apache.org/jira/browse/HIVE-15215
> Project: Hive
> Issue Type: Sub-task
> Components: Hive
> Reporter: Sahil Takiar
>
> When running {{INSERT OVERWRITE}} queries the files to overwrite are deleted
> one by one. The reason is that, by default, hive.exec.stagingdir is inside
> the target table directory.
> Ideally Hive would just delete the entire table directory, but it can't do
> that since the staging data is also inside the directory. Instead it deletes
> each file one-by-one, which is very slow.
> There are a few ways to fix this:
> 1: Move the staging directory outside the table location. This can be done by
> setting hive.exec.stagingdir to a different location when running on S3. It
> would be nice if users didn't have to explicitly set this when running on S3
> and things just worked out-of-the-box. My understanding is that
> hive.exec.stagingdir was only added to support HDFS encryption zones. Since
> S3 doesn't have encryption zones, there should be no problem with using the
> value of hive.exec.scratchdir to store all intermediate data instead.
> 2: Multi-thread the delete operations (a rough sketch follows below)
> 3: See if the {{S3AFileSystem}} can expose some type of bulk delete op
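> A very rough sketch of what option 2 could look like, using a fixed-size thread pool
> over the Hadoop {{FileSystem}} API (illustrative only; the method name, pool sizing,
> and error handling are made up and not an existing Hive API):
> {code}
> // Rough sketch of a parallel delete (illustrative only, not existing Hive code).
> static void deleteChildrenInParallel(final FileSystem fs, Path dir, int nThreads)
>     throws IOException, InterruptedException {
>   ExecutorService pool = Executors.newFixedThreadPool(nThreads);
>   for (final FileStatus status : fs.listStatus(dir)) {
>     pool.submit(new Runnable() {
>       @Override
>       public void run() {
>         try {
>           fs.delete(status.getPath(), true);
>         } catch (IOException e) {
>           // a real implementation would collect failures and surface them to the caller
>         }
>       }
>     });
>   }
>   pool.shutdown();
>   pool.awaitTermination(1, TimeUnit.HOURS);
> }
> {code}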
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)