[
https://issues.apache.org/jira/browse/HIVE-15215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675071#comment-15675071
]
Sahil Takiar commented on HIVE-15215:
-------------------------------------
Was looking at old code when I filed this; parallel deletes were actually
implemented in HIVE-13726.
Leaving this open for now and re-scoping it to "Investigate if staging data on
S3 can always go under the scratch dir"
> Files on S3 are deleted one by one in INSERT OVERWRITE queries
> --------------------------------------------------------------
>
> Key: HIVE-15215
> URL: https://issues.apache.org/jira/browse/HIVE-15215
> Project: Hive
> Issue Type: Sub-task
> Components: Hive
> Reporter: Sahil Takiar
>
> When running {{INSERT OVERWRITE}} queries, the files to overwrite are deleted
> one by one. The reason is that, by default, {{hive.exec.stagingdir}} resolves
> to a {{.hive-staging}} directory inside the target table directory.
> Ideally Hive would just delete the entire table directory, but it can't do
> that because the staging data also lives inside that directory. Instead it
> deletes each file one by one, which is very slow on S3, where every delete is
> a separate HTTP request.
> There are a few ways to fix this:
> 1: Move the staging directory outside the table location. This can be done by
> setting {{hive.exec.stagingdir}} to a different location when running on S3
> (a configuration sketch follows this list). It would be nice if users didn't
> have to set this explicitly when running on S3 and things just worked
> out-of-the-box. My understanding is that {{hive.exec.stagingdir}} was only
> added to support HDFS encryption zones. Since S3 doesn't have encryption
> zones, there should be no problem with using the value of
> {{hive.exec.scratchdir}} to store all intermediate data instead.
> 2: Multi-thread the delete operations (a sketch follows this list)
> 3: See if the {{S3AFileSystem}} can expose some type of bulk delete op (a
> sketch of the underlying SDK call also follows)
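For option 1 (and the re-scoped question in the comment above), a minimal
sketch of what pointing the staging dir at the scratch dir could look like.
Both property names are real Hive configs; {{s3a://my-bucket}} and the class
name are placeholders:

{code:java}
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.conf.HiveConf.ConfVars;

/** Sketch: keep staging data under the scratch dir, not the table dir. */
public class StagingUnderScratchDir {
  public static HiveConf withS3Staging() {
    HiveConf conf = new HiveConf();
    // s3a://my-bucket is a placeholder. With this layout, an INSERT OVERWRITE
    // could clear the table directory without touching in-flight staging data.
    conf.setVar(ConfVars.SCRATCHDIR, "s3a://my-bucket/tmp/hive");
    conf.setVar(ConfVars.STAGINGDIR, "s3a://my-bucket/tmp/hive/.hive-staging");
    return conf;
  }
}
{code}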
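For option 2, a minimal sketch of multi-threaded deletes against the Hadoop
{{FileSystem}} API. The class and method names are hypothetical, and it assumes
the default {{.hive-staging}} prefix for staging directories that must survive
the delete:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical sketch: delete a table directory's children in parallel. */
public class ParallelDelete {
  public static void deleteChildren(FileSystem fs, Path tableDir, int threads)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Boolean>> pending = new ArrayList<>();
      for (FileStatus child : fs.listStatus(tableDir)) {
        Path p = child.getPath();
        // Skip in-flight staging data (default hive.exec.stagingdir prefix).
        if (p.getName().startsWith(".hive-staging")) {
          continue;
        }
        // Each S3 delete is an independent HTTP request, so issuing them
        // concurrently hides most of the per-request latency.
        pending.add(pool.submit(() -> fs.delete(p, true)));
      }
      for (Future<Boolean> f : pending) {
        f.get(); // propagate any failure
      }
    } finally {
      pool.shutdown();
    }
  }
}
{code}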
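For option 3, S3 itself already offers a bulk {{DeleteObjects}} call (at most
1000 keys per request) through the AWS SDK that S3A is built on. A sketch of
what such a bulk delete op could wrap; {{BulkDelete}} and its method are
hypothetical names:

{code:java}
import java.util.List;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.DeleteObjectsRequest;
import com.amazonaws.services.s3.model.DeleteObjectsRequest.KeyVersion;

/** Sketch: remove many objects in one round trip instead of one per object. */
public class BulkDelete {
  // S3's DeleteObjects API accepts at most 1000 keys per request.
  private static final int MAX_KEYS_PER_REQUEST = 1000;

  public static void deleteAll(AmazonS3 s3, String bucket, List<KeyVersion> keys) {
    for (int i = 0; i < keys.size(); i += MAX_KEYS_PER_REQUEST) {
      List<KeyVersion> batch =
          keys.subList(i, Math.min(i + MAX_KEYS_PER_REQUEST, keys.size()));
      // One HTTP request deletes the whole batch.
      s3.deleteObjects(new DeleteObjectsRequest(bucket).withKeys(batch));
    }
  }
}
{code}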