[jira] [Commented] (HIVE-14128) Parallelize jobClose phases

Rajesh Balamohan (JIRA) Mon, 01 Aug 2016 04:42:30 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-14128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401900#comment-15401900
 ]


Rajesh Balamohan commented on HIVE-14128:
-----------------------------------------

[~ashutoshc] - In the non-partitioned case, there can be multiple part files 
within the temp directory. When this directory is moved in HDFS, the move is 
simple. But in some file systems like S3, it would still turn out to be 
expensive. E.g., lineitem is a non-partitioned dataset in TPC-H. A simple 
insert overwrite would have the following move at the end of the job. Please 
note that this directory internally has 300+ part files, so the rename would 
turn out to be expensive here.

{noformat}
2016-08-01T04:40:00,154  INFO [JobClose-Thread-0] exec.FileSinkOperator: Moving tmp dir: s3a://bucket/lineitem/.hive-staging_hive_2016-08-01_04-31-26_432_5317262787271448273-1/_tmp.-ext-10000 to: s3a://bucket/lineitem/.hive-staging_hive_2016-08-01_04-31-26_432_5317262787271448273-1/-ext-10000
{noformat}

Should we consider a file-by-file move in such cases?
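
To make the question concrete, here is a rough sketch of what a file-by-file move could look like when each part file is handed to a thread pool. It is not taken from the attached patches. The class name ParallelMoveSketch, the method moveFiles, and the threads parameter are all hypothetical, and java.nio.file on a local directory stands in for the Hadoop FileSystem calls Hive would actually use. The idea is that on a store like S3, where a rename is (as far as I understand) a copy followed by a delete per object, the per-file moves could at least overlap instead of running one after another.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Stream;

/** Hypothetical sketch: moves each file of a directory in its own task. */
public class ParallelMoveSketch {

    /**
     * Moves every regular file directly under srcDir into dstDir using a
     * fixed-size thread pool, and returns the number of files moved.
     * Local-filesystem stand-in; Hive would go through Hadoop's FileSystem.
     */
    public static int moveFiles(Path srcDir, Path dstDir, int threads)
            throws IOException, InterruptedException, ExecutionException {
        Files.createDirectories(dstDir);

        // Collect the part files first so the directory listing is closed
        // before any file is moved out from under it.
        List<Path> files = new ArrayList<>();
        try (Stream<Path> listing = Files.list(srcDir)) {
            listing.filter(Files::isRegularFile).forEach(files::add);
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (Path src : files) {
                Path dst = dstDir.resolve(src.getFileName());
                // One move per task, so slow per-file operations can overlap.
                Callable<Path> task =
                        () -> Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
                pending.add(pool.submit(task));
            }
            // Wait for every move; get() rethrows the first failure it meets.
            for (Future<Path> f : pending) {
                f.get();
            }
            return pending.size();
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that, unlike a single directory rename on HDFS, this is not atomic: a failure partway through leaves some files in the source and some in the destination, so the caller would need its own cleanup or retry handling.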

> Parallelize jobClose phases
> ---------------------------
>
>                 Key: HIVE-14128
>                 URL: https://issues.apache.org/jira/browse/HIVE-14128
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>    Affects Versions: 1.2.0, 2.0.0, 2.1.0
>            Reporter: Ashutosh Chauhan
>            Assignee: Ashutosh Chauhan
>         Attachments: HIVE-14128.1.patch, HIVE-14128.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
