Sergio Peña commented on HIVE-14776:

You're right, distcp does not use S3 as a temporary location. While debugging 
the code, I saw a '/user/hdfs/.Trash' directory created on S3 with data files 
being created in it, but after more investigation, I saw that they were copied 
by Hive when using INSERT OVERWRITE (old data being backed up). 

Anyway, distcp is still slower than not using distcp at all. I have no idea 
why. I ran several tests with different file sizes (see times below when copied a 

S3 with distcp: 93s
S3 with no distcp: 37s

S3 with distcp: 255s
S3 with no distcp: 147s

INSERT ... SELECT statements are going to create several files depending on the 
MR jobs and HDFS block sizes, and they might be slower than 5G. 

The S3A adapter should already manage multi-part uploads using the Amazon API. 
Is this probably why distcp + s3a do not work well together? 

> Skip 'distcp' call when copying data from HDFS to S3
> ----------------------------------------------------
>                 Key: HIVE-14776
>                 URL: https://issues.apache.org/jira/browse/HIVE-14776
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Hive
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>         Attachments: HIVE-14776.1.patch, HIVE-14776.2.patch
> Hive uses 'distcp' to copy files in parallel between HDFS encryption zones 
> when the {{hive.exec.copyfile.maxsize}} threshold is lower than the size of 
> the file to copy. This 'distcp' is also executed when copying to S3, but it 
> results in slower copies.
> We should not invoke distcp when copying to blobstore systems.
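
The rule proposed in the description can be sketched roughly as follows. This 
is only an illustration, not the code from the attached patches: the class 
name CopyPolicy, both method names, the 32 MB threshold value and the list of 
schemes treated as blobstores are all assumptions made for the example, and 
the encryption-zone check mentioned in the description is left out.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of Hive.
final class CopyPolicy {
    // Assumed set of URI schemes that count as blobstores for this sketch.
    private static final Set<String> BLOBSTORE_SCHEMES = Set.of("s3", "s3a", "s3n");

    // True when the destination URI uses one of the assumed blobstore schemes.
    static boolean isBlobstore(URI dest) {
        String scheme = dest.getScheme();
        return scheme != null
            && BLOBSTORE_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
    }

    // distcp is considered only when the file is larger than the threshold
    // (the role played by hive.exec.copyfile.maxsize), and never when the
    // destination is a blobstore.
    static boolean shouldUseDistCp(URI dest, long fileSizeBytes, long maxSizeBytes) {
        if (isBlobstore(dest)) {
            return false;
        }
        return fileSizeBytes > maxSizeBytes;
    }

    public static void main(String[] args) {
        long threshold = 32L * 1024 * 1024; // assumed threshold for the example
        long fileSize = 64L * 1024 * 1024;  // a file larger than the threshold

        URI hdfsDest = URI.create("hdfs://namenode/warehouse/t1");
        URI s3Dest = URI.create("s3a://bucket/warehouse/t1");

        System.out.println("hdfs: " + shouldUseDistCp(hdfsDest, fileSize, threshold));
        System.out.println("s3a:  " + shouldUseDistCp(s3Dest, fileSize, threshold));
    }
}
```

With these assumptions a large file going to HDFS would still be handed to 
distcp, while the same file going to an s3a:// destination would be copied 
directly.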

This message was sent by Atlassian JIRA