Sahil Takiar updated HIVE-14864:
    Attachment: HIVE-14864.patch

Attaching a preliminary patch just for reference. The patch is still a WIP.

The changes are pretty simple: {{FileUtils.copy}} now uses {{getContentSummary}} 
to get the total size and the number of files under the folder. It triggers a 
Distcp job based on both of those values.

If only a single file needs to be copied, the {{ContentSummary}} length will be 
the size of that file, and the number of files under it will be 1.

For now the logic is pretty simple: if the number of files exceeds the threshold 
set by {{hive.exec.copyfile.maxnumfiles}} (which defaults to 1) and the total 
size of the files exceeds the threshold set by {{hive.exec.copyfile.maxsize}} 
(which defaults to 32 MB), the Distcp job will be triggered.

So basically, Distcp runs for any folder that contains more than 1 file and 
whose total contents are greater than 32 MB.
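
As a rough illustration of the check described above (not the code in the attached patch), the decision can be sketched as a standalone method. The class name {{DistCpDecision}}, the method {{shouldRunDistCp}} and the constant names are hypothetical; the default values are the ones quoted in this comment, and in Hive the two inputs would come from the {{ContentSummary}} of the source path.

```java
// Hypothetical sketch of the threshold check described in this comment.
// It is NOT the code from HIVE-14864.patch; names are made up for illustration.
class DistCpDecision {

    // Defaults as quoted in the comment above (assumed, not read from HiveConf).
    static final long DEFAULT_MAX_NUM_FILES = 1L;                  // hive.exec.copyfile.maxnumfiles
    static final long DEFAULT_MAX_SIZE_BYTES = 32L * 1024 * 1024;  // hive.exec.copyfile.maxsize (32 MB)

    /**
     * Returns true only when BOTH thresholds are exceeded:
     * more files than maxNumFiles and more bytes than maxSizeBytes.
     */
    static boolean shouldRunDistCp(long totalBytes, long fileCount,
                                   long maxSizeBytes, long maxNumFiles) {
        return fileCount > maxNumFiles && totalBytes > maxSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;

        // Folder with 10 files totalling 64 MB: both thresholds exceeded.
        System.out.println(shouldRunDistCp(64 * mb, 10,
                DEFAULT_MAX_SIZE_BYTES, DEFAULT_MAX_NUM_FILES));

        // Single 64 MB file: the file count does not exceed 1, so no Distcp.
        System.out.println(shouldRunDistCp(64 * mb, 1,
                DEFAULT_MAX_SIZE_BYTES, DEFAULT_MAX_NUM_FILES));

        // Folder with 10 small files totalling 1 MB: size threshold not exceeded.
        System.out.println(shouldRunDistCp(1 * mb, 10,
                DEFAULT_MAX_SIZE_BYTES, DEFAULT_MAX_NUM_FILES));
    }
}
```

Note that with these defaults a single large file never triggers Distcp under this rule, because its file count of 1 does not exceed the threshold.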

> Distcp is not called from MoveTask when src is a directory
> ----------------------------------------------------------
>                 Key: HIVE-14864
>                 URL: https://issues.apache.org/jira/browse/HIVE-14864
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Vihang Karajgaonkar
>            Assignee: Sahil Takiar
>         Attachments: HIVE-14864.patch
> In FileUtils.java the following code does not get executed even when src 
> directory size is greater than HIVE_EXEC_COPYFILE_MAXSIZE because 
> srcFS.getFileStatus(src).getLen() returns 0 when src is a directory. We 
> should use srcFS.getContentSummary(src).getLength() instead.
> {noformat}
>     /* Run distcp if source file/dir is too big */
>     if (srcFS.getUri().getScheme().equals("hdfs") &&
>         srcFS.getFileStatus(src).getLen() > conf.getLongVar(HiveConf.ConfVars.HIVE_EXEC_COPYFILE_MAXSIZE)) {
>       LOG.info("Source is " + srcFS.getFileStatus(src).getLen() + " bytes. (MAX: " + conf.getLongVar(HiveConf.ConfVars.HIVE_EXEC_COPYFILE_MAXSIZE) + ")");
>       LOG.info("Launch distributed copy (distcp) job.");
>       HiveConfUtil.updateJobCredentialProviders(conf);
>       copied = shims.runDistCp(src, dst, conf);
>       if (copied && deleteSource) {
>         srcFS.delete(src, true);
>       }
>     }
> {noformat}

This message was sent by Atlassian JIRA
