[ 
https://issues.apache.org/jira/browse/HIVE-16999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bing Li reassigned HIVE-16999:
------------------------------

    Assignee: Bing Li

> Performance bottleneck in the ADD FILE/ARCHIVE commands for an HDFS resource
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-16999
>                 URL: https://issues.apache.org/jira/browse/HIVE-16999
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Sailee Jain
>            Assignee: Bing Li
>            Priority: Critical
>
> Adding a resource that already resides on HDFS to the distributed cache is a 
> performance bottleneck. 
> The commands used are:
> {code:java}
> 1. ADD ARCHIVE "hdfs://some_dir/archive.tar"
> 2. ADD FILE "hdfs://some_dir/file.txt"
> {code}
> Here is the log for the ADD ARCHIVE operation:
> {noformat}
>  converting to local hdfs://some_dir/archive.tar
>  Added resources: [hdfs://some_dir/archive.tar]
> {noformat}
> Hive downloads the resource to the local filesystem (shown in the log by 
> "converting to local"). 
> {color:#d04437}Ideally there is no need to bring the file to the local 
> filesystem, because the operation only copies the file from one 
> location on HDFS to another location on HDFS (the distributed cache).{color}
> This causes a significant slowdown when the resource is a large file 
> and every command needs the same resource.
> After debugging, the affected code was found to be:
> {code:java}
> public List<String> add_resources(ResourceType t, Collection<String> values, boolean convertToUnix)
>       throws RuntimeException {
>     Set<String> resourceSet = resourceMaps.getResourceSet(t);
>     Map<String, Set<String>> resourcePathMap = resourceMaps.getResourcePathMap(t);
>     Map<String, Set<String>> reverseResourcePathMap = resourceMaps.getReverseResourcePathMap(t);
>     List<String> localized = new ArrayList<String>();
>     try {
>       for (String value : values) {
>         String key;
>         // get the local path of downloaded jars
>         List<URI> downloadedURLs = resolveAndDownload(t, value, convertToUnix);
>         // ...
> {code}
> {code:java}
>   List<URI> resolveAndDownload(ResourceType t, String value, boolean convertToUnix)
>       throws URISyntaxException, IOException {
>     URI uri = createURI(value);
>     if (getURLType(value).equals("file")) {
>       return Arrays.asList(uri);
>     } else if (getURLType(value).equals("ivy")) {
>       return dependencyResolver.downloadDependencies(uri);
>     } else { // goes here for HDFS
>       // When the resource is not local, it is downloaded to the local machine here.
>       return Arrays.asList(createURI(downloadResource(value, convertToUnix)));
>     }
>   }
> {code}
> Here, resolveAndDownload() always calls downloadResource() for a resource on 
> an external filesystem. It should take into account that when the resource 
> is already on the same HDFS, copying it to the local machine is unnecessary 
> and can be skipped for better performance.
> Thanks,
> Sailee
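
The check proposed in the description could look roughly like the sketch below. It is only an illustration under stated assumptions, not the change that was made in Hive: the class SameFsCheck, the methods onSameFileSystem() and needsLocalDownload(), and the example namenode addresses are all hypothetical. It uses only java.net.URI so it runs standalone; inside Hive the default filesystem URI would presumably come from the Hadoop configuration (for example FileSystem.getUri()) rather than being passed in by hand.

```java
import java.net.URI;

// Hypothetical helper: decides whether a resource needs to be copied to the
// local machine before it is added to the distributed cache.
class SameFsCheck {

    // True when the resource lives on the same filesystem as defaultFs.
    // Assumption: "same filesystem" means same scheme and same authority;
    // a resource URI without an authority (hdfs:///path) is taken to mean
    // the default namenode.
    static boolean onSameFileSystem(URI resource, URI defaultFs) {
        String scheme = resource.getScheme();
        if (scheme == null || !scheme.equalsIgnoreCase(defaultFs.getScheme())) {
            return false;
        }
        String resourceAuthority = resource.getAuthority();
        return resourceAuthority == null
            || resourceAuthority.equalsIgnoreCase(defaultFs.getAuthority());
    }

    // Mirrors the branch structure of resolveAndDownload(): local files never
    // need a download, and remote resources need one only when they are not
    // already on the cluster's own filesystem. Ivy URIs are out of scope here.
    static boolean needsLocalDownload(String value, URI defaultFs) {
        URI uri = URI.create(value);
        if (uri.getScheme() == null || "file".equalsIgnoreCase(uri.getScheme())) {
            return false; // already local
        }
        return !onSameFileSystem(uri, defaultFs);
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://nn1:8020"); // hypothetical cluster address
        System.out.println(needsLocalDownload("hdfs://nn1:8020/tmp/archive.tar", defaultFs));
        System.out.println(needsLocalDownload("hdfs://other:8020/tmp/archive.tar", defaultFs));
        System.out.println(needsLocalDownload("file:///tmp/file.txt", defaultFs));
    }
}
```

With a check of this kind, the final else branch of resolveAndDownload() could return the original HDFS URI unchanged when the resource is already on the cluster filesystem, and fall back to downloadResource() otherwise. Whether later code paths rely on the resource being a local path would still have to be verified.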



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)