[
https://issues.apache.org/jira/browse/HIVE-16999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bing Li reassigned HIVE-16999:
------------------------------
Assignee: Bing Li
> Performance bottleneck in the ADD FILE/ARCHIVE commands for an HDFS resource
> ----------------------------------------------------------------------------
>
> Key: HIVE-16999
> URL: https://issues.apache.org/jira/browse/HIVE-16999
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Reporter: Sailee Jain
> Assignee: Bing Li
> Priority: Critical
>
> A performance bottleneck was found when adding a resource that resides on HDFS to
> the distributed cache.
> The commands used are:
> {code:java}
> 1. ADD ARCHIVE "hdfs://some_dir/archive.tar"
> 2. ADD FILE "hdfs://some_dir/file.txt"
> {code}
> Here is the log corresponding to the ADD ARCHIVE operation:
> {noformat}
> converting to local hdfs://some_dir/archive.tar
> Added resources: [hdfs://some_dir/archive.tar
> {noformat}
> Hive downloads the resource to the local filesystem, as shown in the log by
> "converting to local".
> {color:#d04437}Ideally there is no need to bring the file to the local
> filesystem when this operation only copies the file from one HDFS
> location to another HDFS location (the distributed cache).{color}
> This becomes a significant bottleneck when the resource is a large file
> and every command needs the same resource.
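> As a minimal sketch (not existing Hive code), the same-filesystem case could be
> detected by comparing the resource URI against the cluster's default filesystem
> URI; the SameFsCheck class and isOnDefaultFs() helper below are hypothetical names:
> {code:java}
> import java.net.URI;
> import java.net.URISyntaxException;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
>
> public class SameFsCheck {
>   // Returns true when the resource already lives on the default filesystem
>   // (e.g. the cluster's HDFS), so localizing it first would only add an
>   // extra download step.
>   static boolean isOnDefaultFs(String resource, Configuration conf) throws URISyntaxException {
>     URI defaultFs = FileSystem.getDefaultUri(conf); // e.g. hdfs://namenode:8020
>     URI resourceUri = new URI(resource);
>     return defaultFs.getScheme().equalsIgnoreCase(resourceUri.getScheme())
>         && defaultFs.getAuthority() != null
>         && defaultFs.getAuthority().equalsIgnoreCase(resourceUri.getAuthority());
>   }
> }
> {code}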
> After some debugging, the impacted piece of code was found to be:
> {code:java}
>   public List<String> add_resources(ResourceType t, Collection<String> values,
>       boolean convertToUnix) throws RuntimeException {
>     Set<String> resourceSet = resourceMaps.getResourceSet(t);
>     Map<String, Set<String>> resourcePathMap = resourceMaps.getResourcePathMap(t);
>     Map<String, Set<String>> reverseResourcePathMap = resourceMaps.getReverseResourcePathMap(t);
>     List<String> localized = new ArrayList<String>();
>     try {
>       for (String value : values) {
>         String key;
>         // get the local path of downloaded jars
>         List<URI> downloadedURLs = resolveAndDownload(t, value, convertToUnix);
>         ...
> {code}
> {code:java}
>   List<URI> resolveAndDownload(ResourceType t, String value, boolean convertToUnix)
>       throws URISyntaxException, IOException {
>     URI uri = createURI(value);
>     if (getURLType(value).equals("file")) {
>       return Arrays.asList(uri);
>     } else if (getURLType(value).equals("ivy")) {
>       return dependencyResolver.downloadDependencies(uri);
>     } else { // goes here for HDFS
>       // When the resource is not local, it is downloaded to the local machine.
>       return Arrays.asList(createURI(downloadResource(value, convertToUnix)));
>     }
>   }
> {code}
> Here, resolveAndDownload() always calls the downloadResource() API for any
> non-local filesystem. It should take into account that when the resource is
> already on the same HDFS, bringing it to the local machine is an unnecessary
> step and can be skipped for better performance.
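> A rough sketch of the kind of change suggested above, reusing createURI(),
> getURLType(), dependencyResolver and downloadResource() from the snippet;
> the isOnDefaultFs() check and the conf field are hypothetical additions,
> not existing Hive APIs:
> {code:java}
>   List<URI> resolveAndDownload(ResourceType t, String value, boolean convertToUnix)
>       throws URISyntaxException, IOException {
>     URI uri = createURI(value);
>     if (getURLType(value).equals("file")) {
>       return Arrays.asList(uri);
>     } else if (getURLType(value).equals("ivy")) {
>       return dependencyResolver.downloadDependencies(uri);
>     } else if (isOnDefaultFs(value, conf)) {
>       // Resource already lives on the cluster's HDFS: pass the original URI
>       // along instead of pulling the file to the local disk first.
>       return Arrays.asList(uri);
>     } else {
>       // Resource on some other remote filesystem: fall back to localizing it.
>       return Arrays.asList(createURI(downloadResource(value, convertToUnix)));
>     }
>   }
> {code}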
> Thanks,
> Sailee
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)