[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974576#action_12974576
 ] 

Min Zhou commented on MAPREDUCE-1981:
-------------------------------------

@Hairong

Thanks for sharing this; it helps a lot.  We currently run 0.19.1, and after 
applying your patch our namenode sends a LocatedFileStatus array over the wire 
rather than a DirectoryListing object. That is what caused the first bug. 

I have another idea for shortening the client's getListing time: cache the 
split files in the DistributedCache.  We often scan the same Hive table (or 
HDFS directory) many times, and there is no need to call the NameNode's 
getListing again and again if the directory has not changed. The idea is to 
call getListing once and cache the resulting splits; subsequent job 
submissions then reuse the cache without any getListing calls. 
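
To make the idea concrete, here is a rough sketch in plain Java. It is only an 
illustration under assumptions, not Hadoop code: ListingCache, Lister and the 
"stamp" argument are hypothetical names, the Lister stands in for the 
expensive NameNode getListing call, and the cache is an in-memory map rather 
than the DistributedCache. The caller is assumed to supply some cheap change 
indicator for the directory (for example a modification time); how reliable 
such an indicator is on a given HDFS version would need to be checked.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: remember a directory listing until its stamp changes. */
final class ListingCache {

    /** Hypothetical stand-in for the expensive NameNode getListing RPC. */
    interface Lister {
        List<String> list(String dir);
    }

    /** One cached listing together with the stamp it was taken at. */
    private static final class Entry {
        final long stamp;
        final List<String> files;

        Entry(long stamp, List<String> files) {
            this.stamp = stamp;
            this.files = files;
        }
    }

    private final Lister lister;
    private final Map<String, Entry> cache = new HashMap<String, Entry>();

    ListingCache(Lister lister) {
        this.lister = lister;
    }

    /**
     * Returns the listing for dir. The Lister is only invoked when nothing is
     * cached for dir or when the caller-supplied stamp differs from the one
     * stored with the cached entry.
     */
    synchronized List<String> getListing(String dir, long stamp) {
        Entry entry = cache.get(dir);
        if (entry == null || entry.stamp != stamp) {
            List<String> fresh = new ArrayList<String>(lister.list(dir));
            entry = new Entry(stamp, Collections.unmodifiableList(fresh));
            cache.put(dir, entry);
        }
        return entry.files;
    }
}
```

With this sketch, two job submissions that pass the same stamp for the same 
directory trigger a single listing call; a changed stamp forces a fresh one. 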


> Improve getSplits performance by using listFiles, the new FileSystem API
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1981
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1981
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: job submission
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.22.0
>
>         Attachments: mapredListFiles.patch, mapredListFiles1.patch, 
> mapredListFiles2.patch, mapredListFiles3.patch, mapredListFiles4.patch, 
> mapredListFiles5.patch
>
>
> This jira will make FileInputFormat and CombineFileInputFormat use the new 
> API, thus reducing the number of RPCs to the HDFS NameNode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.