[
https://issues.apache.org/jira/browse/MAPREDUCE-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974576#action_12974576
]
Min Zhou commented on MAPREDUCE-1981:
-------------------------------------
@Hairong
Thanks for sharing, it helps a lot. We currently run 0.19.1, and after applying
your patch our namenode sends a LocatedFileStatus array over the wire rather
than a DirectoryListing object. That is why the first bug happened.
I have another idea for shortening the client's getListing time: caching split
files in the DistributedCache. We often scan the same Hive table (or HDFS
directory) many times, and there is no need to call the Namenode's getListing
again and again if the directory has not changed. My idea is to call getListing
once and cache the resulting splits, so that subsequent job submissions reuse
the cache without any further getListing calls.
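A minimal sketch of the caching idea, not taken from any patch on this issue.
Everything here is hypothetical: the ListingCache class, the in-memory map that
stands in for the DistributedCache, and the two function parameters that stand
in for a cheap "get modification time" call and the expensive getListing call.
It also assumes that a directory's modification time changes whenever its
contents change, which would need to be checked against the real file system.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.ToLongFunction;

/** Hypothetical client-side cache of directory listings, keyed by directory path. */
class ListingCache {

    /** A cached listing together with the directory modification time it was taken at. */
    private static final class Entry {
        final long modificationTime;
        final List<String> files;

        Entry(long modificationTime, List<String> files) {
            this.modificationTime = modificationTime;
            this.files = files;
        }
    }

    // Stand-in for the DistributedCache: a plain in-memory map (assumption).
    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    // Stand-in for a cheap metadata lookup on the directory (assumption).
    private final ToLongFunction<String> modificationTimeOf;

    // Stand-in for the expensive getListing round trip (assumption).
    private final Function<String, List<String>> lister;

    ListingCache(ToLongFunction<String> modificationTimeOf,
                 Function<String, List<String>> lister) {
        this.modificationTimeOf = modificationTimeOf;
        this.lister = lister;
    }

    /** Returns the cached listing if the directory is unchanged, otherwise lists it again. */
    List<String> list(String dir) {
        long mtime = modificationTimeOf.applyAsLong(dir);
        Entry cached = cache.get(dir);
        if (cached != null && cached.modificationTime == mtime) {
            return cached.files; // directory unchanged: skip the expensive call
        }
        List<String> files =
            Collections.unmodifiableList(new ArrayList<>(lister.apply(dir)));
        cache.put(dir, new Entry(mtime, files));
        return files;
    }
}
```

Note that this sketch still makes one cheap metadata call per lookup to detect
changes, whereas the proposal above aims for no namenode calls at all on a
cache hit; how staleness would be detected in that case is left open.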
> Improve getSplits performance by using listFiles, the new FileSystem API
> ------------------------------------------------------------------------
>
> Key: MAPREDUCE-1981
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1981
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: job submission
> Reporter: Hairong Kuang
> Assignee: Hairong Kuang
> Fix For: 0.22.0
>
> Attachments: mapredListFiles.patch, mapredListFiles1.patch,
> mapredListFiles2.patch, mapredListFiles3.patch, mapredListFiles4.patch,
> mapredListFiles5.patch
>
>
> This jira will make FileInputFormat and CombineFileInputFormat use the new
> API, thus reducing the number of RPCs to the HDFS NameNode.