[
https://issues.apache.org/jira/browse/MAPREDUCE-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893282#action_12893282
]
Paul Burkhardt commented on MAPREDUCE-1973:
-------------------------------------------
Yes, it appears that we are attempting to resolve the same issue. Sorry for
missing the prior ticket.
I am in favor of both implementations, since each adds the block locations to
the FileStatus object. That is reasonable because the locations are part of the
description of a file, and they are then readily accessible to the InputFormat.
My approach is more primitive and lightweight, but it is also independent of
any directory management.
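As a rough illustration of the idea, the sketch below models a file status
that already carries its block hosts, so building splits needs no further
lookup. It is plain Java with no Hadoop dependency. ToyFileStatus, ToySplit,
and ToyInputFormat are hypothetical names made up for this example, not
classes from either patch, and one split per file is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical input format: builds splits from the listing alone.
final class ToyInputFormat {
    static List<ToySplit> splitsFor(List<ToyFileStatus> listing) {
        List<ToySplit> splits = new ArrayList<>();
        for (ToyFileStatus status : listing) {
            // Hosts come straight from the status; no extra remote lookup.
            splits.add(new ToySplit(status.getPath(), status.getLength(),
                                    status.getHosts()));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<ToyFileStatus> listing = Arrays.asList(
            new ToyFileStatus("/in/a", 128L, "host1", "host2"),
            new ToyFileStatus("/in/b", 64L, "host3"));
        for (ToySplit split : splitsFor(listing)) {
            System.out.println(split.path + " -> " + Arrays.toString(split.hosts));
        }
    }
}

// Hypothetical stand-in for a FileStatus that carries its block hosts.
final class ToyFileStatus {
    private final String path;
    private final long length;
    private final String[] hosts;

    ToyFileStatus(String path, long length, String... hosts) {
        this.path = path;
        this.length = length;
        this.hosts = hosts.clone();
    }

    String getPath() { return path; }
    long getLength() { return length; }

    // Assumed accessor: hosts were filled in when the listing was built.
    String[] getHosts() { return hosts.clone(); }
}

// Hypothetical split covering one whole file (a simplification).
final class ToySplit {
    final String path;
    final long length;
    final String[] hosts;

    ToySplit(String path, long length, String[] hosts) {
        this.path = path;
        this.length = length;
        this.hosts = hosts;
    }
}
```

The point of the sketch is only that the InputFormat side reads hosts from the
object it already holds. How the hosts get populated on the remote end is
where the two implementations differ.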
> Optimize input split creation
> -----------------------------
>
> Key: MAPREDUCE-1973
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1973
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Affects Versions: 0.20.1, 0.20.2
> Environment: Intel Nehalem cluster running Red Hat.
> Reporter: Paul Burkhardt
> Priority: Minor
> Attachments: HADOOP-1973.patch
>
>
> The input split returns the locations that host the file blocks in the split.
> The locations are determined by the getBlockLocations method of the
> filesystem client, which requires a remote connection to the filesystem (i.e.
> HDFS). The remote connection is made for each file in the entire input split.
> For jobs with many input files, the network connections dominate the cost of
> writing the input split file.
> A job requests a listing of the input files from the remote filesystem and
> creates a FileStatus object as a handle for each file in the listing. The
> FileStatus object can be imbued with the necessary host information on the
> remote end and passed to the client-side in the bulk return of the listing
> request. A getHosts method of the FileStatus would then return the locations
> for the blocks comprising that file and eliminate the need for another trip
> to the remote filesystem.
> The INodeFile maintains the blocks for a file and is an obvious choice to be
> the originator for the locations of that file. It is also available to the
> FSDirectory which first creates the listing of FileStatus objects. We propose
> that the block locations be generated by the INodeFile to instantiate the
> FileStatus object during the getListing request.
> Our tests demonstrated a factor of 2000 speedup for approximately 60,000
> input files.
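The difference in remote traffic described above can be sketched as a simple
call count. This is a toy model in plain Java, not Hadoop code: ToyRemoteFs
and its methods are hypothetical, every method call stands for one remote
round trip, and the whole listing is assumed to come back in a single
response. It counts calls only and does not measure time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RoundTripDemo {
    // Current behaviour as described: one listing, then one lookup per file.
    static int perFileLookups(ToyRemoteFs fs) {
        fs.resetCalls();
        for (String path : fs.list()) {
            fs.blockHosts(path); // one more simulated remote call per file
        }
        return fs.calls();
    }

    // Proposed behaviour: hosts arrive together with the listing.
    static int bulkListing(ToyRemoteFs fs) {
        fs.resetCalls();
        int hostEntries = 0;
        for (String[] hosts : fs.listWithHosts().values()) {
            hostEntries += hosts.length; // already local, no remote call
        }
        return fs.calls();
    }

    public static void main(String[] args) {
        ToyRemoteFs fs = new ToyRemoteFs(60000);
        System.out.println("per-file lookups: " + perFileLookups(fs) + " calls");
        System.out.println("bulk listing:     " + bulkListing(fs) + " calls");
    }
}

// Hypothetical remote filesystem that counts simulated round trips.
final class ToyRemoteFs {
    private final Map<String, String[]> hostsByPath = new LinkedHashMap<>();
    private int calls;

    ToyRemoteFs(int fileCount) {
        for (int i = 0; i < fileCount; i++) {
            hostsByPath.put("/in/part-" + i, new String[] {"host" + (i % 3)});
        }
    }

    List<String> list() {
        calls++;
        return new ArrayList<>(hostsByPath.keySet());
    }

    String[] blockHosts(String path) {
        calls++;
        return hostsByPath.get(path).clone();
    }

    Map<String, String[]> listWithHosts() {
        calls++;
        return new LinkedHashMap<>(hostsByPath);
    }

    int calls() { return calls; }
    void resetCalls() { calls = 0; }
}
```

For N files the model makes N + 1 calls with per-file lookups and 1 call with
the bulk listing. It does not attempt to reproduce the measured speedup, which
depends on the cost of each connection in a real cluster.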