[
https://issues.apache.org/jira/browse/MAPREDUCE-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893420#action_12893420
]
Paul Burkhardt commented on MAPREDUCE-1973:
-------------------------------------------
I haven't really studied HDFS-202, so my following comments may need
clarification. I surmise that HDFS-202 takes the approach of getting the block
locations in batch for a set of file paths and returning a new class of file
status objects that carry the location information. That method has to manage
recursive traversal of sub-directories and also follow symbolic links. For very
large listings the implementation has to fetch the block locations
incrementally to avoid running out of memory. Thus, the approach couples the
creation of file status objects to the retrieval of those objects, which are
orthogonal operations.
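As a rough illustration of the incremental retrieval mentioned above, here is a minimal Java sketch. It is not the HDFS-202 API: the class name PagedListing, the batchSize parameter, and the in-memory list standing in for the remote namespace are all hypothetical. It only shows how a client can hold one bounded batch of entries at a time instead of the whole listing.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Hypothetical sketch: iterates over a large listing in fixed-size batches.
 * The "remote" list stands in for a server-side directory listing; each call
 * to fetchNextBatch() models one round trip.
 */
final class PagedListing implements Iterator<String> {
    private final List<String> remote;   // stand-in for the remote listing
    private final int batchSize;         // upper bound on entries held client-side
    private int nextRemoteIndex = 0;     // cursor into the remote listing
    private List<String> batch = new ArrayList<>();
    private int posInBatch = 0;
    int fetches = 0;                     // number of simulated round trips

    PagedListing(List<String> remote, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.remote = remote;
        this.batchSize = batchSize;
    }

    /** Copies the next slice of at most batchSize entries into the local batch. */
    private void fetchNextBatch() {
        int end = Math.min(nextRemoteIndex + batchSize, remote.size());
        batch = new ArrayList<>(remote.subList(nextRemoteIndex, end));
        nextRemoteIndex = end;
        posInBatch = 0;
        fetches++;
    }

    @Override
    public boolean hasNext() {
        if (posInBatch < batch.size()) {
            return true;
        }
        if (nextRemoteIndex >= remote.size()) {
            return false;                // nothing left on the remote side
        }
        fetchNextBatch();
        return posInBatch < batch.size();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return batch.get(posInBatch++);
    }
}
```

The batch size trades client memory against the number of round trips: a smaller batch keeps less in memory but fetches more often.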
The MAPREDUCE-1973 implementation attempts to address only the creation of file
status objects with block location information, although the creation happens
at the point of a file listing request that returns all file handles. However,
that listing request can be implemented in various ways to address performance
and directory management, similar to the proposal in HDFS-202.
Perhaps the two patches could complement each other.
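The difference between looking up hosts per file and returning them with the listing can be sketched in a few lines of Java. This is a toy model under stated assumptions, not the attached patch: FakeNameNode, StatusWithHosts, and SplitSketch are hypothetical names, and a counter stands in for real network round trips. Only the getHosts accessor mirrors the method name proposed in the issue description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical file handle that already carries the hosts of its blocks. */
final class StatusWithHosts {
    private final String path;
    private final List<String> hosts;

    StatusWithHosts(String path, List<String> hosts) {
        this.path = path;
        this.hosts = Collections.unmodifiableList(new ArrayList<>(hosts));
    }

    String getPath() { return path; }

    /** Purely local: no further trip to the remote filesystem. */
    List<String> getHosts() { return hosts; }
}

/** Hypothetical stand-in for the remote namespace; counts simulated round trips. */
final class FakeNameNode {
    private final Map<String, List<String>> hostsByPath = new LinkedHashMap<>();
    int roundTrips = 0;

    void addFile(String path, List<String> hosts) {
        hostsByPath.put(path, new ArrayList<>(hosts));
    }

    /** Baseline listing: one round trip, paths only. */
    List<String> listPaths() {
        roundTrips++;
        return new ArrayList<>(hostsByPath.keySet());
    }

    /** Baseline host lookup: one round trip per file. */
    List<String> getHosts(String path) {
        roundTrips++;
        return hostsByPath.get(path);
    }

    /** Bulk listing: one round trip, each status is built with its hosts. */
    List<StatusWithHosts> listWithHosts() {
        roundTrips++;
        List<StatusWithHosts> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : hostsByPath.entrySet()) {
            out.add(new StatusWithHosts(e.getKey(), e.getValue()));
        }
        return out;
    }
}

final class SplitSketch {
    /** Lists the files, then asks the remote side for hosts file by file. */
    static int perFileRoundTrips(FakeNameNode nn) {
        int before = nn.roundTrips;
        for (String path : nn.listPaths()) {
            nn.getHosts(path);
        }
        return nn.roundTrips - before;
    }

    /** Lists the files once and reads hosts from the returned handles. */
    static int bulkRoundTrips(FakeNameNode nn) {
        int before = nn.roundTrips;
        for (StatusWithHosts status : nn.listWithHosts()) {
            status.getHosts();
        }
        return nn.roundTrips - before;
    }

    public static void main(String[] args) {
        FakeNameNode nn = new FakeNameNode();
        for (int i = 0; i < 1000; i++) {
            nn.addFile("/in/part-" + i, Collections.singletonList("host" + (i % 4)));
        }
        System.out.println("per-file round trips: " + perFileRoundTrips(nn));
        System.out.println("bulk round trips:     " + bulkRoundTrips(nn));
    }
}
```

With 1000 files the per-file path makes 1001 simulated round trips (one listing plus one lookup per file) and the bulk path makes 1. The toy model says nothing about payload size or server-side cost, which a real listing with locations would also have to account for.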
> Optimize input split creation
> -----------------------------
>
> Key: MAPREDUCE-1973
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1973
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Affects Versions: 0.20.1, 0.20.2
> Environment: Intel Nehalem cluster running Red Hat.
> Reporter: Paul Burkhardt
> Priority: Minor
> Attachments: HADOOP-1973.patch
>
>
> The input split returns the locations that host the file blocks in the split.
> The locations are determined by the getBlockLocations method of the
> filesystem client which requires a remote connection to the filesystem (i.e.
> HDFS). A remote connection is made for each file in the entire input split.
> For jobs with many input files, these network connections dominate the cost of
> writing the input split file.
> A job requests a listing of the input files from the remote filesystem and
> creates a FileStatus object as a handle for each file in the listing. The
> FileStatus object can be imbued with the necessary host information on the
> remote end and passed to the client-side in the bulk return of the listing
> request. A getHosts method of the FileStatus would then return the locations
> for the blocks comprising that file and eliminate the need for another trip
> to the remote filesystem.
> The INodeFile maintains the blocks for a file and is an obvious choice to be
> the originator for the locations of that file. It is also available to the
> FSDirectory which first creates the listing of FileStatus objects. We propose
> that the block locations be generated by the INodeFile to instantiate the
> FileStatus object during the getListing request.
> Our tests demonstrated a factor of 2000 speedup for approximately 60,000
> input files.