[ https://issues.apache.org/jira/browse/HDFS-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Burkhardt moved MAPREDUCE-1973 to HDFS-1402:
-------------------------------------------------
Project: Hadoop HDFS (was: Hadoop Map/Reduce)
Key: HDFS-1402 (was: MAPREDUCE-1973)
Affects Version/s: 0.22.0 (was: 0.20.1) (was: 0.20.2)
> Optimize input split creation
> -----------------------------
>
> Key: HDFS-1402
> URL: https://issues.apache.org/jira/browse/HDFS-1402
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 0.22.0
> Environment: Intel Nehalem cluster running Red Hat.
> Reporter: Paul Burkhardt
> Priority: Minor
> Attachments: HADOOP-1973.patch
>
>
> The input split returns the locations that host the file blocks in the split.
> The locations are determined by the getBlockLocations method of the
> filesystem client, which requires a remote connection to the filesystem (e.g.
> HDFS). A separate remote connection is made for each file in the input split.
> For jobs with many input files, these network connections dominate the cost of
> writing the input split file.
>
> A job requests a listing of the input files from the remote filesystem and
> creates a FileStatus object as a handle for each file in the listing. The
> FileStatus object can be populated with the necessary host information on the
> remote end and passed to the client side in the bulk return of the listing
> request. A getHosts method on FileStatus would then return the locations
> of the blocks comprising that file, eliminating the need for another trip
> to the remote filesystem.
>
> The INodeFile maintains the blocks for a file and is an obvious choice to be
> the source of the locations for that file. It is also available to the
> FSDirectory, which first creates the listing of FileStatus objects. We propose
> that the INodeFile generate the block locations used to instantiate the
> FileStatus object during the getListing request.
>
> Our tests demonstrated a factor of 2000 speedup for approximately 60,000
> input files.
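
The round-trip argument in the description can be illustrated with a minimal Java sketch. It is not taken from the attached patch and deliberately does not use the Hadoop API: FakeRemoteFs, FileEntry, listWithHosts, getBlockHosts, and the roundTrips counter are hypothetical stand-ins, and the "remote" filesystem is an in-memory map. The sketch only counts simulated remote calls for the two approaches; it says nothing about real timings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SplitLocationSketch {

    /** Hypothetical listing entry; NOT the real org.apache.hadoop.fs.FileStatus. */
    static final class FileEntry {
        final String path;
        final List<String> hosts; // empty when the listing did not carry host info

        FileEntry(String path, List<String> hosts) {
            this.path = path;
            this.hosts = hosts;
        }

        /** Mirrors the getHosts idea from the description: no remote call needed. */
        List<String> getHosts() {
            return hosts;
        }
    }

    /** Hypothetical in-memory stand-in for the remote filesystem. */
    static final class FakeRemoteFs {
        private final Map<String, List<String>> hostsByPath = new LinkedHashMap<>();
        int roundTrips = 0; // incremented once per simulated remote call

        void addFile(String path, List<String> hosts) {
            hostsByPath.put(path, hosts);
        }

        /** Listing without host information: one simulated round trip. */
        List<FileEntry> list() {
            roundTrips++;
            List<FileEntry> entries = new ArrayList<>();
            for (String path : hostsByPath.keySet()) {
                entries.add(new FileEntry(path, Collections.<String>emptyList()));
            }
            return entries;
        }

        /** Per-file location lookup: one simulated round trip per call. */
        List<String> getBlockHosts(String path) {
            roundTrips++;
            return hostsByPath.get(path);
        }

        /** Assumed variant of the proposal: the listing itself carries the hosts. */
        List<FileEntry> listWithHosts() {
            roundTrips++;
            List<FileEntry> entries = new ArrayList<>();
            for (Map.Entry<String, List<String>> e : hostsByPath.entrySet()) {
                entries.add(new FileEntry(e.getKey(), e.getValue()));
            }
            return entries;
        }
    }

    /** One listing call, then one location lookup per file. */
    static Map<String, List<String>> locatePerFile(FakeRemoteFs fs) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (FileEntry entry : fs.list()) {
            result.put(entry.path, fs.getBlockHosts(entry.path));
        }
        return result;
    }

    /** A single listing call; hosts are read from the returned entries. */
    static Map<String, List<String>> locateFromListing(FakeRemoteFs fs) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (FileEntry entry : fs.listWithHosts()) {
            result.put(entry.path, entry.getHosts());
        }
        return result;
    }

    public static void main(String[] args) {
        FakeRemoteFs fs = new FakeRemoteFs();
        fs.addFile("/input/part-0", Arrays.asList("host1", "host2"));
        fs.addFile("/input/part-1", Arrays.asList("host2", "host3"));
        fs.addFile("/input/part-2", Arrays.asList("host1", "host3"));

        locatePerFile(fs);
        System.out.println("per-file round trips: " + fs.roundTrips);

        fs.roundTrips = 0;
        locateFromListing(fs);
        System.out.println("bulk-listing round trips: " + fs.roundTrips);
    }
}
```

In this sketch, N input files cost N + 1 simulated calls on the per-file path and a single call on the bulk-listing path, while both return the same path-to-hosts mapping.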
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.