Optimize input split creation
-----------------------------
Key: MAPREDUCE-1973
URL: https://issues.apache.org/jira/browse/MAPREDUCE-1973
Project: Hadoop Map/Reduce
Issue Type: Improvement
Affects Versions: 0.20.2, 0.20.1
Environment: Intel Nehalem cluster running Red Hat.
Reporter: Paul Burkhardt
Priority: Minor
Each input split records the locations (hosts) of the file blocks it covers.
The locations are obtained from the getBlockLocations method of the filesystem
client, which requires a remote call to the filesystem (e.g. HDFS). A separate
remote call is made for every file in the job's input. For jobs with many
input files, these network round trips dominate the cost of writing the input
split file.
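The cost pattern described above can be illustrated with a small model. This is only a sketch: CountingFs, getBlockHosts and PerFileLookupSketch are hypothetical names invented for illustration and are not Hadoop classes. The model simply counts simulated remote calls when each file needs its own location lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy stand-in for a remote filesystem client; NOT the Hadoop API.
class CountingFs {
    private final Map<String, String[]> hostsByPath = new TreeMap<>();
    int roundTrips = 0; // number of simulated remote calls

    void addFile(String path, String... hosts) {
        hostsByPath.put(path, hosts);
    }

    // One remote call returns the whole listing (paths only).
    List<String> listFiles() {
        roundTrips++;
        return new ArrayList<>(hostsByPath.keySet());
    }

    // One remote call per file, mirroring a per-file block-location lookup.
    String[] getBlockHosts(String path) {
        roundTrips++;
        return hostsByPath.get(path);
    }
}

class PerFileLookupSketch {
    // Counts one split per file, looking up hosts file by file.
    static int createSplits(CountingFs fs) {
        int splits = 0;
        for (String path : fs.listFiles()) {
            String[] hosts = fs.getBlockHosts(path); // extra trip for every file
            if (hosts != null) {
                splits++;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        CountingFs fs = new CountingFs();
        for (int i = 0; i < 1000; i++) {
            fs.addFile("/input/part-" + i, "host" + (i % 3));
        }
        int splits = createSplits(fs);
        System.out.println(splits + " splits, " + fs.roundTrips + " round trips");
    }
}
```

In this model, N input files cost N + 1 simulated round trips: one for the listing and one per file.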
A job requests a listing of the input files from the remote filesystem and
creates a FileStatus object as a handle for each file in the listing. The
FileStatus object could be populated with the necessary host information on
the remote end and returned to the client in the bulk response to the listing
request. A getHosts method on FileStatus would then return the locations of
the blocks comprising that file, eliminating the need for another trip to the
remote filesystem.
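A minimal client-side sketch of this idea follows. HostAwareStatus, BulkListingFs and BulkListingSketch are hypothetical names and do not exist in Hadoop. Only the getHosts accessor comes from the proposal, and its return type here (an array of host names) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a FileStatus that carries host data; NOT Hadoop's class.
class HostAwareStatus {
    private final String path;
    private final String[] hosts;

    HostAwareStatus(String path, String[] hosts) {
        this.path = path;
        this.hosts = hosts.clone();
    }

    String getPath() {
        return path;
    }

    // The accessor proposed in the issue: answered locally, no remote call.
    String[] getHosts() {
        return hosts.clone();
    }
}

// Toy remote filesystem whose listing response already includes the hosts.
class BulkListingFs {
    private final List<HostAwareStatus> files = new ArrayList<>();
    int roundTrips = 0; // number of simulated remote calls

    void addFile(String path, String... hosts) {
        files.add(new HostAwareStatus(path, hosts));
    }

    // A single bulk call returns every status together with its hosts.
    List<HostAwareStatus> listStatus() {
        roundTrips++;
        return new ArrayList<>(files);
    }
}

class BulkListingSketch {
    // Counts one split per file using only data from the listing response.
    static int createSplits(BulkListingFs fs) {
        int splits = 0;
        for (HostAwareStatus status : fs.listStatus()) {
            if (status.getHosts().length > 0) { // local read, no extra trip
                splits++;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        BulkListingFs fs = new BulkListingFs();
        for (int i = 0; i < 1000; i++) {
            fs.addFile("/input/part-" + i, "host" + (i % 3));
        }
        int splits = createSplits(fs);
        System.out.println(splits + " splits, round trips: " + fs.roundTrips);
    }
}
```

In this model the number of simulated round trips stays at one regardless of how many files the listing contains.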
The INodeFile maintains the blocks for a file and is an obvious choice as the
source of the locations for that file. It is also available to the
FSDirectory, which first creates the listing of FileStatus objects. We propose
that the block locations be generated by the INodeFile and used to populate
the FileStatus object during the getListing request.
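The server-side half of the proposal might look roughly like the sketch below. ToyINodeFile, ToyListingEntry and ToyDirectory are hypothetical simplifications, not the real INodeFile, FileStatus or FSDirectory classes. Returning the de-duplicated union of replica hosts across all blocks is an assumption about what the listing would carry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a file inode: each block lists the hosts holding a replica.
class ToyINodeFile {
    final String name;
    private final List<List<String>> blockReplicas = new ArrayList<>();

    ToyINodeFile(String name) {
        this.name = name;
    }

    ToyINodeFile addBlock(String... hosts) {
        blockReplicas.add(Arrays.asList(hosts));
        return this;
    }

    // Union of replica hosts over all blocks, in first-seen order (an assumption).
    String[] hosts() {
        Set<String> seen = new LinkedHashSet<>();
        for (List<String> replicas : blockReplicas) {
            seen.addAll(replicas);
        }
        return seen.toArray(new String[0]);
    }
}

// Hypothetical listing entry: a path plus the hosts computed on the server.
class ToyListingEntry {
    final String path;
    final String[] hosts;

    ToyListingEntry(String path, String[] hosts) {
        this.path = path;
        this.hosts = hosts;
    }
}

// Hypothetical directory that fills in hosts while it builds the listing.
class ToyDirectory {
    private final List<ToyINodeFile> children = new ArrayList<>();

    void add(ToyINodeFile file) {
        children.add(file);
    }

    List<ToyListingEntry> getListing(String dir) {
        List<ToyListingEntry> listing = new ArrayList<>();
        for (ToyINodeFile file : children) {
            // The inode already knows its blocks, so no further lookup is needed.
            listing.add(new ToyListingEntry(dir + "/" + file.name, file.hosts()));
        }
        return listing;
    }
}

class ServerSideListingSketch {
    public static void main(String[] args) {
        ToyDirectory dir = new ToyDirectory();
        dir.add(new ToyINodeFile("part-0").addBlock("a", "b").addBlock("b", "c"));
        for (ToyListingEntry entry : dir.getListing("/input")) {
            System.out.println(entry.path + " " + Arrays.toString(entry.hosts));
        }
    }
}
```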
Our tests demonstrated a factor of 2000 speedup for approximately 60,000 input
files.