Optimize input split creation
-----------------------------

                 Key: MAPREDUCE-1973
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1973
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
    Affects Versions: 0.20.2, 0.20.1
         Environment: Intel Nehalem cluster running Red Hat.
            Reporter: Paul Burkhardt
            Priority: Minor


An input split reports the locations that host the file blocks in the split. 
The locations are determined by the getBlockLocations method of the filesystem 
client, which requires a remote connection to the filesystem (i.e. HDFS). A 
separate remote connection is made for every file in the job's input. For jobs 
with many input files, these network connections dominate the cost of writing 
the input split file.

A job requests a listing of the input files from the remote filesystem and 
creates a FileStatus object as a handle for each file in the listing. The 
FileStatus object can be populated with the necessary host information on the 
remote end and passed to the client in the bulk return of the listing request. 
A getHosts method on FileStatus would then return the locations of the blocks 
comprising that file, eliminating the need for another trip to the remote 
filesystem.
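
The difference in round trips can be illustrated with a small, self-contained 
sketch. It does not use the Hadoop classes: RemoteFs, FileStatusWithHosts and 
the method names below are hypothetical stand-ins, and a "round trip" is just 
a counter incremented on each simulated remote call.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SplitLookupSketch {

    /** Hypothetical stand-in for a file handle that already carries its hosts. */
    static class FileStatusWithHosts {
        final String path;
        private final String[] hosts;

        FileStatusWithHosts(String path, String[] hosts) {
            this.path = path;
            this.hosts = hosts.clone();
        }

        /** Purely local: no remote call is needed to answer this. */
        String[] getHosts() {
            return hosts.clone();
        }
    }

    /** Hypothetical stand-in for the remote filesystem; counts simulated round trips. */
    static class RemoteFs {
        private final Map<String, String[]> hostsByPath = new LinkedHashMap<>();
        int roundTrips = 0;

        void addFile(String path, String... hosts) {
            hostsByPath.put(path, hosts);
        }

        /** Current pattern, step 1: the listing returns paths only. */
        List<String> listPaths() {
            roundTrips++;
            return new ArrayList<>(hostsByPath.keySet());
        }

        /** Current pattern, step 2: one remote call per file. */
        String[] getBlockLocations(String path) {
            roundTrips++;
            return hostsByPath.get(path).clone();
        }

        /** Proposed pattern: the listing itself carries the hosts. */
        List<FileStatusWithHosts> listWithHosts() {
            roundTrips++;
            List<FileStatusWithHosts> out = new ArrayList<>();
            for (Map.Entry<String, String[]> e : hostsByPath.entrySet()) {
                out.add(new FileStatusWithHosts(e.getKey(), e.getValue()));
            }
            return out;
        }
    }

    /** Round trips used when hosts are fetched file by file. */
    static int roundTripsPerFile(RemoteFs fs) {
        int before = fs.roundTrips;
        for (String path : fs.listPaths()) {
            fs.getBlockLocations(path);
        }
        return fs.roundTrips - before;
    }

    /** Round trips used when hosts arrive with the listing. */
    static int roundTripsBulk(RemoteFs fs) {
        int before = fs.roundTrips;
        for (FileStatusWithHosts status : fs.listWithHosts()) {
            status.getHosts();
        }
        return fs.roundTrips - before;
    }

    public static void main(String[] args) {
        RemoteFs fs = new RemoteFs();
        for (int i = 0; i < 1000; i++) {
            fs.addFile("/in/part-" + i, "host" + (i % 3));
        }
        System.out.println("per-file lookups: " + roundTripsPerFile(fs));
        System.out.println("bulk listing:     " + roundTripsBulk(fs));
    }
}
```

For N files the per-file pattern costs N + 1 simulated round trips (one 
listing plus one lookup per file), while the bulk pattern costs one. The 
sketch only models the call count, not the latency or payload size.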

The INodeFile maintains the blocks for a file and is an obvious choice to be 
the originator of the locations for that file. It is also available to the 
FSDirectory, which first creates the listing of FileStatus objects. We propose 
that the INodeFile generate the block locations, which are then used to 
instantiate the FileStatus object during the getListing request.
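
A minimal sketch of this server-side flow follows. It is not the HDFS 
implementation: Block, INodeFileSketch, FileStatusSketch and this getListing 
are hypothetical stand-ins, and collapsing the per-block locations into one 
deduplicated host list per file is a simplifying assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ListingSketch {

    /** Hypothetical block: an id plus the datanodes holding its replicas. */
    static class Block {
        final long id;
        final String[] datanodes;

        Block(long id, String... datanodes) {
            this.id = id;
            this.datanodes = datanodes;
        }
    }

    /** Hypothetical stand-in for INodeFile: owns the blocks of one file. */
    static class INodeFileSketch {
        final String name;
        final List<Block> blocks = new ArrayList<>();

        INodeFileSketch(String name) {
            this.name = name;
        }

        INodeFileSketch addBlock(Block b) {
            blocks.add(b);
            return this;
        }

        /** Distinct hosts over all blocks, in first-seen order (assumed policy). */
        String[] getHosts() {
            Set<String> hosts = new LinkedHashSet<>();
            for (Block b : blocks) {
                for (String dn : b.datanodes) {
                    hosts.add(dn);
                }
            }
            return hosts.toArray(new String[0]);
        }
    }

    /** Hypothetical stand-in for a FileStatus extended with host information. */
    static class FileStatusSketch {
        final String path;
        private final String[] hosts;

        FileStatusSketch(String path, String[] hosts) {
            this.path = path;
            this.hosts = hosts;
        }

        String[] getHosts() {
            return hosts.clone();
        }
    }

    /** Directory-side listing: each status is built with hosts from its inode. */
    static List<FileStatusSketch> getListing(String dir, List<INodeFileSketch> children) {
        List<FileStatusSketch> listing = new ArrayList<>();
        for (INodeFileSketch inode : children) {
            listing.add(new FileStatusSketch(dir + "/" + inode.name, inode.getHosts()));
        }
        return listing;
    }

    public static void main(String[] args) {
        List<INodeFileSketch> children = new ArrayList<>();
        children.add(new INodeFileSketch("part-0")
                .addBlock(new Block(1L, "dn1", "dn2"))
                .addBlock(new Block(2L, "dn2", "dn3")));
        for (FileStatusSketch s : getListing("/in", children)) {
            System.out.println(s.path + " -> " + String.join(",", s.getHosts()));
        }
    }
}
```

Because the inode already holds the block list, the hosts are computed where 
the data lives and travel back with the listing, so the client never issues 
a per-file location request.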

In our tests, this change yielded a 2000-fold speedup for approximately 60,000 
input files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.