[jira] Commented: (MAPREDUCE-1973) Optimize input split creation

Paul Burkhardt (JIRA) Wed, 28 Jul 2010 11:10:40 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893282#action_12893282
 ]


Paul Burkhardt commented on MAPREDUCE-1973:
-------------------------------------------

Yes, it appears that we are attempting to resolve the same issue; sorry for 
missing the prior ticket.

I am in favor of both implementations adding the block locations to the 
FileStatus object. That is reasonable, since the locations are part of the 
description of a file, and it makes them readily accessible to the 
InputFormat. My approach is more primitive and lightweight, but it is also 
independent of any directory management.

> Optimize input split creation
> -----------------------------
>
>                 Key: MAPREDUCE-1973
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1973
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 0.20.1, 0.20.2
>         Environment: Intel Nehalem cluster running Red Hat.
>            Reporter: Paul Burkhardt
>            Priority: Minor
>         Attachments: HADOOP-1973.patch
>
>
> The input split returns the locations that host the file blocks in the split. 
> The locations are determined by the getBlockLocations method of the 
> filesystem client which requires a remote connection to the filesystem (i.e. 
> HDFS). The remote connection is made for each file in the entire input split. 
> For jobs with many input files the network connections dominate the cost of 
> writing the input split file.
> A job requests a listing of the input files from the remote filesystem and 
> creates a FileStatus object as a handle for each file in the listing. The 
> FileStatus object can be imbued with the necessary host information on the 
> remote end and passed to the client side in the bulk return of the listing 
> request. A getHosts method of the FileStatus would then return the locations 
> for the blocks comprising that file and eliminate the need for another trip 
> to the remote filesystem.
> The INodeFile maintains the blocks for a file and is an obvious choice to be 
> the originator for the locations of that file. It is also available to the 
> FSDirectory which first creates the listing of FileStatus objects. We propose 
> that the block locations be generated by the INodeFile to instantiate the 
> FileStatus object during the getListing request.
> Our tests demonstrated a factor of 2000 speedup for approximately 60,000 
> input files.
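
The difference between the two access patterns described above can be sketched 
with a toy model. This is only an illustration under stated assumptions: 
ToyRemoteFs, FileHandle, list and blockHosts are hypothetical stand-ins, not 
Hadoop classes or the code in the attached patch, and a "round trip" is just a 
counter increment rather than a real network call.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types are Hadoop classes.
class SplitLocationSketch {

    /** Toy file handle; hosts is null unless the listing filled it in. */
    static final class FileHandle {
        final String path;
        final List<String> hosts;

        FileHandle(String path, List<String> hosts) {
            this.path = path;
            this.hosts = hosts;
        }
    }

    /** Toy remote filesystem that counts simulated round trips. */
    static final class ToyRemoteFs {
        private final Map<String, List<String>> hostsByPath = new LinkedHashMap<>();
        int roundTrips = 0;

        ToyRemoteFs(int fileCount) {
            for (int i = 0; i < fileCount; i++) {
                hostsByPath.put("/input/part-" + i,
                        Collections.singletonList("host" + (i % 3)));
            }
        }

        /** One round trip for the whole listing, with or without host info. */
        List<FileHandle> list(boolean withHosts) {
            roundTrips++;
            List<FileHandle> out = new ArrayList<>();
            for (Map.Entry<String, List<String>> e : hostsByPath.entrySet()) {
                out.add(new FileHandle(e.getKey(), withHosts ? e.getValue() : null));
            }
            return out;
        }

        /** One round trip per file, as in the per-file lookup pattern. */
        List<String> blockHosts(String path) {
            roundTrips++;
            return hostsByPath.get(path);
        }
    }

    /** Listing first, then one extra lookup per file. */
    static int perFileRoundTrips(int fileCount) {
        ToyRemoteFs fs = new ToyRemoteFs(fileCount);
        for (FileHandle f : fs.list(false)) {
            fs.blockHosts(f.path);
        }
        return fs.roundTrips;
    }

    /** Host information travels with the listing; no further lookups. */
    static int bulkRoundTrips(int fileCount) {
        ToyRemoteFs fs = new ToyRemoteFs(fileCount);
        for (FileHandle f : fs.list(true)) {
            if (f.hosts == null) {
                throw new IllegalStateException("listing did not carry hosts");
            }
        }
        return fs.roundTrips;
    }

    public static void main(String[] args) {
        int files = 60000; // illustrative count only
        System.out.println("per-file round trips: " + perFileRoundTrips(files));
        System.out.println("bulk round trips: " + bulkRoundTrips(files));
    }
}
```

In this model the per-file pattern costs one round trip for the listing plus 
one per file, while the bulk pattern costs a single round trip. The model 
ignores payload size and server-side work, so it says nothing about the 
measured speedup reported above.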

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
