[ https://issues.apache.org/jira/browse/HDFS-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909924#action_12909924 ]
Paul Burkhardt commented on HDFS-1402:
--------------------------------------
I decided to patch against trunk. The changes span both HDFS and Common, but
I have attached them as two separate patches to this ticket for now.
As previously noted, this patch addresses the same core issue as HDFS-202. My
concern is that HDFS-202 adds a parallel set of interfaces to support file
status objects with location information. My argument is that the locations of
a file should be a first-class attribute shared by all file types. If we
require a single accessor, getHosts or getLocations, on every file status type,
we can simplify the client and server API for creating and listing file status
objects. A file status from a distributed file system, e.g. HDFS, would return
the hosts for the file's blocks, whereas a file status from a non-distributed
or local file system would return a single host, all through the same
interface.
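To make the proposal concrete, here is a minimal sketch of a single getHosts
accessor shared by a local and a distributed file status type. The names
LocatedStatus, LocalStatus, and DistributedStatus are hypothetical and used
only for illustration; they are not the classes in the attached patches or in
Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a file status type; NOT Hadoop's FileStatus.
interface LocatedStatus {
    String getPath();

    // Hosts that hold the file's data, exposed the same way for every file system.
    List<String> getHosts();
}

// Status for a file on a local (non-distributed) file system: exactly one host.
final class LocalStatus implements LocatedStatus {
    private final String path;
    private final String host;

    LocalStatus(String path, String host) {
        this.path = path;
        this.host = host;
    }

    public String getPath() { return path; }

    public List<String> getHosts() {
        return Collections.singletonList(host);
    }
}

// Status for a file on a distributed file system: one replica-host list per block.
final class DistributedStatus implements LocatedStatus {
    private final String path;
    private final List<List<String>> replicaHostsPerBlock;

    DistributedStatus(String path, List<List<String>> replicaHostsPerBlock) {
        this.path = path;
        this.replicaHostsPerBlock = new ArrayList<>(replicaHostsPerBlock);
    }

    public String getPath() { return path; }

    // Flattens the per-block replica hosts into a de-duplicated, ordered list.
    public List<String> getHosts() {
        Set<String> hosts = new LinkedHashSet<>();
        for (List<String> replicas : replicaHostsPerBlock) {
            hosts.addAll(replicas);
        }
        return new ArrayList<>(hosts);
    }
}

final class HostsDemo {
    public static void main(String[] args) {
        List<LocatedStatus> listing = Arrays.asList(
            new LocalStatus("/tmp/a", "localhost"),
            new DistributedStatus("/data/b", Arrays.asList(
                Arrays.asList("dn1", "dn2"),
                Arrays.asList("dn2", "dn3"))));
        // The caller never needs to know which kind of file system produced the status.
        for (LocatedStatus status : listing) {
            System.out.println(status.getPath() + " -> " + status.getHosts());
        }
    }
}
```

With one accessor, client code that builds splits can treat every status
object identically instead of branching on a separate located-status type.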
> Optimize input split creation
> -----------------------------
>
> Key: HDFS-1402
> URL: https://issues.apache.org/jira/browse/HDFS-1402
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 0.22.0
> Environment: Intel Nehalem cluster running Red Hat.
> Reporter: Paul Burkhardt
> Priority: Minor
> Attachments: HADOOP-1973.patch, HDFS-1402.common.patch,
> HDFS-1402.patch
>
>
> The input split returns the locations that host the file blocks in the split.
> The locations are determined by the getBlockLocations method of the
> filesystem client, which requires a remote connection to the filesystem (e.g.
> HDFS). A remote connection is made for every file covered by the input split.
> For jobs with many input files, these network connections dominate the cost of
> writing the input split file.
>
> A job requests a listing of the input files from the remote filesystem and
> creates a FileStatus object as a handle for each file in the listing. The
> FileStatus object can be populated with the necessary host information on the
> remote end and passed to the client side in the bulk return of the listing
> request. A getHosts method on FileStatus would then return the locations
> of the blocks comprising that file and eliminate the need for another trip
> to the remote filesystem.
>
> The INodeFile maintains the blocks for a file and is an obvious choice to be
> the originator for the locations of that file. It is also available to the
> FSDirectory which first creates the listing of FileStatus objects. We propose
> that the block locations be generated by the INodeFile to instantiate the
> FileStatus object during the getListing request.
> Our tests demonstrated a factor of 2000 speedup for approximately 60,000
> input files.
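
The round-trip saving described in the issue can be illustrated with a toy
model. This is only a sketch under stated assumptions: ToyRemoteFs and
SplitPlanning are hypothetical names, the "remote" side is an in-memory map,
and a counter stands in for network round trips. It does not use or reproduce
the Hadoop API, and it says nothing about the measured speedup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory "remote" file system that counts simulated round trips.
final class ToyRemoteFs {
    private final Map<String, List<String>> hostsByPath = new LinkedHashMap<>();
    int roundTrips = 0;

    void addFile(String path, String... hosts) {
        hostsByPath.put(path, Arrays.asList(hosts));
    }

    // One round trip: returns the file paths only, without locations.
    List<String> listPaths() {
        roundTrips++;
        return new ArrayList<>(hostsByPath.keySet());
    }

    // One round trip per call: models a per-file block location lookup.
    List<String> getBlockHosts(String path) {
        roundTrips++;
        return hostsByPath.get(path);
    }

    // One round trip: returns every path together with its hosts.
    Map<String, List<String>> listWithHosts() {
        roundTrips++;
        return new LinkedHashMap<>(hostsByPath);
    }
}

final class SplitPlanning {
    // Listing followed by one lookup per file: 1 + N round trips.
    static Map<String, List<String>> perFileLookup(ToyRemoteFs fs) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String path : fs.listPaths()) {
            result.put(path, fs.getBlockHosts(path));
        }
        return result;
    }

    // Hosts arrive with the listing itself: 1 round trip regardless of N.
    static Map<String, List<String>> bulkLookup(ToyRemoteFs fs) {
        return fs.listWithHosts();
    }
}
```

For N input files the first approach costs 1 + N simulated round trips and the
second costs 1, while both return the same path-to-hosts mapping.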