[ https://issues.apache.org/jira/browse/HADOOP-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720835#action_12720835 ]

Doug Cutting commented on HADOOP-5795:
--------------------------------------

> I think the extended version of the API would help in doing incremental 
> distcp when hdfs-append is supported.

Thanks for the use case!  An append-savvy incremental distcp might first use 
listStatus to get all file lengths and dates from both filesystems, then figure 
out which had grown longer but whose creation dates had not changed, indicating 
they'd been appended to.  Then a batch call could be made to fetch block 
locations of just newly appended sections, and these would be used to construct 
splits that can be localized well.  Does that sound right?
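
A rough sketch of that detection-and-batching step, assuming the hypothetical 
BlockLocationRequest carrier sketched further below (nothing here exists in 
FileSystem yet). FileStatus exposes lengths and modification times, so the 
creation-date check described above is elided here:

    // Find files on src that grew relative to dst, and request block
    // locations for only the newly appended bytes of each one.
    List<BlockLocationRequest> requests = new ArrayList<BlockLocationRequest>();
    for (FileStatus src : srcFs.listStatus(srcDir)) {
      if (src.isDir()) continue;
      FileStatus dst =
          dstFs.getFileStatus(new Path(dstDir, src.getPath().getName()));
      if (src.getLen() > dst.getLen()) {
        // Grew longer: fetch locations for the tail [dst.getLen(), src.getLen()).
        requests.add(new BlockLocationRequest(
            src.getPath(), dst.getLen(), src.getLen() - dst.getLen()));
      }
    }
    // One batch RPC instead of one getFileBlockLocations call per file;
    // the returned locations then drive split construction for the copy maps.
    BlockLocation[] appendedLocations =
        srcFs.getBlockLocations(requests.toArray(new BlockLocationRequest[0]));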

In this case we would not list directories, but rather always pass in a list of 
individual files.  The mapping from inputs to outputs would be 1:1 so it could 
take the form:

BlockLocation[] getBlockLocations(BlockLocationRequest[])

A corollary is that it does not make sense to pass start/end positions for a 
directory, although these could be ignored.
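
For concreteness, such a request might look like this; the class and its 
fields are purely illustrative, not an agreed-upon interface:

    // Hypothetical request carrier for the batch call. The fields mirror the
    // parameters of the existing per-file
    // getFileBlockLocations(FileStatus, long, long), so start/length would
    // simply be ignored when the path names a directory.
    public class BlockLocationRequest {
      private final Path path;   // an individual file, in the distcp case above
      private final long start;  // first byte of interest (e.g. the old length)
      private final long length; // bytes of interest (e.g. the appended section)

      public BlockLocationRequest(Path path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
      }
      // getters (and Writable serialization for RPC) omitted
    }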

Do we want to try to develop a single swiss-army-knife batch call, or add 
operation-optimized calls as we go?

> Add a bulk FileSystem.getFileBlockLocations
> -------------------------------------------
>
>                 Key: HADOOP-5795
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5795
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.20.0
>            Reporter: Arun C Murthy
>            Assignee: Jakob Homan
>             Fix For: 0.21.0
>
>
> Currently map-reduce applications (specifically file-based input-formats) use 
> FileSystem.getFileBlockLocations to compute splits. However, they are forced 
> to call it once per file.
> The downsides are multiple:
>    # Even with a few thousand files to process, the number of RPCs quickly 
> becomes noticeable.
>    # The current implementation of getFileBlockLocations is too slow, since 
> each call results in a 'search' in the namesystem. With a few thousand 
> input files, that means as many RPCs and 'searches'.
> It would be nice to have a FileSystem.getFileBlockLocations which can take in 
> a directory and return the block locations for all files in that directory. 
> We could eliminate the per-file RPCs and replace the per-file 'searches' 
> with a single 'scan'.
> When I tested this for terasort, a moderate job with 8000 input files, the 
> runtime halved from the current 8s to 4s. Clearly this is much more important 
> for latency-sensitive applications...
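
To make the RPC cost in the description concrete, here is the per-file pattern 
it refers to, followed by the kind of bulk call it proposes (the bulk 
signature is illustrative; it is not in FileSystem today):

    // Today: one RPC, and one namesystem 'search', per input file.
    for (FileStatus file : fs.listStatus(inputDir)) {
      BlockLocation[] locs = fs.getFileBlockLocations(file, 0, file.getLen());
      // ... build splits from locs ...
    }

    // Proposed: a single RPC that returns locations for every file under the
    // directory, turning N 'searches' into one 'scan'.
    BlockLocation[][] allLocs = fs.getFileBlockLocations(inputDir);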
