[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing

Steve Loughran (JIRA) Wed, 12 Apr 2017 05:14:14 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965716#comment-15965716
 ]


Steve Loughran commented on MAPREDUCE-5907:
-------------------------------------------

I don't know of anyone looking at it.

It's an out-of-date patch, combining optimisations in the FS code and the S3N 
and HAR FS implementations with changes in the MR code to match.

If the changes to the mapreduce module can go in today, using the existing 
{{FileSystem.listFiles(path, recursive)}} call, then it'll be straightforward: 
that's the only bit which needs review and merge; S3A already handles that 
recursively very efficiently, and the other object stores can be brought up to 
speed.

If we need changes to the FS, well, I'm not against them (there are definite 
inconsistencies there), but it's a more serious change: the HDFS team will need 
to look at that, we'll need changes to the FS spec, contract tests, etc, etc. 
Lots of work and so harder to get in.

Why not try applying just the MR changes and see what happens?

> Improve getSplits() performance for fs implementations that can utilize 
> performance gains from recursive listing
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5907
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 2.4.0
>            Reporter: Sumit Kumar
>            Assignee: Sumit Kumar
>              Labels: BB2015-05-TBR
>         Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, 
> MAPREDUCE-5907.patch
>
>
> FileInputFormat (both the mapreduce and mapred implementations) lists 
> directories recursively while calculating splits. However, it does this 
> level by level: to discover the files under /foo/bar, it lists /foo/bar 
> first to get the immediate children, then makes the same call on each of 
> those children to discover their immediate children, and so on. This 
> doesn't scale well for object-store-based fs implementations like s3 and 
> swift, because every listStatus call ends up being a web service call to 
> the backend. When a large number of files is considered for input, this 
> makes the getSplits() call slow. 
> This patch adds a new set of recursive list APIs that give fs 
> implementations the opportunity to optimize. The behavior remains the same 
> for other implementations (a default implementation is provided, so they 
> don't have to implement anything new). For object-store-based fs 
> implementations, however, a simple change to pass the recursive flag as 
> true (as shown in the patch) improves listing performance.
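
The cost difference described in the issue can be sketched with a toy model. 
This is not Hadoop code: ToyObjectStore, listFilesRecursive and 
walkLevelByLevel are hypothetical names invented for illustration, and the 
model assumes every list request is exactly one round trip (it ignores the 
result paging that real object stores do).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy comparison of level-by-level listing and a single flat listing. */
class ListingCostSketch {

    /** Hypothetical in-memory object store: a flat, sorted set of keys. */
    static final class ToyObjectStore {
        private final TreeSet<String> keys = new TreeSet<>();
        int listCalls = 0; // each list request counts as one simulated round trip

        void put(String key) {
            keys.add(key);
        }

        private static String asPrefix(String dir) {
            return dir.endsWith("/") ? dir : dir + "/";
        }

        /** One request: immediate children only; directory entries end with '/'. */
        List<String> listStatus(String dir) {
            listCalls++;
            String prefix = asPrefix(dir);
            TreeSet<String> children = new TreeSet<>();
            for (String key : keys.tailSet(prefix)) {
                if (!key.startsWith(prefix)) {
                    break; // keys are sorted, so the prefix range has ended
                }
                int slash = key.indexOf('/', prefix.length());
                children.add(slash < 0 ? key : key.substring(0, slash + 1));
            }
            return new ArrayList<>(children);
        }

        /** One request: every key under the prefix, however deeply nested. */
        List<String> listFilesRecursive(String dir) {
            listCalls++;
            String prefix = asPrefix(dir);
            List<String> files = new ArrayList<>();
            for (String key : keys.tailSet(prefix)) {
                if (!key.startsWith(prefix)) {
                    break;
                }
                files.add(key);
            }
            return files;
        }
    }

    /** Level-by-level discovery: one listStatus request per directory visited. */
    static List<String> walkLevelByLevel(ToyObjectStore store, String dir) {
        List<String> files = new ArrayList<>();
        for (String child : store.listStatus(dir)) {
            if (child.endsWith("/")) {
                files.addAll(walkLevelByLevel(store, child));
            } else {
                files.add(child);
            }
        }
        return files;
    }

    /** Small sample tree under /foo/bar with four files in four directories. */
    static ToyObjectStore sampleStore() {
        ToyObjectStore store = new ToyObjectStore();
        store.put("/foo/bar/part-0");
        store.put("/foo/bar/a/part-1");
        store.put("/foo/bar/a/part-2");
        store.put("/foo/bar/b/c/part-3");
        return store;
    }

    public static void main(String[] args) {
        ToyObjectStore levelByLevel = sampleStore();
        List<String> walked = walkLevelByLevel(levelByLevel, "/foo/bar");
        System.out.println(walked.size() + " files, "
                + levelByLevel.listCalls + " list calls (level by level)");

        ToyObjectStore flat = sampleStore();
        List<String> listed = flat.listFilesRecursive("/foo/bar");
        System.out.println(listed.size() + " files, "
                + flat.listCalls + " list calls (flat recursive)");
    }
}
```

In this model the level-by-level walk issues one request per directory it 
visits, so its cost grows with the depth and width of the tree, while the 
flat listing stays at a single request. How closely a real store matches 
that depends on its paging and on how the fs implementation maps 
listFiles(path, recursive) onto the store's API.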


