[ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602216#action_12602216 ]
Hadoop QA commented on HADOOP-3095:
-----------------------------------
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12383321/hadoop-3095-v4.patch
against trunk revision 662976.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified
tests.
Please justify why no tests are needed for this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 javac. The applied patch generated 447 javac compiler warnings (more
than the trunk's current 409 warnings).
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of
release audit warnings.
+1 core tests. The patch passed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.
Test results:
http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/testReport/
Findbugs warnings:
http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results:
http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/artifact/trunk/build/test/checkstyle-errors.html
Console output:
http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/console
This message is automatically generated.
> Validating input paths and creating splits is slow on S3
> --------------------------------------------------------
>
> Key: HADOOP-3095
> URL: https://issues.apache.org/jira/browse/HADOOP-3095
> Project: Hadoop Core
> Issue Type: Improvement
> Components: fs, fs/s3
> Reporter: Tom White
> Assignee: Tom White
> Fix For: 0.18.0
>
> Attachments: faster-job-init.patch, hadoop-3095-v2.patch,
> hadoop-3095-v3.patch, hadoop-3095-v4.patch, hadoop-3095.patch
>
>
> A call to listPaths on S3FileSystem results in an S3 access for each file in
> the directory being queried. If the input contains hundreds or thousands of
> files, this is prohibitively slow. This method is called in both
> FileInputFormat.validateInput and FileInputFormat.getSplits. It would be easy
> to fix by overriding listPaths (all four variants) in S3FileSystem so that it
> does not go through listStatus, which creates a FileStatus object (and hence
> an S3 request) for each subpath. However, since listPaths is deprecated in
> favour of listStatus, that is only acceptable as a short-term measure, not a
> longer-term one.
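>
> A rough sketch of the short-term override (one of the four variants, and not
> an actual patch) might look like the following, assuming S3FileSystem's
> FileSystemStore field is called 'store', that it exposes a listSubPaths(Path)
> listing, and that makeAbsolute() is the class's existing path helper:
> {code}
> // Sketch only, inside S3FileSystem: answer listPaths() from the store's own
> // listing instead of going through listStatus(), which issues an S3 request
> // per file to build each FileStatus. 'store', listSubPaths(Path) and
> // makeAbsolute() are assumed from the surrounding class.
> @Deprecated
> public Path[] listPaths(Path path) throws IOException {
>   Set<Path> subPaths = store.listSubPaths(makeAbsolute(path));
>   return subPaths.toArray(new Path[subPaths.size()]);
> }
> {code}
>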
> But it gets worse: FileInputFormat.getSplits goes on to access S3 a further
> six times for each input file via these calls:
> 1. fs.isDirectory
> 2. fs.exists
> 3. fs.getLength
> 4. fs.getLength
> 5. fs.exists (from fs.getFileBlockLocations)
> 6. fs.getBlockSize
> So it would be best to change getSplits to use listStatus and access S3 only
> once per file. (This would help HDFS too.) This change would require some
> care, since FileInputFormat has a protected listPaths method that subclasses
> can override (although, in passing, I notice that validateInput doesn't use
> listPaths - is this a bug?).
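>
> To make this concrete, a rough listStatus-based fragment is sketched below.
> It is not the committed patch: the class and method names, the minSplitSize
> parameter and the simplified split sizing are illustrative, and which
> getFileBlockLocations overload exists depends on the Hadoop version:
> {code}
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.hadoop.fs.BlockLocation;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.mapred.FileSplit;
>
> // Sketch only: one listStatus() call per input directory supplies what
> // previously needed six extra per-file FileSystem calls.
> public class ListStatusSplitSketch {
>
>   static List<FileSplit> splitsFor(FileSystem fs, Path inputDir,
>       long minSplitSize) throws IOException {
>     FileStatus[] statuses = fs.listStatus(inputDir);  // single S3 listing
>     List<FileSplit> splits = new ArrayList<FileSplit>();
>     if (statuses == null) {           // older releases return null if missing
>       return splits;
>     }
>     for (FileStatus status : statuses) {
>       if (status.isDir()) {           // replaces fs.isDirectory / fs.exists
>         continue;
>       }
>       long length = status.getLen();  // replaces both fs.getLength calls
>       long splitSize =                // replaces fs.getBlockSize
>           Math.max(1L, Math.max(minSplitSize, status.getBlockSize()));
>       // The FileStatus-based overload avoids the extra fs.exists inside
>       // getFileBlockLocations; availability depends on the Hadoop version.
>       BlockLocation[] locations = fs.getFileBlockLocations(status, 0, length);
>       // Simplified: real code picks the block containing each offset.
>       String[] hosts = (locations != null && locations.length > 0)
>           ? locations[0].getHosts() : new String[0];
>       for (long offset = 0; offset < length; offset += splitSize) {
>         splits.add(new FileSplit(status.getPath(), offset,
>             Math.min(splitSize, length - offset), hosts));
>       }
>     }
>     return splits;
>   }
> }
> {code}
>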
> For input validation, one approach would be to disable it for S3 by creating
> a custom FileInputFormat. In this case, missing files would be detected
> during split generation. Alternatively, it may be possible to cache the input
> paths between validateInput and getSplits.
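>
> A minimal sketch of the first option is below, assuming the old mapred
> InputFormat still offers the validateInput(JobConf) hook (it was removed in
> later releases); the class name is made up for illustration:
> {code}
> import java.io.IOException;
>
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.TextInputFormat;
>
> // Sketch only: skip up-front input validation for S3, so missing inputs are
> // reported during split generation rather than costing extra S3 round trips
> // per path here. The class name is illustrative.
> public class S3TextInputFormat extends TextInputFormat {
>   @Override
>   public void validateInput(JobConf job) throws IOException {
>     // Deliberately a no-op on S3.
>   }
> }
> {code}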