[ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602216#action_12602216 ]
Hadoop QA commented on HADOOP-3095:
-----------------------------------
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12383321/hadoop-3095-v4.patch
against trunk revision 662976.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified
tests.
Please justify why no tests are needed for this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 javac. The applied patch generated 447 javac compiler warnings (more
than the trunk's current 409 warnings).
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of
release audit warnings.
+1 core tests. The patch passed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.
Test results:
http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/testReport/
Findbugs warnings:
http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results:
http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/artifact/trunk/build/test/checkstyle-errors.html
Console output:
http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/console
This message is automatically generated.
> Validating input paths and creating splits is slow on S3
> --------------------------------------------------------
>
> Key: HADOOP-3095
> URL: https://issues.apache.org/jira/browse/HADOOP-3095
> Project: Hadoop Core
> Issue Type: Improvement
> Components: fs, fs/s3
> Reporter: Tom White
> Assignee: Tom White
> Fix For: 0.18.0
>
> Attachments: faster-job-init.patch, hadoop-3095-v2.patch,
> hadoop-3095-v3.patch, hadoop-3095-v4.patch, hadoop-3095.patch
>
>
> A call to listPaths on S3FileSystem results in an S3 access for each file in
> the directory being queried. If the input contains hundreds or thousands of
> files, this is prohibitively slow. This method is called in both
> FileInputFormat.validateInput and FileInputFormat.getSplits. It would be easy
> to fix by overriding listPaths (all four variants) in S3FileSystem so that it
> does not go through listStatus, which creates a FileStatus object (and hence
> an S3 request) for each subpath. However, since listPaths is deprecated in
> favour of listStatus, that is only acceptable as a short-term measure, not a
> longer-term one.
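>
> A rough sketch of the short-term override (one of the four variants, and not
> an actual patch) might look like the following, assuming S3FileSystem's
> FileSystemStore field is called 'store', that it exposes a listSubPaths(Path)
> listing, and that makeAbsolute() is the class's existing path helper:
> {code}
> // Sketch only, inside S3FileSystem: answer listPaths() from the store's own
> // listing instead of going through listStatus(), which issues an S3 request
> // per file to build each FileStatus. 'store', listSubPaths(Path) and
> // makeAbsolute() are assumed from the surrounding class.
> @Deprecated
> public Path[] listPaths(Path path) throws IOException {
>   Set<Path> subPaths = store.listSubPaths(makeAbsolute(path));
>   return subPaths.toArray(new Path[subPaths.size()]);
> }
> {code}
>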
> But it gets worse: FileInputFormat.getSplits goes on to access S3 a further
> six times for each input file via these calls:
> 1. fs.isDirectory
> 2. fs.exists
> 3. fs.getLength
> 4. fs.getLength
> 5. fs.exists (from fs.getFileBlockLocations)
> 6. fs.getBlockSize
> So it would be best to change getSplits to use listStatus and access S3 only
> once per file. (This would help HDFS too.) This change would require some
> care, since FileInputFormat has a protected listPaths method that subclasses
> can override (although, in passing, I notice that validateInput doesn't use
> listPaths - is this a bug?).
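>
> To make this concrete, a rough listStatus-based fragment is sketched below.
> It is not the committed patch: the class and method names, the minSplitSize
> parameter and the simplified split sizing are illustrative, and which
> getFileBlockLocations overload exists depends on the Hadoop version:
> {code}
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.hadoop.fs.BlockLocation;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.mapred.FileSplit;
>
> // Sketch only: one listStatus() call per input directory supplies what
> // previously needed six extra per-file FileSystem calls.
> public class ListStatusSplitSketch {
>
>   static List<FileSplit> splitsFor(FileSystem fs, Path inputDir,
>       long minSplitSize) throws IOException {
>     FileStatus[] statuses = fs.listStatus(inputDir);  // single S3 listing
>     List<FileSplit> splits = new ArrayList<FileSplit>();
>     if (statuses == null) {           // older releases return null if missing
>       return splits;
>     }
>     for (FileStatus status : statuses) {
>       if (status.isDir()) {           // replaces fs.isDirectory / fs.exists
>         continue;
>       }
>       long length = status.getLen();  // replaces both fs.getLength calls
>       long splitSize =                // replaces fs.getBlockSize
>           Math.max(1L, Math.max(minSplitSize, status.getBlockSize()));
>       // The FileStatus-based overload avoids the extra fs.exists inside
>       // getFileBlockLocations; availability depends on the Hadoop version.
>       BlockLocation[] locations = fs.getFileBlockLocations(status, 0, length);
>       // Simplified: real code picks the block containing each offset.
>       String[] hosts = (locations != null && locations.length > 0)
>           ? locations[0].getHosts() : new String[0];
>       for (long offset = 0; offset < length; offset += splitSize) {
>         splits.add(new FileSplit(status.getPath(), offset,
>             Math.min(splitSize, length - offset), hosts));
>       }
>     }
>     return splits;
>   }
> }
> {code}
>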
> For input validation, one approach would be to disable it for S3 by creating
> a custom FileInputFormat. In this case, missing files would be detected
> during split generation. Alternatively, it may be possible to cache the input
> paths between validateInput and getSplits.
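>
> A minimal sketch of the first option is below, assuming the old mapred
> InputFormat still offers the validateInput(JobConf) hook (it was removed in
> later releases); the class name is made up for illustration:
> {code}
> import java.io.IOException;
>
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.TextInputFormat;
>
> // Sketch only: skip up-front input validation for S3, so missing inputs are
> // reported during split generation rather than costing extra S3 round trips
> // per path here. The class name is illustrative.
> public class S3TextInputFormat extends TextInputFormat {
>   @Override
>   public void validateInput(JobConf job) throws IOException {
>     // Deliberately a no-op on S3.
>   }
> }
> {code}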