[ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Owen O'Malley reassigned HADOOP-3095:
-------------------------------------
Assignee: Owen O'Malley
> Validating input paths and creating splits is slow on S3
> --------------------------------------------------------
>
> Key: HADOOP-3095
> URL: https://issues.apache.org/jira/browse/HADOOP-3095
> Project: Hadoop Core
> Issue Type: Improvement
> Components: fs, fs/s3
> Reporter: Tom White
> Assignee: Owen O'Malley
> Attachments: faster-job-init.patch
>
>
> A call to listPaths on S3FileSystem results in an S3 access for each file in
> the directory being queried. If the input contains hundreds or thousands of
> files, this is prohibitively slow. This method is called in
> FileInputFormat.validateInput and FileInputFormat.getSplits. This would be
> easy to fix by overriding listPaths (all four variants) in S3FileSystem so
> that it does not use listStatus, which creates a FileStatus object for each
> subpath. However, listPaths is deprecated in favour of listStatus, so this
> might be OK as a short-term measure, but not in the longer term.
> But it gets worse: FileInputFormat.getSplits goes on to access S3 a further
> six times for each input file via these calls:
> 1. fs.isDirectory
> 2. fs.exists
> 3. fs.getLength
> 4. fs.getLength
> 5. fs.exists (from fs.getFileBlockLocations)
> 6. fs.getBlockSize
> So it would be best to change getSplits to use listStatus, and only access S3
> once for each file. (This would help HDFS too.) This change would require
> some care since FileInputFormat has a protected method listPaths which
> subclasses can override (although, in passing, I notice that validateInput
> doesn't use listPaths - is this a bug?).
> For input validation, one approach would be to disable it for S3 by creating
> a custom FileInputFormat. In this case, missing files would be detected
> during split generation. Alternatively, it may be possible to cache the input
> paths between validateInput and getSplits.
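
The access counts described above can be illustrated with a small, self-contained sketch. It does not use any Hadoop classes: SplitSketch, Status, and CountingStore are hypothetical stand-ins, the store is an in-memory list that simply counts simulated remote accesses, and the split count is plain ceiling division rather than the real getSplits logic. It assumes, as the description states, that a listing costs one access per file.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

    /** Hypothetical stand-in for a FileStatus: all metadata from one lookup. */
    static final class Status {
        final String path;
        final long length;
        final long blockSize;
        final boolean isDir;

        Status(String path, long length, long blockSize, boolean isDir) {
            this.path = path;
            this.length = length;
            this.blockSize = blockSize;
            this.isDir = isDir;
        }
    }

    /** Hypothetical in-memory store that counts simulated remote accesses. */
    static final class CountingStore {
        private final List<Status> entries = new ArrayList<>();
        int accesses = 0;

        void add(String path, long length, long blockSize) {
            entries.add(new Status(path, length, blockSize, false));
        }

        private Status lookup(String path) {
            accesses++; // every metadata call counts as one round trip
            for (Status s : entries) {
                if (s.path.equals(path)) {
                    return s;
                }
            }
            return null;
        }

        boolean exists(String path) { return lookup(path) != null; }

        boolean isDirectory(String path) {
            Status s = lookup(path);
            return s != null && s.isDir;
        }

        long getLength(String path) { return lookup(path).length; }

        long getBlockSize(String path) { return lookup(path).blockSize; }

        /** Assumed cost: one access per entry, as described for S3 listings. */
        List<Status> listStatus() {
            accesses += entries.size();
            return new ArrayList<>(entries);
        }
    }

    /** Simplified split count: ceiling of length / blockSize. */
    static long splitsFor(long length, long blockSize) {
        return (length + blockSize - 1) / blockSize;
    }

    /** Mirrors the six per-file calls listed in the description. */
    static long countSplitsPerFileCalls(CountingStore fs) {
        long splits = 0;
        for (Status listed : fs.listStatus()) {
            String p = listed.path; // keep only the path, as listPaths would
            if (fs.isDirectory(p)) {                  // 1. isDirectory
                continue;
            }
            if (!fs.exists(p)) {                      // 2. exists
                throw new IllegalStateException("missing input: " + p);
            }
            long length = fs.getLength(p);            // 3. getLength
            fs.getLength(p);                          // 4. getLength again
            fs.exists(p);                             // 5. exists, via block locations
            long blockSize = fs.getBlockSize(p);      // 6. getBlockSize
            splits += splitsFor(length, blockSize);
        }
        return splits;
    }

    /** Reuses the metadata already returned by the single listing. */
    static long countSplitsFromStatus(CountingStore fs) {
        long splits = 0;
        for (Status s : fs.listStatus()) {
            if (!s.isDir) {
                splits += splitsFor(s.length, s.blockSize);
            }
        }
        return splits;
    }

    /** Three made-up input files with a 64-byte block size. */
    static CountingStore sample() {
        CountingStore fs = new CountingStore();
        fs.add("in/part-0", 100, 64);
        fs.add("in/part-1", 64, 64);
        fs.add("in/part-2", 1, 64);
        return fs;
    }

    public static void main(String[] args) {
        CountingStore a = sample();
        long perFile = countSplitsPerFileCalls(a);
        CountingStore b = sample();
        long fromStatus = countSplitsFromStatus(b);
        System.out.println("per-file calls: " + perFile + " splits, " + a.accesses + " accesses");
        System.out.println("from status:    " + fromStatus + " splits, " + b.accesses + " accesses");
    }
}
```

Under these assumptions both methods produce the same number of splits for the three sample files, but the per-file version makes seven accesses per file (one for the listing plus the six calls) while the status-based version makes one.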