Validating input paths and creating splits is slow on S3
--------------------------------------------------------

                 Key: HADOOP-3095
                 URL: https://issues.apache.org/jira/browse/HADOOP-3095
             Project: Hadoop Core
          Issue Type: Improvement
          Components: fs, fs/s3
            Reporter: Tom White


A call to listPaths on S3FileSystem results in an S3 access for each file in 
the directory being queried. If the input contains hundreds or thousands of 
files this is prohibitively slow. This method is called in 
FileInputFormat.validateInput and FileInputFormat.getSplits. This would be easy 
to fix by overriding listPaths (all four variants) in S3FileSystem so that it 
does not use listStatus, which creates a FileStatus object (and hence an S3 
access) for each subpath. However, listPaths is deprecated in favour of 
listStatus, so this would only be acceptable as a short-term measure, not a 
longer-term one.
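
If S3FileSystem answered listPaths from the store's own key listing instead 
of going through listStatus, the per-file round trips would disappear. A 
minimal sketch of one variant only, assuming internal details like 
FileSystemStore.listSubPaths and the private makeAbsolute helper (names may 
differ; the other three variants would need the same treatment):

{code}
// Sketch of an override inside S3FileSystem (java.util.Set and
// org.apache.hadoop.fs.Path are already imported there). The store's
// listSubPaths returns bare Paths, so no FileStatus - and no extra S3
// access - is created per entry.
@Override
@Deprecated
public Path[] listPaths(Path path) throws IOException {
  Set<Path> subPaths = store.listSubPaths(makeAbsolute(path));
  if (subPaths == null) {
    return null;
  }
  return subPaths.toArray(new Path[subPaths.size()]);
}
{code}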

But it gets worse: FileInputFormat.getSplits goes on to access S3 a further six 
times for each input file via these calls (all of which, as the sketch after 
the list shows, could be answered from a single FileStatus):

1. fs.isDirectory
2. fs.exists
3. fs.getLength
4. fs.getLength
5. fs.exists (from fs.getFileBlockLocations)
6. fs.getBlockSize
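
An illustrative helper (not Hadoop source) showing that one metadata fetch 
per file covers all six calls:

{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OneStatPerFile {
  static void inspect(FileSystem fs, Path file) throws IOException {
    FileStatus status = fs.getFileStatus(file); // single S3 access; throws
                                                // if the file is missing,
                                                // covering both exists calls
    boolean isDir = status.isDir();             // covers fs.isDirectory
    long length = status.getLen();              // covers both fs.getLength calls
    long blockSize = status.getBlockSize();     // covers fs.getBlockSize
    System.out.println(file + ": dir=" + isDir + " len=" + length
        + " blockSize=" + blockSize);
  }
}
{code}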

So it would be best to change getSplits to use listStatus, and only access S3 
once for each file. (This would help HDFS too.) This change would require some 
care, since FileInputFormat has a protected listPaths method that subclasses 
can override. (In passing, I notice that validateInput doesn't use listPaths - 
is this a bug?)
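
A sketch of that direction, heavily simplified: one split per file, no 
division of large files by block size, and the protected-listPaths 
compatibility question set aside:

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

public class ListStatusSplits {
  // One listStatus call per input directory; each FileStatus already
  // carries the length and block size, so no further FileSystem calls
  // are needed per file.
  static InputSplit[] getSplits(JobConf job) throws IOException {
    List<FileSplit> splits = new ArrayList<FileSplit>();
    for (Path dir : FileInputFormat.getInputPaths(job)) {
      FileSystem fs = dir.getFileSystem(job);
      for (FileStatus status : fs.listStatus(dir)) {
        if (!status.isDir()) {
          splits.add(new FileSplit(status.getPath(), 0, status.getLen(),
              (String[]) null)); // no host hints
        }
      }
    }
    return splits.toArray(new InputSplit[splits.size()]);
  }
}
{code}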

For input validation, one approach would be to disable it for S3 by creating a 
custom FileInputFormat. In this case, missing files would be detected during 
split generation. Alternatively, it may be possible to cache the input paths 
between validateInput and getSplits.
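
For the first approach, the custom format could simply make validateInput a 
no-op. A sketch, assuming the InputFormat.validateInput hook and using 
TextInputFormat purely as an example base class (S3TextInputFormat is a 
hypothetical name):

{code}
import java.io.IOException;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class S3TextInputFormat extends TextInputFormat {
  // No-op: skip the per-file existence checks, whose S3 round trips are
  // what makes validation slow. Missing inputs then surface in getSplits.
  public void validateInput(JobConf job) throws IOException {
  }
}
{code}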
