This is discussed in:
https://issues.apache.org/jira/browse/HADOOP-3095
If this gets fixed in the next week it will make it into 0.18.
Doug
Kyle Sampson wrote:
We're using Hadoop 0.17 with S3 as the filesystem. We've created a
custom InputFormat for our data. In InputFormat.getSplits(), it needs to
list all of the files and directories under a certain path, and there
may be thousands of entries in there. It's
using FileSystem.listStatus() to get these paths. With S3, this is
turning out to be extraordinarily slow with directories that contain on
the order of thousands of subdirectories and files.
Looking into it a bit, it seems listStatus() is making a call to S3 for
every subdirectory or file found to get extra file status information.
It seems there used to be a listPaths() method that would just get the
paths, but that's been deprecated and removed. Is there any way
currently to get just a list of paths without status information?
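
To make the cost pattern above concrete, here is a toy model in plain Java. It does not use Hadoop or S3 and is not how listStatus() is actually implemented. FakeStore, listKeys(), fetchSize() and the other names are hypothetical, and the "one round trip per call" accounting is an assumption made only to show why one listing call plus one status call per entry grows with the number of entries, while a paths-only listing would not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of listing cost against a remote store. Not Hadoop code. */
public class ListingCostSketch {

    /** Hypothetical stand-in for a remote store that counts round trips. */
    static class FakeStore {
        private final TreeMap<String, Long> objects = new TreeMap<>();
        int remoteCalls = 0;

        void put(String key, long size) {
            objects.put(key, size);
        }

        /** Assumed to cost one round trip: returns every key under the prefix. */
        List<String> listKeys(String prefix) {
            remoteCalls++;
            List<String> keys = new ArrayList<>();
            for (String key : objects.keySet()) {
                if (key.startsWith(prefix)) {
                    keys.add(key);
                }
            }
            return keys;
        }

        /** Assumed to cost one round trip per key: fetches that key's metadata. */
        long fetchSize(String key) {
            remoteCalls++;
            return objects.get(key);
        }
    }

    /** List, then look up status for each entry: 1 + N round trips. */
    static int callsForListWithStatus(FakeStore store, String prefix) {
        store.remoteCalls = 0;
        for (String key : store.listKeys(prefix)) {
            store.fetchSize(key);
        }
        return store.remoteCalls;
    }

    /** List paths only: a single round trip in this model. */
    static int callsForPathsOnly(FakeStore store, String prefix) {
        store.remoteCalls = 0;
        store.listKeys(prefix);
        return store.remoteCalls;
    }

    /** Builds a store holding n entries under the prefix "data/". */
    static FakeStore storeWithEntries(int n) {
        FakeStore store = new FakeStore();
        for (int i = 0; i < n; i++) {
            store.put("data/part-" + i, 1024L);
        }
        return store;
    }

    public static void main(String[] args) {
        FakeStore store = storeWithEntries(3000);
        System.out.println("list + per-entry status: "
                + callsForListWithStatus(store, "data/") + " calls");
        System.out.println("paths only: "
                + callsForPathsOnly(store, "data/") + " calls");
    }
}
```

In this model the per-entry lookups dominate as soon as a directory holds more than a handful of entries, which matches the slowdown we see with thousands of subdirectories and files.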
Kyle Sampson
[EMAIL PROTECTED]