Michel Lemay created HADOOP-15192:
-------------------------------------

             Summary: S3A listStatus generates one 404 error for each path
                 Key: HADOOP-15192
                 URL: https://issues.apache.org/jira/browse/HADOOP-15192
             Project: Hadoop Common
          Issue Type: Improvement
          Components: fs/s3
    Affects Versions: 2.7.3
            Reporter: Michel Lemay


Symptoms:
 - CloudWatch metrics for S3 show an unexpectedly large number of 4xx errors in our bucket.
 - Performance when listing files recursively is abysmal (15 minutes on our bucket, compared to less than 2 minutes using the CLI `aws s3 ls`).

Analysis:
 - In the CloudTrail logs for this bucket, we found that it generates one 404 (NoSuchKey) error per folder listed recursively.
 - Spark recursively calls FileSystem::listStatus (the S3AFileSystem implementation from hadoop-aws 2.7.3), which in turn calls getFileStatus to determine whether each path is a directory.
 - It turns out that this call to getFileStatus yields a 404 when the path is a directory but does not end with a slash. It then retries with a trailing slash appended, incurring one extra, unneeded call to S3 (see the sketch below).
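
To make the sequence concrete, here is a simplified sketch of the probes we believe getFileStatus performs, using the AWS SDK for Java v1. This is not the actual S3AFileSystem code; the class and method names are mine, purely for illustration. The first probe is the one that produces one 404 per folder:

{code:java}
import com.amazonaws.AmazonServiceException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;

public class DirectoryProbeSketch {

  /** Returns true if "key" behaves like a directory in this bucket. */
  static boolean isDirectory(AmazonS3 s3, String bucket, String key) {
    try {
      // 1st probe: HEAD on the bare key ("a/b"). For a directory this key
      // usually does not exist, so S3 answers 404 -- the NoSuchKey errors
      // we see in CloudWatch/CloudTrail.
      s3.getObjectMetadata(bucket, key);
      return false; // a plain object exists at this key
    } catch (AmazonServiceException e) {
      if (e.getStatusCode() != 404) {
        throw e;
      }
    }
    try {
      // 2nd probe: HEAD on the key with a trailing slash ("a/b/"), which
      // matches an empty directory marker object if one exists.
      s3.getObjectMetadata(bucket, key + "/");
      return true;
    } catch (AmazonServiceException e) {
      if (e.getStatusCode() != 404) {
        throw e;
      }
    }
    // 3rd probe: LIST under the prefix; any child means "directory".
    ObjectListing listing = s3.listObjects(new ListObjectsRequest()
        .withBucketName(bucket)
        .withPrefix(key + "/")
        .withMaxKeys(1));
    return !listing.getObjectSummaries().isEmpty()
        || !listing.getCommonPrefixes().isEmpty();
  }
}
{code}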

Questions:
 - Why is the trailing slash removed in the first place? (The Hadoop Path class normalizes paths by removing trailing slashes when they are constructed.)
 - S3AFileSystem::listStatus needs to know whether the path is a directory. However, it is a common usage pattern to already have that FileStatus object in hand when recursively listing files, so re-fetching it incurs an unneeded performance penalty. The base FileSystem class could offer an optimized API that takes advantage of this (or fix the unoptimized call from listLocatedStatus(recursive=true) to listStatus).
 - I might be wrong on this last point, but I think the S3 object API fetches every object under a prefix (not just the current level) and filters them out. If that is the case, there should be an opportunity for an efficient recursive listStatus implementation for S3 that uses only paginated calls against the top-level folder (see the sketch after this list).
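
For the last point, here is a rough sketch of what such a flat listing could look like with the AWS SDK for Java v1. Again, the class and method names are mine and purely illustrative; it just shows that one paginated prefix scan (no delimiter) returns every object below the top-level folder, instead of one getFileStatus probe per directory:

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import java.util.ArrayList;
import java.util.List;

public class FlatListingSketch {

  /** Lists every object under "prefix" using paginated LIST calls only. */
  static List<S3ObjectSummary> listAllUnderPrefix(AmazonS3 s3, String bucket,
      String prefix) {
    List<S3ObjectSummary> result = new ArrayList<>();
    ObjectListing page = s3.listObjects(new ListObjectsRequest()
        .withBucketName(bucket)
        .withPrefix(prefix)     // e.g. "warehouse/table/"
        .withMaxKeys(1000));    // S3 returns at most 1000 keys per page
    result.addAll(page.getObjectSummaries());
    while (page.isTruncated()) {
      // Each extra page is one more LIST request, not a per-folder probe.
      page = s3.listNextBatchOfObjects(page);
      result.addAll(page.getObjectSummaries());
    }
    return result;
  }
}
{code}

The partition folder names could then be derived locally from the returned keys, with no further calls to S3.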

Note: all of this is in the context of Spark jobs reading hundreds of thousands of Parquet files organized and partitioned hierarchically, as recommended. Every time we read the data, Spark recursively lists all files and folders to discover the partitions (folder names).
