[
https://issues.apache.org/jira/browse/HADOOP-15192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Loughran updated HADOOP-15192:
------------------------------------
Summary: S3A listStatus excessively slow -hurts Spark job partitioning
(was: S3A listStatus excessively slow)
> S3A listStatus excessively slow -hurts Spark job partitioning
> -------------------------------------------------------------
>
> Key: HADOOP-15192
> URL: https://issues.apache.org/jira/browse/HADOOP-15192
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs/s3
> Affects Versions: 2.7.3
> Reporter: Michel Lemay
> Priority: Minor
> Fix For: 2.8.0
>
>
> Symptoms:
> - CloudWatch Metrics for S3 showing an unexpectedly large number of 4xx
> errors in our bucket
> - Performance when listing files recursively is abysmal (15 minutes on our
> bucket, compared to less than 2 minutes using the CLI: `aws s3 ls`)
> Analysis:
> - In the CloudTrail logs for this bucket, we found that it generates one 404
> (NoSuchKey) error per folder listed recursively.
> - Spark recursively calls FileSystem::listStatus (S3AFileSystem
> implementation from Hadoop-aws:2.7.3); which in turn calls getFileStatus to
> determine if it is a directory.
> - It turns out that this call to getFileStatus yields a 404 when the path
> used is a directory but does not end with a slash. It then retries with the
> slash appended (incurring one extra, unneeded call to S3).
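The probe sequence described in the analysis above can be modelled with a small, self-contained sketch. This is a toy simulation, not the S3AFileSystem code: `FakeBucket` and `is_directory` are hypothetical names, and the model assumes every directory is represented by a marker object whose key ends in a slash. It only illustrates why each directory costs one 404 plus one retry.

```python
class FakeBucket:
    """In-memory stand-in for a bucket that counts HEAD-style lookups (hypothetical)."""

    def __init__(self, keys):
        self.keys = set(keys)
        self.head_calls = 0   # total lookups issued
        self.not_found = 0    # lookups that would surface as a 404

    def head(self, key):
        self.head_calls += 1
        if key in self.keys:
            return True
        self.not_found += 1
        return False


def is_directory(bucket, path):
    """Probe the bare key first, then the key with a trailing slash."""
    bare = path.rstrip("/")          # mimics a path whose trailing slash was stripped
    if bucket.head(bare):
        return False                 # a plain object exists at this key
    return bucket.head(bare + "/")   # retry against the directory marker


bucket = FakeBucket([
    "data/",
    "data/year=2017/",
    "data/year=2018/",
    "data/year=2017/part-0.parquet",
])
dirs = ["data", "data/year=2017", "data/year=2018"]
results = [is_directory(bucket, d) for d in dirs]
print(results, bucket.head_calls, bucket.not_found)
```

In this model, three directories cost six lookups, three of which miss, which matches the one-404-per-folder pattern seen in CloudTrail.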
> Questions:
> - Why is the trailing slash removed in the first place? (The Hadoop Path
> class normalizes paths by removing trailing slashes on construction.)
> - S3AFileSystem::listStatus needs to know if the path is a directory.
> However, it’s a common usage pattern to already have that FileStatus object
> in hand when recursively listing files, so the extra lookup incurs an
> unneeded performance penalty. The base FileSystem class could offer an
> optimized API that uses this assumption (or fix the unoptimized call to
> listStatus made by listLocatedStatus(recursive=true)).
> - I might be wrong on this last bullet, but I think the S3 object API fetches
> every object under a prefix (not just the current level) and filters them out.
> If that is the case, there should be an opportunity for an efficient
> recursive listStatus implementation for S3 using paginated calls on the
> top-level folder only.
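The last question above suggests listing everything under one prefix in pages rather than walking folder by folder. A minimal sketch of that idea follows, using an in-memory list of keys in place of a real bucket. `list_page` and `list_recursive` are hypothetical helpers, and the page size of 10 is chosen only to make the pagination visible; nothing here is the actual S3A implementation.

```python
def list_page(keys, prefix, start_after=None, max_keys=1000):
    """Hypothetical stand-in for one paginated list call: ordered keys under a prefix."""
    matching = [
        k for k in sorted(keys)
        if k.startswith(prefix) and (start_after is None or k > start_after)
    ]
    page = matching[:max_keys]
    truncated = len(matching) > max_keys   # more keys remain after this page
    return page, truncated


def list_recursive(keys, prefix, max_keys=1000):
    """Collect every key under the prefix and count the list calls needed."""
    found, calls, marker = [], 0, None
    while True:
        page, truncated = list_page(keys, prefix, marker, max_keys)
        calls += 1
        found.extend(page)
        if not truncated:
            return found, calls
        marker = page[-1]                  # resume after the last key seen


keys = [
    f"table/year={y}/month={m:02d}/part-0.parquet"
    for y in (2017, 2018)
    for m in range(1, 13)
]
keys.append("other/file.txt")

files, calls = list_recursive(keys, "table/", max_keys=10)
print(len(files), calls)
```

Here 24 files spread over 24 leaf folders are found with 3 list calls, whereas a folder-by-folder walk would need at least one call per folder.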
>
> Note: all of this is in the context of Spark jobs reading hundreds of thousands
> of Parquet files organized and partitioned hierarchically, as recommended.
> Every time we read the data, Spark recursively lists all files and folders to
> discover the partitions (folder names).
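Partition discovery from folder names, as mentioned in the note above, can be illustrated once a flat list of file keys is available. The sketch below assumes folders are named in a `column=value` style, which the report does not state explicitly; `partition_values` is a hypothetical helper, not Spark's own logic.

```python
def partition_values(key, root):
    """Return {column: value} parsed from the folder names between root and the file."""
    relative = key[len(root):] if key.startswith(root) else key
    folders = relative.split("/")[:-1]   # drop the file name itself
    # Assumption: partition folders look like "column=value"
    return dict(f.split("=", 1) for f in folders if "=" in f)


keys = [
    "table/year=2017/month=01/part-0.parquet",
    "table/year=2017/month=02/part-0.parquet",
    "table/year=2018/month=01/part-0.parquet",
]

# Distinct partitions, derived without any per-folder directory checks
partitions = sorted({tuple(partition_values(k, "table/").items()) for k in keys})
print(partitions)
```

Under that naming assumption, the partitions fall out of the key strings alone, so no per-folder status lookup is required to find them.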
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)