GitHub user steveloughran opened a pull request:
https://github.com/apache/spark/pull/14731
[SPARK-17159] [streaming]: optimise check for new files in FileInputDStream
## What changes were proposed in this pull request?
This PR optimises the filesystem metadata reads in `FileInputDStream` by
moving the filters used in `FileSystem.globStatus` and `FileSystem.listStatus`
into filtering of the `FileStatus` instances returned in the results, thereby
avoiding the need to create extra `FileStatus` instances within the
`FileSystem` operation (see the sketch after the list below).
* This doesn't add overhead to the filtering process; that is done as
post-processing in the `FileSystem` glob/list operations anyway.
* At worst it may result in larger lists being built up and returned.
* For every glob match of a file, the code saves 1 RPC call to the HDFS
NN, or 1 GET against S3.
* For every glob match of a directory, the code saves 1 RPC call to the NN
and 2-3 HTTP calls to S3 for the directory check (including a slow List call
whenever the directory has children, as it no longer exists as a blob).
* For the modtime check of every file, it saves a Hadoop RPC call; against
all object stores *which don't implement any client-side cache*, an HTTP GET.
* By entirely eliminating all `getFileStatus()` calls on the listed files,
it should reduce the risk of AWS S3 throttling the HTTP requests, as it does
when too many requests are made to parts of a single S3 bucket.
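To make the pattern concrete, here is a minimal Scala sketch of the
before/after, under stated assumptions: it is not the PR's actual code, and
the `s3a://` path and the directory/modtime checks are illustrative. The key
observation is that a `PathFilter` only receives a `Path`, so any attribute
check inside it has to fetch a `FileStatus` itself, whereas the array
returned by an unfiltered glob/list already carries those attributes:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path, PathFilter}

object PostFilteringSketch {
  def main(args: Array[String]): Unit = {
    val globPath = new Path("s3a://bucket/incoming/*")   // hypothetical path
    val fs: FileSystem = globPath.getFileSystem(new Configuration())
    val minModTime = System.currentTimeMillis() - 60000L // illustrative window

    // Before: a PathFilter only receives the Path, so checking any attribute
    // forces a fresh getFileStatus() -- one extra NN RPC or S3 GET per match.
    val before: Array[FileStatus] = fs.globStatus(globPath, new PathFilter {
      override def accept(p: Path): Boolean = {
        val st = fs.getFileStatus(p)        // the extra call being eliminated
        !st.isDirectory && st.getModificationTime >= minModTime
      }
    })

    // After: glob unfiltered, then filter the FileStatus instances the
    // operation already built; no further filesystem calls are made.
    // (globStatus returns null when nothing matches, hence the Option.)
    val after: Array[FileStatus] =
      Option(fs.globStatus(globPath)).getOrElse(Array.empty[FileStatus])
        .filter(st => !st.isDirectory && st.getModificationTime >= minModTime)
  }
}
```

The glob itself still enumerates the store either way; what disappears is the
extra per-match round trip made inside the filter callback.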
## How was this patch tested?
Running the Spark streaming tests as a regression suite. In the SPARK-7481
cloud code, I could add a test against S3 which prints to stdout the exact
number of HTTP requests made to S3 before and after the patch, so as to
validate the speedup. (The S3A metrics in Hadoop 2.8+ are accessible at the
API level, but as that API was only added in 2.8, using it would stop the
proposed module building against Hadoop 2.7; logging and manual assessment
is the only cross-version strategy.)
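As a hedged illustration of the measurement described above, the sketch
below uses the `StorageStatistics` API that Hadoop 2.8 added to
`FileSystem`; exactly because that API is 2.8-only, it could not be used in
a module that must also build against 2.7. The bucket path and the counter
names it would print (e.g. S3A's `object_list_requests`) are assumptions,
not taken from this PR:

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object CountStoreRequests {
  // Snapshot every long-valued counter this FileSystem instance publishes.
  def snapshot(fs: FileSystem): Map[String, Long] =
    fs.getStorageStatistics.getLongStatistics.asScala
      .map(s => s.getName -> s.getValue)
      .toMap

  def main(args: Array[String]): Unit = {
    val path = new Path("s3a://bucket/incoming/")   // hypothetical bucket
    val fs = path.getFileSystem(new Configuration())

    val before = snapshot(fs)
    fs.listStatus(path)                             // operation being measured
    // Print only the counters that moved, e.g. S3A's object_list_requests
    // (counter names are store-specific and shown here as an assumption).
    for ((name, v) <- snapshot(fs) if v != before.getOrElse(name, 0L))
      println(s"$name: ${v - before.getOrElse(name, 0L)}")
  }
}
```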
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/steveloughran/spark cloud/SPARK-17159-listfiles
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14731.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14731
----
commit 738c51bb57f331c58a877aa20aa5e2beb1084114
Author: Steve Loughran <[email protected]>
Date: 2016-08-20T10:50:34Z
SPARK-17159: move filtering of directories and files out of glob/list
filters and into filtering of the FileStatus instances returned in the
results, so avoiding the need to create FileStatus instances within the
FileSystem operation.
-This doesn't add overhead to the filtering process; that's done as
post-processing in FileSystem anyway. At worst it may result in larger lists
being built up and returned.
-For every glob match, the code saves 2 RPC calls to the HDFS NN
-The code saves 1-3 HTTP calls to S3 for the directory check (including a
slow List call whenever the directory has children, as it no longer exists
as a blob)
-For the modtime check of every file, it saves an HTTP GET
The whole modtime cache can be eliminated; it's a performance optimisation
to avoid the overhead of the file checks, one that is no longer needed.
----