GitHub user steveloughran opened a pull request:

    https://github.com/apache/spark/pull/14731

    [SPARK-17159] [streaming]: optimise check for new files in FileInputDStream

    ## What changes were proposed in this pull request?
    
    This PR optimises the filesystem metadata reads in `FileInputDStream` by 
moving the filters passed to `FileSystem.globStatus` and `FileSystem.listStatus` 
into post-filtering of the `FileStatus` instances returned in the results, 
avoiding the need to create extra `FileStatus` instances within the `FileSystem` 
operation.
    
    * This doesn't add overhead to the filtering process; that filtering is 
done as post-processing of the `FileSystem` glob/list results anyway.
    * At worst it may result in larger lists being built up and returned.
    * For every glob match of a file, the code saves 1 RPC call to the HDFS 
NameNode, and 1 HTTP GET against S3.
    * For every glob match of a directory, the code saves 1 RPC call and 2-3 
HTTP calls to S3 for the directory check (including a slow LIST call whenever 
the directory has children, as it no longer exists as a blob).
    * For the modtime check of every file, it saves a Hadoop RPC call; against 
object stores *which don't implement any client-side cache*, it also saves an 
HTTP GET.
    * By entirely eliminating all `getFileStatus()` calls on the listed files, 
it should reduce the risk of AWS S3 throttling the HTTP requests, as it does 
when too many requests are made to parts of a single S3 bucket.
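    The pattern behind the bullet points above can be sketched in plain Scala. 
This is an illustrative stand-in only, not the actual Spark or Hadoop code: the 
`FileStatus` case class and method names here are invented for the example, and 
the real `FileSystem` API takes a `PathFilter` rather than a `String => Boolean`.
    
    ```scala
    // Simplified stand-in for Hadoop's FileStatus; fields are illustrative.
    case class FileStatus(path: String, isDirectory: Boolean, modTime: Long)
    
    object ListingSketch {
      // Before: a path-based filter evaluated inside the list/glob operation.
      // Against a real FileSystem, deciding acceptance per path can force
      // extra getFileStatus() round trips (RPC to the NameNode, GETs to S3).
      def listWithFilter(all: Seq[FileStatus],
                         accept: String => Boolean): Seq[FileStatus] =
        all.filter(s => accept(s.path))
    
      // After: list everything in one call, then filter the returned
      // FileStatus instances locally. No further metadata reads are needed,
      // because the listing already carries isDirectory and modTime.
      def listThenFilter(all: Seq[FileStatus],
                         newerThan: Long): Seq[FileStatus] =
        all.filter(s => !s.isDirectory && s.modTime >= newerThan)
    
      def main(args: Array[String]): Unit = {
        val listing = Seq(
          FileStatus("/logs/a.txt", isDirectory = false, modTime = 100L),
          FileStatus("/logs/old.txt", isDirectory = false, modTime = 10L),
          FileStatus("/logs/sub", isDirectory = true, modTime = 100L))
        println(listThenFilter(listing, newerThan = 50L).map(_.path))
        // prints List(/logs/a.txt)
      }
    }
    ```
    
    The point is that the post-filter runs entirely on data the single listing 
call already returned, so the per-file and per-directory probes listed above 
disappear.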
    
    ## How was this patch tested?
    
    Running the Spark Streaming tests as a regression suite. In the SPARK-7481 
cloud code, I could add a test against S3 which prints to stdout the exact 
number of HTTP requests made to S3 before and after the patch, so as to 
validate the speedup. (The S3A metrics in Hadoop 2.8+ are accessible at the 
API level, but as that API was only added in 2.8, using it would stop the 
proposed module building against Hadoop 2.7. Logging and manual assessment is 
the only cross-version strategy.)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/steveloughran/spark cloud/SPARK-17159-listfiles

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14731.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14731
    
----
commit 738c51bb57f331c58a877aa20aa5e2beb1084114
Author: Steve Loughran <ste...@apache.org>
Date:   2016-08-20T10:50:34Z

    SPARK-17159: move filtering of directories and files out of glob/list 
filters and into filtering of the FileStatus instances returned in the results, 
so avoiding the need to create FileStatus instances within the FileSystem 
operation.
    
    -This doesn't add overhead to the filtering process; that's done as 
post-processing in FileSystem anyway. At worst it may result in larger lists 
being built up and returned.
    -For every glob match, the code saves 2 RPC calls to the HDFS NN
    -The code saves 1-3 HTTP calls to S3 for the directory check (including a 
slow List call whenever the directory has children as it doesn't exist as a 
blob any more)
    -for the modtime check of every file, it saves an HTTP GET
    
    The whole modtime cache can be eliminated; it's a performance optimisation 
to avoid the overhead of the file checks, one that is no longer needed.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
