GitHub user steveloughran opened a pull request: https://github.com/apache/spark/pull/14731
[SPARK-17159] [STREAMING] Optimise check for new files in FileInputDStream

## What changes were proposed in this pull request?

This PR optimises the filesystem metadata reads in `FileInputDStream` by moving the filters used in `FileSystem.globStatus` and `FileSystem.listStatus` into filtering of the `FileStatus` instances returned in the results, avoiding the need to create extra `FileStatus` instances within the `FileSystem` operation.

* This doesn't add overhead to the filtering process; that's done as post-processing in the `FileSystem` glob/list operations anyway. At worst it may result in larger lists being built up and returned.
* For every glob match of a file, the code saves 1 RPC call to the HDFS NameNode, or 1 GET against S3.
* For every glob match of a directory, the code saves 1 RPC call, or 2-3 HTTP calls to S3 for the directory check (including a slow LIST call whenever the directory has children, as it no longer exists as a blob).
* For the modtime check of every file, it saves a Hadoop RPC call; against object stores which don't implement any client-side cache, an HTTP GET.
* By entirely eliminating all `getFileStatus()` calls on the listed files, it should reduce the risk of AWS S3 throttling the HTTP requests, as it does when too many requests are made against parts of a single S3 bucket.

## How was this patch tested?

Running the Spark Streaming tests as a regression suite. In the SPARK-7481 cloud code, a test against S3 could print to stdout the exact number of HTTP requests made to S3 before and after the patch, so as to validate the speedup. (The S3A metrics in Hadoop 2.8+ are accessible at the API level, but only through a new API added in 2.8; using that would stop the proposed module building against Hadoop 2.7. Logging and manual assessment is the only cross-version strategy.)
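The pattern described above — listing once and filtering the returned `FileStatus` entries in memory, rather than passing a filter into the glob/list call whose evaluation costs a `getFileStatus()` RPC per candidate path — can be sketched as follows. This is a minimal illustration, not the actual `FileInputDStream` code: the `FileStatusLike` case class, the method name `findNewFiles`, and the sample paths are all stand-ins for the real Hadoop types and Spark internals.

```scala
// Sketch: post-filter listing results instead of filtering inside the
// FileSystem operation. Hadoop's FileStatus is stood in for by a minimal
// case class so the example is self-contained; names are illustrative.
case class FileStatusLike(path: String, isDirectory: Boolean, modTime: Long)

object PostFilterSketch {
  // Before the change: a PathFilter evaluated during glob/list, typically
  // costing one getFileStatus() RPC (or HTTP GET against S3) per path.
  // After: list everything once, then filter the statuses already in hand.
  def findNewFiles(listed: Seq[FileStatusLike],
                   ignoreThreshold: Long): Seq[String] =
    listed
      .filter(!_.isDirectory)               // directory check: no extra RPC
      .filter(_.modTime >= ignoreThreshold) // modtime check: no extra GET
      .map(_.path)

  def main(args: Array[String]): Unit = {
    val listed = Seq(
      FileStatusLike("/in/part-0", isDirectory = false, modTime = 100L),
      FileStatusLike("/in/old",    isDirectory = false, modTime = 10L),
      FileStatusLike("/in/sub",    isDirectory = true,  modTime = 200L))
    println(findNewFiles(listed, ignoreThreshold = 50L)) // List(/in/part-0)
  }
}
```

Because the listing already materialises a `FileStatus` per entry, the post-hoc filter re-uses data the client holds anyway; the only trade-off is that unfiltered entries briefly occupy the returned list.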
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/steveloughran/spark cloud/SPARK-17159-listfiles

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14731.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14731

----

commit 738c51bb57f331c58a877aa20aa5e2beb1084114
Author: Steve Loughran <ste...@apache.org>
Date: 2016-08-20T10:50:34Z

    SPARK-17159: move filtering of directories and files out of glob/list
    filters and into filtering of the FileStatus instances returned in the
    results, avoiding the need to create FileStatus instances.

    - This doesn't add overhead to the filtering process; that's done as
      post-processing in FileSystem anyway. At worst it may result in
      larger lists being built up and returned.
    - For every glob match, the code saves 2 RPC calls to the HDFS NN.
    - The code saves 1-3 HTTP calls to S3 for the directory check
      (including a slow LIST call whenever the directory has children, as
      it no longer exists as a blob).
    - For the modtime check of every file, it saves an HTTP GET.

    The whole modtime cache can be eliminated; it's a performance
    optimisation to avoid the overhead of the file checks, one that is no
    longer needed.