GitHub user steveloughran opened a pull request: https://github.com/apache/spark/pull/17745
[SPARK-17159][Streaming] optimise check for new files in FileInputDStream

## What changes were proposed in this pull request?

Changes to `FileInputDStream` to eliminate multiple `getFileStatus()` calls when scanning directories for new files.

This is a minor optimisation when working with filesystems, but significant when working with object stores, as it eliminates HTTP requests for each source file when scanning the system. The current cost is 1-3 requests to probe whether a path is a directory, plus one more to read a file's timestamp. The new patch gets the file status once and retains it through all the operations, so it does not need to be re-evaluated.

The optimisation saves 3 HTTP requests per source directory and 1 per file, for every directory in the scan list and every file in the scanned directories, irrespective of the age of the directories. At 100+ ms per HEAD request against S3, the speedup is significant, even when there are few files in the scanned directories.

#### Before

1. Two separate list operations: `globStatus()` to find directories, then `listStatus()` to scan for new files under those directories.
1. The path filter in the `globStatus()` operation calls `getFileStatus(filename)` to probe whether a file is a directory.
1. `getFileStatus()` is also used in the `listStatus()` call to check the timestamp.

Against an object store, `getFileStatus()` can cost 1-4 HTTPS requests per call (HEAD path, HEAD path + "/", LIST path).

As both list operations return an array or iterator of `FileStatus` objects, these extra `getFileStatus()` calls are utterly superfluous. Instead, the filtering can take place after the listing has returned.

#### After

1. The output of `globStatus()` is filtered to select only directories.
1. The output of `listStatus()` is filtered by timestamp.
1. The special failure case of `globStatus()`, no path, is handled in the warning text by saying "No Directory to scan" and omitting the full stack trace.
1. The `fileToModTime` map is superfluous, and so deleted.

## How was this patch tested?

1. There is a new test in `org.apache.spark.streaming.InputStreamsSuite`.
1. I have object store integration tests in an external repository, which have been used to verify functionality and that the number of HTTP requests is reduced when invoked against S3A endpoints.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/steveloughran/spark cloud/SPARK-17159-listfiles-minimal

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17745.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17745

----
commit f3ffe1db2e5edc9b6a60fb48b34b3099853e4324
Author: Steve Loughran <ste...@hortonworks.com>
Date:   2017-04-24T13:04:04Z

    SPARK-17159 minimal patch of hchanges to FileInputDStream to reduce File status requests when querying files.

    This is a minor optimisation when working with filesystems, but significant when working with object stores.

    Change-Id: I269d98902f615818941c88de93a124c65453756e

----
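To make the before/after difference described in this pull request concrete, here is a minimal sketch in Python. It is not the patch itself and does not use the Hadoop or Spark APIs: `FileStatus`, `CountingStore`, `sample_entries`, `new_files_before` and `new_files_after` are all hypothetical stand-ins, a single `list_status()` call stands in for both `globStatus()` and `listStatus()`, and each simulated `get_file_status()` call is assumed to represent one or more HTTP requests against an object store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStatus:
    """Hypothetical stand-in for a file status record returned by a listing."""
    path: str
    is_directory: bool
    modification_time: int  # milliseconds since the epoch


class CountingStore:
    """Toy store that counts simulated remote calls (an assumption, not a real API)."""

    def __init__(self, entries):
        self._entries = {e.path: e for e in entries}
        self.status_calls = 0  # per-path status probes
        self.list_calls = 0    # directory listings

    def get_file_status(self, path):
        self.status_calls += 1
        return self._entries[path]

    def list_status(self, directory):
        """Return the status of each direct child of `directory`, sorted by path."""
        self.list_calls += 1
        prefix = directory.rstrip("/") + "/"
        return [status for path, status in sorted(self._entries.items())
                if path.startswith(prefix) and "/" not in path[len(prefix):]]


def new_files_before(store, root, threshold):
    """Old pattern: re-probe every listed path, although its status is already known."""
    dirs = [s.path for s in store.list_status(root)
            if store.get_file_status(s.path).is_directory]
    found = []
    for d in dirs:
        for s in store.list_status(d):
            status = store.get_file_status(s.path)  # redundant probe
            if not status.is_directory and status.modification_time >= threshold:
                found.append(s.path)
    return found


def new_files_after(store, root, threshold):
    """New pattern: filter on the statuses that the listings already returned."""
    dirs = [s for s in store.list_status(root) if s.is_directory]
    return [s.path
            for d in dirs
            for s in store.list_status(d.path)
            if not s.is_directory and s.modification_time >= threshold]


def sample_entries():
    """Small illustrative tree: two directories, one stray file, three data files."""
    return [
        FileStatus("/in/README", False, 5),
        FileStatus("/in/a", True, 0),
        FileStatus("/in/b", True, 0),
        FileStatus("/in/a/old.txt", False, 100),
        FileStatus("/in/a/new.txt", False, 900),
        FileStatus("/in/b/new2.txt", False, 950),
    ]


if __name__ == "__main__":
    before, after = CountingStore(sample_entries()), CountingStore(sample_entries())
    print(new_files_before(before, "/in", 500), before.status_calls)
    print(new_files_after(after, "/in", 500), after.status_calls)
```

In this toy model both functions select the same files and issue the same number of listings, but only the first issues per-path status probes. Against a real object store each such probe would translate into the HEAD and LIST requests described above.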