GitHub user brkyvz opened a pull request:

    https://github.com/apache/spark/pull/15122

    [SPARK-17569] Make StructuredStreaming FileStreamSource batch generation faster

    ## What changes were proposed in this pull request?
    
    While getting the batch for a `FileStreamSource` in StructuredStreaming, we
    already know exactly which files to read: we have verified that they exist
    and have committed them to a metadata log. However, when creating the
    FileSourceRelation for an incremental execution, the code checks the
    existence of every single file once again.
    
    With hundreds of thousands of files in a folder, creating the first batch
    takes more than 2 hours when working with S3. This PR disables that check.
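
A minimal sketch of the idea, not the actual Spark change: resolution takes a flag that lets a caller who has already verified the paths skip the per-file existence probe. `FileRelation`, `RelationResolver`, `resolve`, `checkPathExist` and the `exists` parameter are hypothetical names used only for illustration, and `java.nio.file` stands in for the remote file system.

```scala
import java.io.FileNotFoundException
import java.nio.file.{Files, Path, Paths}

// Hypothetical stand-in for the relation that a batch resolves to.
final case class FileRelation(paths: Seq[Path])

object RelationResolver {
  // `exists` is injectable so the number of probes can be observed;
  // against S3 each probe would be a separate remote round trip.
  def resolve(
      paths: Seq[Path],
      checkPathExist: Boolean = true,
      exists: Path => Boolean = p => Files.exists(p)): FileRelation = {
    if (checkPathExist) {
      paths.foreach { p =>
        if (!exists(p)) throw new FileNotFoundException(s"Path does not exist: $p")
      }
    }
    // With checkPathExist = false, no probe is issued at all.
    FileRelation(paths)
  }
}
```

A streaming caller that has already committed its file list to the metadata log would call `RelationResolver.resolve(paths, checkPathExist = false)`, while other callers keep the default and still fail fast on missing paths.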
    
    
    ## How was this patch tested?
    
    I couldn't find an easy way to add a unit test for this. Mocking
    `FileSystem` is hard because the `FileSystem` instance is generated per
    file inside the method itself. The only workaround might be to expose a
    method, for testing, that generates the `FileSystem`.
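
The workaround mentioned above could look roughly like the following sketch, in which the per-file file-system lookup is moved into an overridable method so a test can substitute a recording stub. `SimpleFs`, `PathChecker`, `fileSystemFor` and `RecordingChecker` are hypothetical names and do not correspond to Spark or Hadoop APIs.

```scala
import java.nio.file.{Files, Paths}
import scala.collection.mutable.ArrayBuffer

// Hypothetical minimal file-system interface; not Hadoop's FileSystem API.
trait SimpleFs {
  def exists(path: String): Boolean
}

class PathChecker {
  // Overridable seam: the default builds a file system backed by local disk.
  protected def fileSystemFor(path: String): SimpleFs = new SimpleFs {
    def exists(p: String): Boolean = Files.exists(Paths.get(p))
  }

  // Returns the paths whose file system reports them as missing.
  def missingPaths(paths: Seq[String]): Seq[String] =
    paths.filterNot(p => fileSystemFor(p).exists(p))
}

// Test double that records every lookup and never touches storage.
class RecordingChecker(existing: Set[String]) extends PathChecker {
  val lookups: ArrayBuffer[String] = ArrayBuffer.empty[String]

  override protected def fileSystemFor(path: String): SimpleFs = {
    lookups += path
    new SimpleFs {
      def exists(p: String): Boolean = existing.contains(p)
    }
  }
}
```

With a seam like this, a unit test could assert how many file-system lookups a batch triggers, which is the property this PR is concerned with.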

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/brkyvz/spark SPARK-17569

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15122.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15122
    
----
commit 9b7e2de8d9b6a80095300b9e9f64bc620aa50023
Author: Burak Yavuz <brk...@gmail.com>
Date:   2016-09-16T23:00:26Z

    make StructuredStreaming fileSource batch generation faster

----

