[GitHub] spark pull request #15122: [SPARK-17569] Make StructuredStreaming FileStream...

brkyvz Fri, 16 Sep 2016 16:05:45 -0700

GitHub user brkyvz opened a pull request:

    https://github.com/apache/spark/pull/15122


    [SPARK-17569] Make StructuredStreaming FileStreamSource batch generation faster

    ## What changes were proposed in this pull request?
    
    While getting the batch for a `FileStreamSource` in Structured Streaming,
    we already know exactly which files we must read. We have already verified
    that they exist and have committed them to a metadata log. However, when
    creating the FileSourceRelation for an incremental execution, the code
    checks the existence of every single file once again!
    
    When a folder contains hundreds of thousands of files, creating the first
    batch takes more than 2 hours when working with S3! This PR disables that
    check.
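
    The effect can be illustrated with a minimal Python sketch. This is not
    the Spark code: `CountingFileSystem`, `resolve_relation`, and the
    `check_path_exist` flag are hypothetical names chosen for illustration,
    and the fake file system only counts how many existence lookups are made.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class CountingFileSystem:
    """Hypothetical stand-in for a remote store such as S3; counts lookups."""
    files: Set[str]
    exists_calls: int = 0

    def exists(self, path: str) -> bool:
        self.exists_calls += 1  # on a remote store, each call is a round trip
        return path in self.files


def resolve_relation(fs: CountingFileSystem, paths: List[str],
                     check_path_exist: bool = True) -> List[str]:
    """Return the paths to read, optionally verifying that each one exists."""
    if check_path_exist:
        missing = [p for p in paths if not fs.exists(p)]
        if missing:
            raise FileNotFoundError(f"Path does not exist: {missing[0]}")
    return list(paths)


# Files that were already verified and recorded in the metadata log.
batch = [f"s3://bucket/data/part-{i:05d}.json" for i in range(1000)]
fs = CountingFileSystem(files=set(batch))

resolve_relation(fs, batch)                          # default: one lookup per file
checked = fs.exists_calls
resolve_relation(fs, batch, check_path_exist=False)  # streaming batch: no lookups
skipped = fs.exists_calls - checked
print(checked, skipped)  # prints: 1000 0
```

    With the check enabled, the number of lookups grows linearly with the
    number of files in the batch; with it disabled, none are made.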
    
    
    ## How was this patch tested?
    
    I couldn't find an easy way to add a unit test for this. It's not easy to
    add a mock `FileSystem`, because the `FileSystem` instance is generated per
    file in the method itself. Perhaps the only workaround would be to expose
    a method for testing that generates the `FileSystem`?
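
    One possible shape for such a hook, sketched in Python rather than in the
    Spark code base. Everything here is hypothetical (`RelationBuilder`,
    `fs_for_path`, `FakeFileSystem`); it only shows how an overridable factory
    would let a test observe whether any existence check happens.

```python
import os
from typing import Callable, List, Optional


class LocalFileSystem:
    """Default file system: checks the local disk."""

    def exists(self, path: str) -> bool:
        return os.path.exists(path)


class FakeFileSystem:
    """Test double that records every path it is asked about."""

    def __init__(self) -> None:
        self.checked: List[str] = []

    def exists(self, path: str) -> bool:
        self.checked.append(path)
        return True


class RelationBuilder:
    """Hypothetical builder that obtains one file system per path."""

    def __init__(self, fs_for_path: Optional[Callable[[str], object]] = None):
        # Tests pass their own factory; production code uses the default.
        self._fs_for_path = fs_for_path or (lambda path: LocalFileSystem())

    def build(self, paths: List[str], check_path_exist: bool = True) -> List[str]:
        if check_path_exist:
            for path in paths:
                fs = self._fs_for_path(path)  # file system generated per file
                if not fs.exists(path):
                    raise FileNotFoundError(path)
        return list(paths)


fake = FakeFileSystem()
builder = RelationBuilder(fs_for_path=lambda path: fake)
paths = ["/data/a.json", "/data/b.json"]

builder.build(paths, check_path_exist=False)
lookups_without_check = len(fake.checked)

builder.build(paths)
lookups_with_check = len(fake.checked)
print(lookups_without_check, lookups_with_check)  # prints: 0 2
```

    The trade-off is that the factory exists mainly for tests, which is the
    concern raised above.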

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/brkyvz/spark SPARK-17569

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15122.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15122
    
----
commit 9b7e2de8d9b6a80095300b9e9f64bc620aa50023
Author: Burak Yavuz <[email protected]>
Date:   2016-09-16T23:00:26Z

    make StructuredStreaming fileSource batch generation faster

----

