[GitHub] spark pull request: [SPARK-14832][SQL][STREAMING]

tdas Thu, 21 Apr 2016 17:55:29 -0700

GitHub user tdas opened a pull request:

    https://github.com/apache/spark/pull/12591


    [SPARK-14832][SQL][STREAMING] 

    ## What changes were proposed in this pull request?
    
    When creating a file stream using sqlContext.read.stream(), existing files
    are scanned twice to find the schema:
    - Once, when creating a DataSource + StreamingRelation in
      DataFrameReader.stream()
    - Again, when creating the streaming Source from the DataSource, in
      DataSource.createSource()
    
    Instead, the schema should be generated only once, at the time of creating
    the DataFrame. When the streaming source is created, it should just reuse
    that schema.
    
    The solution proposed in this PR is to add a lazy field in DataSource that
    caches the schema. The streaming Source created by the DataSource can then
    just reuse the schema.
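    
    As a rough illustration of the lazy-field idea (not the actual patch), the
    sketch below uses hypothetical stand-in classes: Schema, SketchSource and
    SketchDataSource are invented for this example and are not Spark's real
    types. It only shows that a Scala lazy val runs the inference at most once
    and that every source created afterwards reuses the cached result.
    
    ```scala
    // Hypothetical stand-ins for illustration only; not Spark's classes.
    final case class Schema(fieldNames: Seq[String])
    
    class SketchSource(val schema: Schema)
    
    class SketchDataSource(inferSchema: () => Schema) {
      // Evaluated on first access, then cached for all later accesses.
      lazy val sourceSchema: Schema = inferSchema()
    
      // Each new source reuses the cached schema instead of re-inferring it.
      def createSource(): SketchSource = new SketchSource(sourceSchema)
    }
    
    object LazySchemaDemo {
      // Returns (number of inference runs, whether all schemas agree).
      def run(): (Int, Boolean) = {
        var scans = 0
        val ds = new SketchDataSource(() => { scans += 1; Schema(Seq("value")) })
        val fromReader = ds.sourceSchema // first access triggers inference
        val first = ds.createSource()    // reuses the cached schema
        val second = ds.createSource()   // e.g. after a query restart
        (scans, first.schema == fromReader && second.schema == fromReader)
      }
    }
    ```
    
    In this sketch the counter ends at one no matter how many sources are
    created, which is the behavior the lazy field is meant to provide.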
    
    In addition to the bug fix, this PR refactors the tests to fix the
    following problem.
    The current StreamTest allows testing of a streaming Dataset that is
    generated by explicitly wrapping a source. This is different from the
    actual production code path, where the source object is dynamically created
    through a DataSource object every time a query is started. So all the
    fault-tolerance testing in FileSourceSuite and FileSourceStressSuite is not
    really testing the actual code path, as the tests just reuse the
    FileStreamSource object. This PR fixes StreamTest and the
    FileSource***Suite to test this correctly. Instead of maintaining a mapping
    of source --> expected offset in StreamTest (which requires reuse of the
    source object), it now maintains a mapping of source index --> offset, so
    that it is independent of the source object.
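    
    The difference between the two mappings can be sketched as follows. This
    is only an illustration under assumed names: Offset, FakeSource and
    ExpectedOffsets are hypothetical and are not StreamTest's real API. The
    point is that a map keyed by source object loses its entries once the
    source is recreated, while a map keyed by source index does not.
    
    ```scala
    // Hypothetical types for illustration; not StreamTest's actual API.
    final case class Offset(value: Long)
    
    // A plain class, so two instances are never equal to each other.
    class FakeSource(val name: String)
    
    class ExpectedOffsets {
      private var byIndex = Map.empty[Int, Offset]
    
      // Record the expected offset for the source at the given position.
      def record(sourceIndex: Int, offset: Offset): Unit =
        byIndex = byIndex.updated(sourceIndex, offset)
    
      def expected(sourceIndex: Int): Option[Offset] = byIndex.get(sourceIndex)
    }
    
    object IndexKeyedDemo {
      // Returns (lookup by recreated object, lookup by index).
      def run(): (Option[Offset], Option[Offset]) = {
        val firstRun = Vector(new FakeSource("files"))
        val byObject = Map(firstRun(0) -> Offset(3L))
        val byIndex = new ExpectedOffsets
        byIndex.record(0, Offset(3L))
    
        // Simulate a query restart: the source is created again from scratch.
        val restarted = Vector(new FakeSource("files"))
        (byObject.get(restarted(0)), byIndex.expected(0))
      }
    }
    ```
    
    Here the object-keyed lookup finds nothing after the simulated restart,
    while the index-keyed lookup still returns the recorded offset.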
    
    ## How was this patch tested?
    Refactored unit tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tdas/spark SPARK-14832

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12591.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12591
    
----
commit f9a1524ff16e5a92ad0da56c0d2fab36044e2641
Author: Tathagata Das <[email protected]>
Date:   2016-04-21T01:57:03Z

    Added unit test

commit e9c6d607ca8de4bd89f78ede4839ca54c0be6b1c
Author: Tathagata Das <[email protected]>
Date:   2016-04-21T20:21:04Z

    Refactored file source to avoid multiple schema inferences

commit d7efacb959667b260d110ab58251014f29fc0b4b
Author: Tathagata Das <[email protected]>
Date:   2016-04-22T00:33:06Z

    Refactored test code for better testing

----


