GitHub user joseph-torres opened a pull request: https://github.com/apache/spark/pull/19925
[SPARK-22732] Add Structured Streaming APIs to DataSourceV2

## What changes were proposed in this pull request?

This PR provides DataSourceV2 API support for Structured Streaming, including the new pieces needed to support continuous processing [SPARK-20928]. High-level summary:

- DataSourceV2 includes new mixins to support micro-batch and continuous reads and writes. For reads, we accept an optional user-specified schema rather than using the ReadSupportWithSchema model, because doing so would severely complicate the interface.
- DataSourceV2Reader includes new interfaces to read a specific micro-batch or to read continuously from a given offset. These follow the same setter pattern as the existing Supports* mixins so that they can work with SupportsScanUnsafeRow.
- DataReader (the per-partition reader) has a new subinterface, ContinuousDataReader, used only for continuous processing. This reader has a special method to check progress, and its next() blocks for new input rather than returning false.
- Offset, an abstract representation of position in a streaming query, is ported to the public API. (Each type of reader will define its own Offset implementation.)
- DataSourceV2Writer has a new subinterface, ContinuousWriter, used only for continuous processing. Commits to this interface come tagged with an epoch number, as the execution engine will continue to produce new epoch commits while the task runs indefinitely.

Note that this PR does not propose to change the existing DataSourceV2 batch API, or to deprecate the existing streaming source/sink internal APIs in spark.sql.execution.streaming.

## How was this patch tested?

Toy implementations of the new interfaces, with unit tests.
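To make the continuous-read pattern above concrete, here is a minimal standalone sketch of a per-partition reader whose next() blocks for new input (rather than returning false at end-of-data) and which reports its progress as an Offset. The class and method names (LongOffset, QueueContinuousReader, feed) are illustrative only, not the actual Spark interfaces proposed in this PR.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy sketch of the continuous-read pattern: a blocking per-partition
// reader plus a reader-defined Offset. Standalone illustration, not the
// real Spark API.
public class ContinuousReadSketch {

    // Each reader type defines its own Offset: an opaque, serializable
    // position in the stream (here just a long rendered as JSON).
    static final class LongOffset {
        final long value;
        LongOffset(long value) { this.value = value; }
        String json() { return Long.toString(value); }
    }

    // Per-partition continuous reader: next() blocks until input arrives
    // instead of signalling end-of-data, and getOffset() reports how far
    // this partition has progressed.
    static final class QueueContinuousReader {
        private final BlockingQueue<String> input = new ArrayBlockingQueue<>(16);
        private long position = -1;

        // Hypothetical test hook standing in for an external data feed.
        void feed(String row) throws InterruptedException { input.put(row); }

        // Blocks for new input rather than returning false.
        String next() throws InterruptedException {
            String row = input.take();
            position++;
            return row;
        }

        LongOffset getOffset() { return new LongOffset(position); }
    }

    public static void main(String[] args) throws Exception {
        QueueContinuousReader reader = new QueueContinuousReader();
        reader.feed("a");
        reader.feed("b");
        System.out.println(reader.next());             // a
        System.out.println(reader.next());             // b
        System.out.println(reader.getOffset().json()); // 1
    }
}
```

Restart semantics fall out of the Offset: the engine checkpoints the JSON form and, on recovery, asks the reader to resume from that position.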
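The epoch-tagged commit behavior of the writer side can be sketched the same way. This toy writer accepts repeated commits, each tagged with the epoch it finalizes, because a continuous task never reaches a single terminal commit; LoggingContinuousWriter and its commit signature are invented for illustration and are not the ContinuousWriter interface itself.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of epoch-tagged commits for continuous writes. Standalone
// illustration, not the real Spark interface.
public class ContinuousWriteSketch {

    // Unlike a batch writer, which commits once, a continuous writer
    // receives a stream of commits, one per epoch, for as long as the
    // query runs.
    static final class LoggingContinuousWriter {
        final List<String> committed = new ArrayList<>();

        void commit(long epoch, List<String> rows) {
            committed.add("epoch " + epoch + ": " + rows.size() + " rows");
        }
    }

    public static void main(String[] args) {
        LoggingContinuousWriter writer = new LoggingContinuousWriter();
        // The execution engine keeps producing new epoch commits.
        writer.commit(0, List.of("a", "b"));
        writer.commit(1, List.of("c"));
        writer.committed.forEach(System.out::println);
    }
}
```

Tagging each commit with its epoch is what lets the engine reason about exactly-once delivery per epoch even though the task itself never finishes.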
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/joseph-torres/spark continuous-api

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19925.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19925

----

All commits by Jose Torres <j...@databricks.com>:

    daa3a78ad4dd7ecfc73f5b1dd050388c07b42771  2017-12-05T18:48:20Z  add tests
    edae89508ec2bf02fba00a264cb774b0d60fb068  2017-12-05T19:35:36Z  writer impl
    9b28c524b343018d20d2d8d3c9ed4d3c530c413f  2017-12-05T19:37:24Z  rm useless writer
    7ceda9d63b9914cfd275fc4240fa9c696afa05d1  2017-12-05T21:02:32Z  rm weird docs
    ff7be6914560968af7f2179c3704446c771fad52  2017-12-05T21:59:50Z  shuffle around public interfaces
    4ae516a61af903c37b748a3941c2472d20776ce4  2017-12-05T22:02:01Z  fix imports
    a8ff2ee9eeb992f6c0806cb2b4f33b976ef51cf5  2017-12-05T22:40:15Z  put deserialize in reader so we don't have to port SerializedOffset
    5096d3d551aa4479bfb112b286683e28ec578f3c  2017-12-05T23:51:08Z  off by one errors grr
    da00f6b5ddac8bd6025076a67fd4716d9d070bf7  2017-12-05T23:55:58Z  document right semantics
    1526f433837de78f59009b6632b6920de38bb1b0  2017-12-06T00:08:54Z  document checkpoint location
    33b619ca4f9aa1a82e3830c6e485b8298ca9ff50  2017-12-06T00:43:36Z  add getStart to continuous and clarify semantics
    083b04004f58358b3f6e4c82b4690ca5cf2da764  2017-12-06T17:23:34Z  cleanup offset set/get docs
    4d6244d2ae431f6043de97f3255552ce1c33090c  2017-12-06T17:32:45Z  cleanup reader docs
    5f9df4f1b54cbd0570d0df5567c42ac2575009a5  2017-12-06T18:06:44Z  explain getOffset
    a2323e95ff2d407877ded07b7537bac5b63dda8f  2017-12-06T21:17:43Z  fix fmt
    b80c75cd698cbe4840445efb78a662f02f355a99  2017-12-06T21:24:35Z  fix doc
    03bd69da4b0450e5fec88f4196998e3075e98edc  2017-12-06T21:39:20Z  note interfaces are temporary
    c7bc6a37914312666259bb9724aa7103926e4c0f  2017-12-06T21:43:38Z  fix wording
    7e238c8b4d4477daed4bbfa0cfde1cee2df84705  2017-12-06T22:40:48Z  lifecycle
    51fdd9c7ef0dd44798d3724bbd9cb8e31f9deea5  2017-12-06T22:53:50Z  fix offset semantic implementation
    e442bbcd748059bbd93f16551aca2737ab409d10  2017-12-07T22:09:49Z  remove unneeded restriction
    1fdb2cc5484312fe55961a20ebc4cf553050949f  2017-12-07T22:39:16Z  deserializeOffset