GitHub user joseph-torres opened a pull request: https://github.com/apache/spark/pull/19925
[SPARK-22732] Add Structured Streaming APIs to DataSourceV2

## What changes were proposed in this pull request?

This PR provides DataSourceV2 API support for Structured Streaming, including the new pieces needed to support continuous processing [SPARK-20928]. High-level summary:

- DataSourceV2 includes new mixins to support micro-batch and continuous reads and writes. For reads, we accept an optional user-specified schema rather than using the ReadSupportWithSchema model, because doing so would severely complicate the interface.
- DataSourceV2Reader includes new interfaces to read a specific micro-batch or to read continuously from a given offset. These follow the same setter pattern as the existing Supports* mixins so that they can work with SupportsScanUnsafeRow.
- DataReader (the per-partition reader) has a new subinterface, ContinuousDataReader, used only for continuous processing. This reader has a special method to check progress, and its next() blocks for new input rather than returning false.
- Offset, an abstract representation of position in a streaming query, is ported to the public API. (Each type of reader will define its own Offset implementation.)
- DataSourceV2Writer has a new subinterface, ContinuousWriter, used only for continuous processing. Commits to this interface come tagged with an epoch number, as the execution engine will continue to produce new epoch commits while the task runs indefinitely.

Note that this PR does not propose to change the existing DataSourceV2 batch API, or to deprecate the existing streaming source/sink internal APIs in spark.sql.execution.streaming.

## How was this patch tested?

Toy implementations of the new interfaces, with unit tests.
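To make the continuous-read pattern above concrete, here is a minimal standalone sketch of a per-partition reader whose next() blocks for new input (rather than returning false at end-of-data) and which reports its progress as an Offset. The class and method names (LongOffset, QueueContinuousReader, feed) are illustrative only, not the actual Spark interfaces proposed in this PR.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy sketch of the continuous-read pattern: a blocking per-partition
// reader plus a reader-defined Offset. Standalone illustration, not the
// real Spark API.
public class ContinuousReadSketch {

    // Each reader type defines its own Offset: an opaque, serializable
    // position in the stream (here just a long rendered as JSON).
    static final class LongOffset {
        final long value;
        LongOffset(long value) { this.value = value; }
        String json() { return Long.toString(value); }
    }

    // Per-partition continuous reader: next() blocks until input arrives
    // instead of signalling end-of-data, and getOffset() reports how far
    // this partition has progressed.
    static final class QueueContinuousReader {
        private final BlockingQueue<String> input = new ArrayBlockingQueue<>(16);
        private long position = -1;

        // Hypothetical test hook standing in for an external data feed.
        void feed(String row) throws InterruptedException { input.put(row); }

        // Blocks for new input rather than returning false.
        String next() throws InterruptedException {
            String row = input.take();
            position++;
            return row;
        }

        LongOffset getOffset() { return new LongOffset(position); }
    }

    public static void main(String[] args) throws Exception {
        QueueContinuousReader reader = new QueueContinuousReader();
        reader.feed("a");
        reader.feed("b");
        System.out.println(reader.next());             // a
        System.out.println(reader.next());             // b
        System.out.println(reader.getOffset().json()); // 1
    }
}
```

Restart semantics fall out of the Offset: the engine checkpoints the JSON form and, on recovery, asks the reader to resume from that position.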
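The epoch-tagged commit behavior of the writer side can be sketched the same way. This toy writer accepts repeated commits, each tagged with the epoch it finalizes, because a continuous task never reaches a single terminal commit; LoggingContinuousWriter and its commit signature are invented for illustration and are not the ContinuousWriter interface itself.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of epoch-tagged commits for continuous writes. Standalone
// illustration, not the real Spark interface.
public class ContinuousWriteSketch {

    // Unlike a batch writer, which commits once, a continuous writer
    // receives a stream of commits, one per epoch, for as long as the
    // query runs.
    static final class LoggingContinuousWriter {
        final List<String> committed = new ArrayList<>();

        void commit(long epoch, List<String> rows) {
            committed.add("epoch " + epoch + ": " + rows.size() + " rows");
        }
    }

    public static void main(String[] args) {
        LoggingContinuousWriter writer = new LoggingContinuousWriter();
        // The execution engine keeps producing new epoch commits.
        writer.commit(0, List.of("a", "b"));
        writer.commit(1, List.of("c"));
        writer.committed.forEach(System.out::println);
    }
}
```

Tagging each commit with its epoch is what lets the engine reason about exactly-once delivery per epoch even though the task itself never finishes.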
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/joseph-torres/spark continuous-api

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19925.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19925

----

All commits by Jose Torres <j...@databricks.com>:

    daa3a78ad4dd7ecfc73f5b1dd050388c07b42771  2017-12-05T18:48:20Z  add tests
    edae89508ec2bf02fba00a264cb774b0d60fb068  2017-12-05T19:35:36Z  writer impl
    9b28c524b343018d20d2d8d3c9ed4d3c530c413f  2017-12-05T19:37:24Z  rm useless writer
    7ceda9d63b9914cfd275fc4240fa9c696afa05d1  2017-12-05T21:02:32Z  rm weird docs
    ff7be6914560968af7f2179c3704446c771fad52  2017-12-05T21:59:50Z  shuffle around public interfaces
    4ae516a61af903c37b748a3941c2472d20776ce4  2017-12-05T22:02:01Z  fix imports
    a8ff2ee9eeb992f6c0806cb2b4f33b976ef51cf5  2017-12-05T22:40:15Z  put deserialize in reader so we don't have to port SerializedOffset
    5096d3d551aa4479bfb112b286683e28ec578f3c  2017-12-05T23:51:08Z  off by one errors grr
    da00f6b5ddac8bd6025076a67fd4716d9d070bf7  2017-12-05T23:55:58Z  document right semantics
    1526f433837de78f59009b6632b6920de38bb1b0  2017-12-06T00:08:54Z  document checkpoint location
    33b619ca4f9aa1a82e3830c6e485b8298ca9ff50  2017-12-06T00:43:36Z  add getStart to continuous and clarify semantics
    083b04004f58358b3f6e4c82b4690ca5cf2da764  2017-12-06T17:23:34Z  cleanup offset set/get docs
    4d6244d2ae431f6043de97f3255552ce1c33090c  2017-12-06T17:32:45Z  cleanup reader docs
    5f9df4f1b54cbd0570d0df5567c42ac2575009a5  2017-12-06T18:06:44Z  explain getOffset
    a2323e95ff2d407877ded07b7537bac5b63dda8f  2017-12-06T21:17:43Z  fix fmt
    b80c75cd698cbe4840445efb78a662f02f355a99  2017-12-06T21:24:35Z  fix doc
    03bd69da4b0450e5fec88f4196998e3075e98edc  2017-12-06T21:39:20Z  note interfaces are temporary
    c7bc6a37914312666259bb9724aa7103926e4c0f  2017-12-06T21:43:38Z  fix wording
    7e238c8b4d4477daed4bbfa0cfde1cee2df84705  2017-12-06T22:40:48Z  lifecycle
    51fdd9c7ef0dd44798d3724bbd9cb8e31f9deea5  2017-12-06T22:53:50Z  fix offset semantic implementation
    e442bbcd748059bbd93f16551aca2737ab409d10  2017-12-07T22:09:49Z  remove unneeded restriction
    1fdb2cc5484312fe55961a20ebc4cf553050949f  2017-12-07T22:39:16Z  deserializeOffset