GitHub user marmbrus opened a pull request:
https://github.com/apache/spark/pull/11006
[SPARK-10820][SQL] Support for the continuous execution of structured
queries
This is a follow up to 9aadcffabd226557174f3ff566927f873c71672e that
extends Spark SQL to allow users to _repeatedly_ optimize and execute
structured queries. A `ContinuousQuery` can be expressed using SQL, DataFrames
or Datasets. The purpose of this PR is only to add some initial infrastructure
which will be extended in subsequent PRs.
## User-facing API
- `sqlContext.streamFrom` and `df.streamTo` return builder objects that are
analogous to the `read/write` interfaces already available to executing queries
in a batch-oriented fashion.
- `ContinuousQuery` provides an interface for interacting with a query that
is currently executing in the background.
## Internal Interfaces
- `StreamExecution` - executes streaming queries in micro-batches
The following are currently internal, but public APIs will be provided in a
future release.
- `Source` - an interface for providers of continually arriving data. A
source must have a notion of an `Offset` that monotonically tracks what data
has arrived. For fault tolerance, a source must be able to replay data given a
start offset.
- `Sink` - an interface that accepts the results of a continuously
executing query. Also responsible for tracking the offset that should be
resumed from in the case of a failure.
## Testing
- `MemoryStream` and `MemorySink` - simple implementations of source and
sink that keep all data in memory and have methods for simulating durability
failures
- `StreamTest` - a framework for performing actions and checking
invariants on a continuous query
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/marmbrus/spark structured-streaming
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11006.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11006
----
commit e238911f494c2db41e36c9cde9dd0b733630b36f
Author: Michael Armbrust <[email protected]>
Date: 2015-12-10T02:51:06Z
first draft
commit d2706b511f254feff7d4558550e403215ffb795f
Author: Michael Armbrust <[email protected]>
Date: 2015-12-11T07:49:33Z
working on state
commit 7a3590fe5dd659d1a47e645eec247e21280a3373
Author: Michael Armbrust <[email protected]>
Date: 2015-12-11T19:26:40Z
working on stateful streaming
commit c8a923831bbfc24f71eb744e36a15e432d6ae067
Author: Michael Armbrust <[email protected]>
Date: 2015-12-12T00:34:19Z
now with event time windows
commit dddd192b7fb9efc59f22009adf61a432b5bafc19
Author: Michael Armbrust <[email protected]>
Date: 2015-12-13T03:16:48Z
some refactoring after talking to ali
commit a0a1e7bbd85d0de3f6d90793b894a8b00f86a7f0
Author: Michael Armbrust <[email protected]>
Date: 2015-12-15T00:47:55Z
docs
commit 15bed31800319547eb616c0cc22c564bf967b7f1
Author: Michael Armbrust <[email protected]>
Date: 2015-12-15T18:04:42Z
start kinesis
commit 89464a91e3954f30b68a1f633e2b9c062e08fa2c
Author: Michael Armbrust <[email protected]>
Date: 2015-12-17T02:18:35Z
some renaming
commit d133dbbed3d3ff7146129cd6532b1f26f2201315
Author: Michael Armbrust <[email protected]>
Date: 2015-12-28T00:01:57Z
WIP: file source
commit e3c4c8301fdcfaaa0bd56ed92a81e4e1d2db64a8
Author: Michael Armbrust <[email protected]>
Date: 2016-01-05T05:39:40Z
Merge remote-tracking branch 'origin/master' into streaming-infra
Conflicts:
sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala
commit 90fa6d30f0bf70a7c0c4bfe0889a07cbdddbf203
Author: Michael Armbrust <[email protected]>
Date: 2016-01-05T05:52:43Z
remove half-baked stateful implementation
commit b1c1dc6ead2c76589e48f4c28ef02e6c4361e549
Author: Michael Armbrust <[email protected]>
Date: 2016-01-05T06:36:46Z
cleanup
commit 92050688699b4c509f5ec5c2b636797cb5ae3c89
Author: Michael Armbrust <[email protected]>
Date: 2016-01-05T23:32:34Z
more docs
commit 7a59f00b4f0d238c33d3250fcfe162d7c2c27c18
Author: Josh Rosen <[email protected]>
Date: 2016-01-06T00:00:48Z
Add circle.yml file.
commit eab186d5780b2946a32041ca17d80323b052efc6
Author: Michael Armbrust <[email protected]>
Date: 2016-01-06T00:48:42Z
rollback changes
commit 20750630bd50bf7d0abad4a23b2a717705115e46
Author: Josh Rosen <[email protected]>
Date: 2016-01-06T00:56:16Z
Try using cached resolution to speed up compilation
commit dabc102271e09bae2ddca1366deb13b20c10ded3
Author: Josh Rosen <[email protected]>
Date: 2016-01-06T00:59:10Z
Use assembly/assembly.
commit 4630a87751bad6660d12645a5a7e56ea39dad48b
Author: Josh Rosen <[email protected]>
Date: 2016-01-06T01:10:02Z
Add test:compile so that test dependencies are also cached.
commit cd575db5bb4ab1e954f80a3b0d044ab492394443
Author: Michael Armbrust <[email protected]>
Date: 2016-01-06T01:10:23Z
some feedback
commit 03f7c5daeb1cc9ede78651a7722923dbfadc868c
Author: Josh Rosen <[email protected]>
Date: 2016-01-06T01:34:14Z
Disable cached resolution.
commit 0760d16cfde55382d412ad002f019ceb848a6188
Author: Josh Rosen <[email protected]>
Date: 2016-01-06T02:20:48Z
Run assembly and test in separate commands
commit 90125826d744a2fdff4743cd33e5d9ce531aab1d
Author: Michael Armbrust <[email protected]>
Date: 2016-01-06T03:44:13Z
Merge pull request #22 from marmbrus/circle-ci
Add circle.yml file for configuring CircleCI
commit d233eaed9b9ec98a8155039a46d6e8b6b8a4d6ea
Author: Michael Armbrust <[email protected]>
Date: 2016-01-06T04:41:13Z
comments
commit 3423dce332989c7997b1137d2a708fc6d75ec244
Author: Michael Armbrust <[email protected]>
Date: 2016-01-06T05:13:06Z
Merge remote-tracking branch 'marmbrus/streaming-df' into streaming-infra
commit b0b20e50d4f712763a7f6edea8af7a39edeffe33
Author: Michael Armbrust <[email protected]>
Date: 2016-01-06T06:16:16Z
Update circle.yml
commit f5d9642859a2862abfff924e700fbacf52807004
Author: Michael Armbrust <[email protected]>
Date: 2016-01-06T06:17:53Z
Update SparkBuild.scala
commit f8911e4d6a6d32044b55212ae187eb4251bdfe02
Author: Michael Armbrust <[email protected]>
Date: 2016-01-06T06:44:31Z
Update circle.yml
commit 6387781c63597680e341244d6347e9760c56c808
Author: Josh Rosen <[email protected]>
Date: 2016-01-06T07:03:01Z
Add newlines to satisfy Scalastyle
commit 6242bc2eaecae4645622604d6e0de9802b472715
Author: Michael Armbrust <[email protected]>
Date: 2016-01-06T20:01:45Z
revert CI changes
commit 95fd97807968c2437f104dc3e4578abad14f9554
Author: Michael Armbrust <[email protected]>
Date: 2016-01-06T21:38:53Z
update based on TD's comments
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]