GitHub user tdas opened a pull request:
https://github.com/apache/spark/pull/21571
[WIP][SPARK-24565][SS] Add API in Structured Streaming for exposing
output rows of each microbatch as a DataFrame
## What changes were proposed in this pull request?
Currently, the micro-batches in MicroBatchExecution are not exposed to
the user through any public API. This was deliberate: we did not want to expose
micro-batches, so that every API we expose could eventually be supported
in the Continuous engine. But now that we have a better sense of building a
ContinuousExecution, I am considering adding APIs that will run only in
MicroBatchExecution. I have quite a few use cases where exposing the micro-batch
output as a DataFrame is useful.
- Pass the output rows of each batch to a library that is designed only for
batch jobs (for example, many ML libraries need to collect() while learning).
- Reuse batch data sources for output whose streaming version does not
exist (e.g. the Redshift data source).
- Write the output rows to multiple places by writing twice for each
batch. This is not the most elegant approach for multiple-output streaming
queries, but it is likely to be better than running two streaming queries
processing the same data twice.
The proposal is to add a method `foreachBatch(f: Dataset[T] => Unit)` to
Scala/Java/Python `DataStreamWriter`.
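
As a rough illustration of the "write twice per batch" use case, here is a minimal PySpark sketch. It assumes the callback form found in released PySpark, where the function receives the batch DataFrame plus a batch id; that differs from the one-argument signature written above, which may still change while this PR is WIP. The output paths and the helper names `write_batch_twice` and `start_query` are hypothetical.

```python
# Hypothetical output locations, chosen only for this sketch.
PARQUET_PATH = "/tmp/foreach_batch_demo/parquet"
JSON_PATH = "/tmp/foreach_batch_demo/json"


def write_batch_twice(batch_df, batch_id):
    """Write one micro-batch to two places using ordinary batch writers.

    batch_df is a regular (non-streaming) DataFrame holding the output rows
    of a single micro-batch; batch_id is unused here.
    """
    # Cache the batch so the second write does not recompute it.
    batch_df.persist()
    try:
        batch_df.write.mode("append").parquet(PARQUET_PATH)
        batch_df.write.mode("append").json(JSON_PATH)
    finally:
        # Release the cached data even if one of the writes fails.
        batch_df.unpersist()


def start_query(stream_df):
    """Attach the per-batch function to a streaming DataFrame and start it."""
    return stream_df.writeStream.foreachBatch(write_batch_twice).start()
```

With a streaming DataFrame in hand (for example one read from the built-in `rate` source), `start_query(stream_df)` would return the running query. Caching the batch before the two writes is a design choice for this sketch, not something the patch requires.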
## How was this patch tested?
New unit tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tdas/spark foreachBatch
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21571.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21571
----
commit 21acc73810a5385f04ce540d0f8b1a575aea5da6
Author: Tathagata Das <tathagata.das1565@...>
Date: 2018-06-08T04:13:19Z
First cut of Scala foreachBatch
commit 4ac056e87fee94ed0b2742cc1e8644efc998a3ea
Author: Tathagata Das <tathagata.das1565@...>
Date: 2018-06-08T09:27:36Z
Added python support
commit 3b7b20d5c1b662abd935c0a812d9e54f8ab01b24
Author: Tathagata Das <tathagata.das1565@...>
Date: 2018-06-08T09:36:49Z
Merge remote-tracking branch 'apache-github/master' into foreachBatch
----