GitHub user tdas opened a pull request:

    https://github.com/apache/spark/pull/21571

    [WIP][SPARK-24565][SS] Add API in Structured Streaming for exposing 
output rows of each microbatch as a DataFrame

    ## What changes were proposed in this pull request?
    
    Currently, the micro-batches in MicroBatchExecution are not exposed to 
the user through any public API. This was because we wanted every API we 
expose to be one that we can eventually support in the Continuous engine. 
But now that we have a better sense of building a ContinuousExecution, I am 
considering adding APIs that will run only in MicroBatchExecution. I have 
quite a few use cases where exposing the microbatch output as a DataFrame is 
useful. 
    - Pass the output rows of each batch to a library that is designed only 
for batch jobs (for example, many ML libraries need to collect() while 
learning).
    - Reuse batch data sources for output when a streaming version does not 
exist (e.g. the Redshift data source).
    - Write the output rows to multiple places by writing twice for each 
batch. This is not the most elegant thing to do for multiple-output streaming 
queries but is likely to be better than running two streaming queries 
processing the same data twice.
    
    The proposal is to add a method `foreachBatch(f: Dataset[T] => Unit)` to 
Scala/Java/Python `DataStreamWriter`.
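
    As a rough illustration of the third use case (writing each batch to two 
places), here is a minimal sketch. It is not taken from this patch. The 
signature proposed above takes only the Dataset, whereas the API as released 
in Spark 2.4 also passes a batch ID, and the sketch uses that released form. 
The object name, the `rate` test source, and the `/tmp/...` output paths are 
assumptions chosen purely for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ForeachBatchSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("foreachBatch-sketch")
      .getOrCreate()

    // Built-in "rate" test source: emits (timestamp, value) rows.
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    // Declared with an explicit function type so that the Scala overload
    // of foreachBatch (rather than the Java one) is selected unambiguously.
    val writeTwice: (DataFrame, Long) => Unit = (batch, batchId) => {
      // Cache the batch so the second write does not recompute it.
      batch.persist()
      // Hypothetical output locations; any batch data source could be used.
      batch.write.mode("append").parquet("/tmp/foreachBatch-sketch/parquet")
      batch.write.mode("append").json("/tmp/foreachBatch-sketch/json")
      batch.unpersist()
    }

    val query = stream.writeStream
      .foreachBatch(writeTwice)
      .start()

    // Let the query run for about ten seconds, then shut down.
    query.awaitTermination(10000)
    query.stop()
    spark.stop()
  }
}
```

    Persisting the batch before the two writes is a design choice, not a 
requirement: without it, each write would re-evaluate the batch's plan, which 
is the duplicated processing this use case is trying to avoid.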
    
    ## How was this patch tested?
    New unit tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tdas/spark foreachBatch

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21571.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21571
    
----
commit 21acc73810a5385f04ce540d0f8b1a575aea5da6
Author: Tathagata Das <tathagata.das1565@...>
Date:   2018-06-08T04:13:19Z

    First cut of Scala foreachBatch

commit 4ac056e87fee94ed0b2742cc1e8644efc998a3ea
Author: Tathagata Das <tathagata.das1565@...>
Date:   2018-06-08T09:27:36Z

    Added python support

commit 3b7b20d5c1b662abd935c0a812d9e54f8ab01b24
Author: Tathagata Das <tathagata.das1565@...>
Date:   2018-06-08T09:36:49Z

    Merge remote-tracking branch 'apache-github/master' into foreachBatch

----

