[
https://issues.apache.org/jira/browse/SPARK-24565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tathagata Das updated SPARK-24565:
----------------------------------
Description:
Currently, the micro-batches in the MicroBatchExecution is not exposed to the
user through any public API. This was because we did not want to expose the
micro-batches, so that all the APIs we expose, we can eventually support them
in the Continuous engine. But now that we have a better sense of building a
ContinuousExecution, I am considering adding APIs which will run only the
MicroBatchExecution. I have quite a few use cases where exposing the
micro-batch output as a dataframe is useful.
- Pass the output rows of each batch to a library that is designed only the
batch jobs (example, uses many ML libraries need to collect() while learning).
- Reuse batch data sources for output whose streaming version does not exist
(e.g. redshift data source).
- Writer the output rows to multiple places by writing twice for each batch.
This is not the most elegant thing to do for multiple-output streaming queries
but is likely to be better than running two streaming queries processing the
same data twice.
The proposal is to add a method {{foreachBatch(f: Dataset[T] => Unit)}} to
Scala/Java/Python {{DataStreamWriter}}.
was:
Currently, the micro-batches in the MicroBatchExecution is not exposed to the
user through any public API. This was because we did not want to expose the
micro-batches, so that all the APIs we expose, we can eventually support them
in the Continuous engine. But now that we have a better sense of building a
ContinuousExecution, I am considering adding APIs which will run only the
MicroBatchExecution. I have quite a few use cases where exposing the
micro-batch output as a dataframe is useful.
- Pass the output rows of each batch to a library that is designed only the
batch jobs (example, uses many ML libraries need to collect() while learning).
- Reuse batch data sources for output whose streaming version does not exist
(e.g. redshift data source).
- Writer the output rows to multiple places by writing twice for each batch.
This is not the most elegant thing to do for multiple-output streaming queries
but is likely to be better than running two streaming queries processing the
same data twice.
The proposal is to add a method {{foreachBatch(f: Dataset[T] => Unit)}} to
Scala/Java/Python `DataStreamWriter`.
> Add API for in Structured Streaming for exposing output rows of each
> microbatch as a DataFrame
> ----------------------------------------------------------------------------------------------
>
> Key: SPARK-24565
> URL: https://issues.apache.org/jira/browse/SPARK-24565
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 2.3.0
> Reporter: Tathagata Das
> Assignee: Tathagata Das
> Priority: Major
>
> Currently, the micro-batches in the MicroBatchExecution is not exposed to the
> user through any public API. This was because we did not want to expose the
> micro-batches, so that all the APIs we expose, we can eventually support them
> in the Continuous engine. But now that we have a better sense of building a
> ContinuousExecution, I am considering adding APIs which will run only the
> MicroBatchExecution. I have quite a few use cases where exposing the
> micro-batch output as a dataframe is useful.
> - Pass the output rows of each batch to a library that is designed only the
> batch jobs (example, uses many ML libraries need to collect() while learning).
> - Reuse batch data sources for output whose streaming version does not exist
> (e.g. redshift data source).
> - Writer the output rows to multiple places by writing twice for each batch.
> This is not the most elegant thing to do for multiple-output streaming
> queries but is likely to be better than running two streaming queries
> processing the same data twice.
> The proposal is to add a method {{foreachBatch(f: Dataset[T] => Unit)}} to
> Scala/Java/Python {{DataStreamWriter}}.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]