[jira] [Commented] (BEAM-8470) Create a new Spark runner based on Spark Structured streaming framework

Yun Tang (Jira) Mon, 15 Nov 2021 01:16:39 -0800


    [ 
https://issues.apache.org/jira/browse/BEAM-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443663#comment-17443663
 ]


Yun Tang commented on BEAM-8470:
--------------------------------

It seems the new SparkStructuredStreamingRunner still does not support 
streaming mode, does it mean we cannot run nexmark with spark structed 
streaming? [~echauchot]

> Create a new Spark runner based on Spark Structured streaming framework
> -----------------------------------------------------------------------
>
>                 Key: BEAM-8470
>                 URL: https://issues.apache.org/jira/browse/BEAM-8470
>             Project: Beam
>          Issue Type: New Feature
>          Components: runner-spark
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
>            Priority: P2
>              Labels: structured-streaming
>             Fix For: 2.18.0
>
>          Time Spent: 16h
>  Remaining Estimate: 0h
>
> h1. Why is it worth creating a new runner based on structured streaming:
> Because this new framework brings:
>  * Unified batch and streaming semantics:
>  * no more RDD/DStream distinction, as in Beam (only PCollection)
>  * Better state management:
>  * incremental state instead of saving all each time
>  * No more synchronous saving delaying computation: per batch and partition 
> delta file saved asynchronously + in-memory hashmap synchronous put/get
>  * Schemas in datasets:
>  * The dataset knows the structure of the data (fields) and can optimize 
> later on
>  * Schemas in PCollection in Beam
>  * New Source API
>  * Very close to Beam bounded source and unbounded sources
> h1. Why make a new runner from scratch?
>  * Structured streaming framework is very different from the RDD/Dstream 
> framework
> h1. We hope to gain
>  * More up to date runner in terms of libraries: leverage new features
>  * Leverage learnt practices from the previous runners
>  * Better performance thanks to the DAG optimizer (catalyst) and by 
> simplifying the code.
>  * Simplify the code and ease the maintenance
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Commented] (BEAM-8470) Create a new Spark runner based on Spark Structured streaming framework

Reply via email to