Hi guys,

As you probably know, work has been under way for several months on a new Spark runner based on the Spark Structured Streaming framework. This work lives in a feature branch here: https://github.com/apache/beam/tree/spark-runner_structured-streaming

To attract more contributors and get some user feedback, we think it is time to merge it to master. Before doing so, two steps need to be completed:

- Finish the work on Spark Encoders (which allow calling Beam coders). Right now the runner is in an unstable state: some transforms use the new way of doing ser/de and some use the old one, which makes a pipeline inconsistent with respect to serialization.

- Clean up the history: it contains commits going back to November 2018, so there is a good amount of work and therefore a substantial number of commits. They have already been squashed, but not those since September 2019.

Regarding status:

- The runner passes 89% of the ValidatesRunner tests in batch mode. We hope to pass more with the new Encoders.

- Streaming mode is barely started (we are waiting for multi-aggregation support in the Spark Structured Streaming framework from the Spark community).

- The runner can execute Nexmark.

- Some things are not wired up yet:

    - Beam Schemas are not wired to Spark Schemas

    - Optional features of the model are not implemented: State API, Timer API, splittable DoFn API, …

WDYT, can we merge it to master once these two steps are done?

Best

Etienne
