Hi guys, regarding the new (written from scratch) Spark runner POC based on the Dataset API, I was able to make a big step forward: it can now run a first batch pipeline with a source!
See https://github.com/apache/beam/blob/spark-runner_structured-streaming/runners/spark-structured-streaming/src/test/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/SourceTest.java — there are no test facilities for now; test mode is enabled and it just prints the output PCollection.

I made some workarounds, in particular String serialization to pass Beam objects around (I was forced to), and bytes serialization to comply with Spark Dataset schemas when storing WindowedValues. Can you give me your thoughts, especially on these last two points? The other parts are not ready to show yet.

Here is the whole branch: https://github.com/apache/beam/blob/spark-runner_structured-streaming/runners/spark-structured-streaming

Thanks,
Etienne
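P.S. To make the two workarounds concrete, here is a minimal, self-contained sketch of the idea (the class and method names are illustrative, not the runner's actual code): an object is serialized to a byte[] so it can live in a fixed binary Dataset column (Spark's Encoders.BINARY()), and to a Base64 String so it can be passed through APIs that only accept Strings. The real runner would use Beam coders rather than plain Java serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

// Illustrative sketch of the two serialization workarounds; in the POC the
// payload would be a WindowedValue encoded with its Beam coder instead of
// Java serialization.
public class SerializationWorkarounds {

  // Serialize any Serializable object to raw bytes, e.g. to store it in a
  // binary Dataset column so the Spark schema stays fixed.
  static byte[] toBytes(Serializable obj) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(obj);
    }
    return bos.toByteArray();
  }

  // Encode the serialized bytes as a String, e.g. to pass a Beam object
  // through an API that only accepts Strings.
  static String toString(Serializable obj) throws IOException {
    return Base64.getEncoder().encodeToString(toBytes(obj));
  }

  // Reverse of toString(): decode the Base64 String and deserialize.
  @SuppressWarnings("unchecked")
  static <T> T fromString(String s) throws IOException, ClassNotFoundException {
    byte[] bytes = Base64.getDecoder().decode(s);
    try (ObjectInputStream ois =
        new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return (T) ois.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    String original = "windowed-value";
    String encoded = toString(original);
    String roundTripped = fromString(encoded);
    System.out.println(roundTripped.equals(original)); // prints "true"
  }
}
```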