Hi guys, regarding the new (written from scratch) Spark runner POC based on the Dataset API, I was able to make a big step forward: it can now run a first batch pipeline with a source!
See https://github.com/apache/beam/blob/spark-runner_structured-streaming/runners/spark-structured-streaming/src/test/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/SourceTest.java — there are no test facilities for now; test mode is enabled and it just prints the output PCollection.

I made some workarounds, in particular String serialization to pass Beam objects around (I was forced to), and bytes serialization to comply with Spark Dataset schemas when storing WindowedValues. Can you give me your thoughts, especially on these last two points? The other parts are not ready to show yet.

Here is the whole branch: https://github.com/apache/beam/blob/spark-runner_structured-streaming/runners/spark-structured-streaming

Thanks,
Etienne
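P.S. To make the two workarounds concrete, here is a minimal, self-contained sketch of the idea (the class and method names are illustrative, not the runner's actual code): an object is serialized to a byte[] so it can live in a fixed binary Dataset column (Spark's Encoders.BINARY()), and to a Base64 String so it can be passed through APIs that only accept Strings. The real runner would use Beam coders rather than plain Java serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

// Illustrative sketch of the two serialization workarounds; in the POC the
// payload would be a WindowedValue encoded with its Beam coder instead of
// Java serialization.
public class SerializationWorkarounds {

  // Serialize any Serializable object to raw bytes, e.g. to store it in a
  // binary Dataset column so the Spark schema stays fixed.
  static byte[] toBytes(Serializable obj) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(obj);
    }
    return bos.toByteArray();
  }

  // Encode the serialized bytes as a String, e.g. to pass a Beam object
  // through an API that only accepts Strings.
  static String toString(Serializable obj) throws IOException {
    return Base64.getEncoder().encodeToString(toBytes(obj));
  }

  // Reverse of toString(): decode the Base64 String and deserialize.
  @SuppressWarnings("unchecked")
  static <T> T fromString(String s) throws IOException, ClassNotFoundException {
    byte[] bytes = Base64.getDecoder().decode(s);
    try (ObjectInputStream ois =
        new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return (T) ois.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    String original = "windowed-value";
    String encoded = toString(original);
    String roundTripped = fromString(encoded);
    System.out.println(roundTripped.equals(original)); // prints "true"
  }
}
```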