Hi Kenn,

Sure. In Spark DataSourceV2, providing a schema is mandatory:
- if I set it to null, I obviously get an NPE
- if I set it to an empty schema, I get an array-out-of-bounds exception
- if I set it to DataTypes.NullType, null is stored as the actual elements
=> Consequently I set it to binary.

As the Beam reader is responsible for reading both the element and its timestamp, the source outputs a Dataset<WindowedValue>. So the solution I found, for which I asked your opinion, is to serialize the WindowedValue to bytes using Beam's FullWindowedValueCoder in reader.get(), and, once the source is done, deserialize the whole Dataset with a map to get the WindowedValue back and hand it to the downstream transforms. A sketch of this round trip is below.

I am aware that this is not optimal because of the bytes serialization round trip, and I wanted your suggestions around that.

Thanks
Etienne
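To make the round trip concrete, here is a minimal sketch of what I mean. It is not the actual runner code: the class name, the "windowedValueBytes" column name, the use of GlobalWindow.Coder and of Encoders.kryo are just placeholders/simplifications for illustration.

import java.io.Serializable;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.beam.sdk.util.WindowedValue;
import org.apache.beam.sdk.util.WindowedValue.FullWindowedValueCoder;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

/** Sketch of the bytes round trip between the DataSourceV2 reader and the transforms downstream. */
class WindowedValueBytesRoundTrip<T> implements Serializable {

  // Schema declared to Spark by the source: a single binary column holding the encoded WindowedValue.
  static final StructType BINARY_SCHEMA =
      new StructType().add("windowedValueBytes", DataTypes.BinaryType);

  private final FullWindowedValueCoder<T> windowedValueCoder;

  WindowedValueBytesRoundTrip(Coder<T> valueCoder) {
    // GlobalWindow.Coder only to keep the sketch short; the real window coder
    // comes from the pipeline's WindowingStrategy.
    this.windowedValueCoder = FullWindowedValueCoder.of(valueCoder, GlobalWindow.Coder.INSTANCE);
  }

  /** What reader.get() emits: element + timestamp/window/pane encoded to bytes. */
  byte[] encode(WindowedValue<T> windowedValue) throws Exception {
    return CoderUtils.encodeToByteArray(windowedValueCoder, windowedValue);
  }

  /** The map applied once the source is done, to get the WindowedValues back for the transforms downstream. */
  Dataset<WindowedValue<T>> decode(Dataset<Row> binaryDataset) {
    FullWindowedValueCoder<T> coder = windowedValueCoder; // avoid capturing `this` in the lambda
    @SuppressWarnings("unchecked")
    Class<WindowedValue<T>> clazz = (Class<WindowedValue<T>>) (Class<?>) WindowedValue.class;
    return binaryDataset.map(
        (MapFunction<Row, WindowedValue<T>>)
            row -> {
              byte[] bytes = row.getAs("windowedValueBytes");
              return CoderUtils.decodeFromByteArray(coder, bytes);
            },
        Encoders.kryo(clazz));
  }
}

So the reader would return rows matching BINARY_SCHEMA (one binary column per element), and decode(...) would be the map I mentioned, applied right after the source produces the Dataset.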
On Wednesday, January 16, 2019 at 19:04 -0800, Kenneth Knowles wrote:
> Cool!
> I don't quite understand the issue in "bytes serialization to comply with Spark Dataset schemas to store WindowedValues". Can you say a little more?
>
> Kenn
>
> On Tue, Jan 15, 2019 at 8:54 AM Etienne Chauchot <[email protected]> wrote:
> > Hi guys,
> >
> > Regarding the new (made from scratch) Spark runner POC based on the Dataset API, I was able to make a big step forward: it can now run a first batch pipeline with a source!
> >
> > See
> > https://github.com/apache/beam/blob/spark-runner_structured-streaming/runners/spark-structured-streaming/src/test/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/SourceTest.java
> >
> > There are no test facilities for now; test mode is enabled and it just prints the output PCollection.
> >
> > I made some workarounds, especially String serialization to pass Beam objects (I was forced to) and also bytes serialization to comply with Spark Dataset schemas to store WindowedValues.
> >
> > Can you give me your thoughts, especially regarding these last two matters?
> >
> > The other parts are not ready for showing yet.
> >
> > Here is the whole branch:
> >
> > https://github.com/apache/beam/blob/spark-runner_structured-streaming/runners/spark-structured-streaming
> >
> > Thanks,
> >
> > Etienne
