Hi Kenn,

Sure. In Spark DataSourceV2, providing a schema is mandatory:
- if I set it to null, I obviously get an NPE
- if I set it to an empty schema, I get an array-out-of-bounds exception
- if I set it to DataTypes.NullType, null is stored as the actual elements
=> Consequently I set it to binary.

As the Beam reader is responsible for reading both the element and its timestamp, the source outputs a Dataset<WindowedValue>. So the solution I found, for which I asked your opinion, is to serialize the WindowedValue to bytes using Beam's FullWindowedValueCoder in reader.get(), and, once the source is done, deserialize the whole Dataset with a map to get the WindowedValue back and hand it to the downstream transforms. A sketch of this round trip is below.

I am aware that this is not optimal because of the bytes serialization round trip, and I wanted your suggestions around that.

Thanks
Etienne
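To make the round trip concrete, here is a minimal sketch of what I mean. It is not the actual runner code: the class name, the "windowedValueBytes" column name, the use of GlobalWindow.Coder and of Encoders.kryo are just placeholders/simplifications for illustration.

import java.io.Serializable;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.beam.sdk.util.WindowedValue;
import org.apache.beam.sdk.util.WindowedValue.FullWindowedValueCoder;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

/** Sketch of the bytes round trip between the DataSourceV2 reader and the transforms downstream. */
class WindowedValueBytesRoundTrip<T> implements Serializable {

  // Schema declared to Spark by the source: a single binary column holding the encoded WindowedValue.
  static final StructType BINARY_SCHEMA =
      new StructType().add("windowedValueBytes", DataTypes.BinaryType);

  private final FullWindowedValueCoder<T> windowedValueCoder;

  WindowedValueBytesRoundTrip(Coder<T> valueCoder) {
    // GlobalWindow.Coder only to keep the sketch short; the real window coder
    // comes from the pipeline's WindowingStrategy.
    this.windowedValueCoder = FullWindowedValueCoder.of(valueCoder, GlobalWindow.Coder.INSTANCE);
  }

  /** What reader.get() emits: element + timestamp/window/pane encoded to bytes. */
  byte[] encode(WindowedValue<T> windowedValue) throws Exception {
    return CoderUtils.encodeToByteArray(windowedValueCoder, windowedValue);
  }

  /** The map applied once the source is done, to get the WindowedValues back for the transforms downstream. */
  Dataset<WindowedValue<T>> decode(Dataset<Row> binaryDataset) {
    FullWindowedValueCoder<T> coder = windowedValueCoder; // avoid capturing `this` in the lambda
    @SuppressWarnings("unchecked")
    Class<WindowedValue<T>> clazz = (Class<WindowedValue<T>>) (Class<?>) WindowedValue.class;
    return binaryDataset.map(
        (MapFunction<Row, WindowedValue<T>>)
            row -> {
              byte[] bytes = row.getAs("windowedValueBytes");
              return CoderUtils.decodeFromByteArray(coder, bytes);
            },
        Encoders.kryo(clazz));
  }
}

So the reader would return rows matching BINARY_SCHEMA (one binary column per element), and decode(...) would be the map I mentioned, applied right after the source produces the Dataset.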
On Wednesday, January 16, 2019 at 19:04 -0800, Kenneth Knowles wrote:
> Cool!
> I don't quite understand the issue in "bytes serialization to comply with Spark Dataset schemas to store WindowedValues". Can you say a little more?
>
> Kenn
>
> On Tue, Jan 15, 2019 at 8:54 AM Etienne Chauchot <[email protected]> wrote:
> > Hi guys,
> >
> > Regarding the new (made from scratch) Spark runner POC based on the Dataset API, I was able to make a big step forward: it can now run a first batch pipeline with a source!
> >
> > See
> > https://github.com/apache/beam/blob/spark-runner_structured-streaming/runners/spark-structured-streaming/src/test/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/SourceTest.java
> >
> > There are no test facilities for now; test mode is enabled and it just prints the output PCollection.
> >
> > I made some workarounds, especially String serialization to pass Beam objects (I was forced to) and also bytes serialization to comply with Spark Dataset schemas to store WindowedValues.
> >
> > Can you give me your thoughts, especially regarding these last two matters?
> >
> > The other parts are not ready for showing yet.
> >
> > Here is the whole branch:
> >
> > https://github.com/apache/beam/blob/spark-runner_structured-streaming/runners/spark-structured-streaming
> >
> > Thanks,
> >
> > Etienne
