Hi Xinyu,

Great to hear that you wish to contribute to the new Spark runner! We have been 
holding sync meetings about all the Spark runners in general every two weeks, so 
feel free to let us know if you would like to participate too.

Also, as one of the contributors to the Structured Streaming Spark runner (yes, we 
need to find a shorter name for it =), I agree that it’s a good time to 
merge it into master (even if it’s not 100% ready). Then we can create a 
roadmap with Jira tasks and push code in the normal PR-based way, so it will be 
easier to discover new changes and track the work progress. We only need to 
warn users that it’s still under development and not ready to use in 
production.
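
To make that warning hard to miss, one option is to log it whenever a pipeline 
is submitted. Here is a minimal, hypothetical sketch (the class name and the 
exact message are mine, not necessarily what is on the branch):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.PipelineRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SparkStructuredStreamingRunner extends PipelineRunner<PipelineResult> {

  private static final Logger LOG =
      LoggerFactory.getLogger(SparkStructuredStreamingRunner.class);

  @Override
  public PipelineResult run(Pipeline pipeline) {
    // Warn on every submission so users cannot miss the experimental status.
    LOG.warn(
        "The Spark Structured Streaming runner is still experimental and is not"
            + " intended for production use yet.");
    // ... translation and execution of the pipeline would go here ...
    throw new UnsupportedOperationException("Sketch only, translation not shown.");
  }
}

Marking the public classes of the runner as experimental in the javadoc would 
serve the same purpose at the API level.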

The schema part is still quite vague, so it’s a good topic for a separate 
discussion. I believe it would be much more effective in terms of performance 
if we are able to have a strong relation between Beam and Spark schemas in 
the end. 
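
To make the schema idea concrete, here is a minimal, hypothetical sketch of a 
one-way mapping from a Beam Schema to a Spark StructType. The class name 
BeamToSparkSchema and the limited set of covered field types are my own 
illustration, not the code on the branch:

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class BeamToSparkSchema {

  /** Converts a Beam schema into an equivalent Spark SQL schema. */
  public static StructType toSparkSchema(Schema beamSchema) {
    List<StructField> fields = new ArrayList<>();
    for (Schema.Field field : beamSchema.getFields()) {
      fields.add(
          DataTypes.createStructField(
              field.getName(),
              toSparkType(field.getType()),
              field.getType().getNullable()));
    }
    return DataTypes.createStructType(fields);
  }

  private static DataType toSparkType(Schema.FieldType fieldType) {
    // Only a handful of primitive types are covered here; nested rows, arrays,
    // maps and logical types would need their own cases.
    switch (fieldType.getTypeName()) {
      case STRING:
        return DataTypes.StringType;
      case INT32:
        return DataTypes.IntegerType;
      case INT64:
        return DataTypes.LongType;
      case DOUBLE:
        return DataTypes.DoubleType;
      case BOOLEAN:
        return DataTypes.BooleanType;
      default:
        throw new UnsupportedOperationException(
            "Field type not covered in this sketch: " + fieldType);
    }
  }
}

Something along these lines, extended to nested rows, collections and logical 
types, would let Spark's planner see typed columns instead of opaque binary 
data, which is where I would expect the performance gain to come from.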

> On 13 Sep 2019, at 21:16, Xinyu Liu <[email protected]> wrote:
> 
> Hi, Etienne,
> 
> The slides are very informative! Thanks for sharing the details about how the 
> Beam API is mapped onto Spark Structured Streaming. We (LinkedIn) are also 
> interested in trying the new SparkRunner to run Beam pipelines in batch, and in 
> contributing to it too. From my understanding, it seems the functionality on the 
> batch side is mostly complete and covers quite a large percentage of the tests (a 
> few missing pieces like state and timers in ParDo, and SDF). If so, is it 
> possible to merge the new runner into master sooner, so it's much easier for 
> us to pull it in (we have an internal fork) and contribute back?
> 
> Also curious about the schema part in the runner. It seems we can leverage the 
> schema-aware work in PCollection and translate from Beam schemas to Spark, so 
> it can be optimized in the planner layer. It would be great to hear your 
> plans on that.
> 
> Congrats on this great work!
> Thanks,
> Xinyu
> 
> On Wed, Sep 11, 2019 at 6:02 PM Rui Wang <[email protected]> wrote:
> Hello Etienne,
> 
> Your slides mentioned that streaming mode development is blocked because Spark 
> lacks support for multiple aggregations in its streaming mode, but that a design 
> is ongoing. Do you have a link to their design discussion/doc?
> 
> 
> -Rui  
> 
> On Wed, Sep 11, 2019 at 5:10 PM Etienne Chauchot <[email protected]> wrote:
> Hi Rahul,
> Sure, and great! Thanks for proposing!
> If you want details, here is the presentation I gave 30 minutes ago at 
> ApacheCon. You will find the video on YouTube shortly, but in the meantime, 
> here are my presentation slides.
> 
> And here is the structured streaming branch. I'll be happy to review your 
> PRs, thanks!
> 
> https://github.com/apache/beam/tree/spark-runner_structured-streaming
> 
> Best
> Etienne
> 
> On Wednesday, 11 September 2019 at 16:37 +0530, rahul patwari wrote:
>> Hi Etienne,
>> 
>> I came to know about the work going on in the Structured Streaming Spark Runner 
>> from the Apache Beam Wiki - Works in Progress.
>> I have contributed to BeamSql before, and I am currently working on supporting 
>> PCollectionView in BeamSql.
>> 
>> I would love to understand the runner side of Apache Beam and contribute 
>> to the Structured Streaming Spark Runner.
>> 
>> Can you please point me in the right direction?
>> 
>> Thanks,
>> Rahul
