For clarification, is it just streaming side inputs that present an issue for 
SparkRunner, or are there other areas that need work? We've started work on a 
Beam-based project that includes both streaming and batch-oriented workloads, 
and we chose a Spark cluster on the perception that it could handle both 
types of applications.

However, that would have to be reevaluated if SparkRunner isn't up for 
streaming deployments.  And it seems that SparkStructuredStreamingRunner still 
needs some time before it's a fully featured solution.  I guess I'm trying to 
get a sense of whether these runners are still being actively developed, or 
whether they were donated by a third party and are now suffering from bit rot.

Oct 1, 2020, 10:54 by [email protected]:

> I would suggest trying FlinkRunner, as it is a much more complete streaming 
> implementation.
> SparkRunner is missing several key features, which won't allow your 
> pipeline to function correctly.
> If you're really invested in getting SparkRunner working, though, feel free 
> to contribute the necessary implementations of watermark holds and of the 
> broadcast state necessary for side inputs.
>
> On Tue, Sep 29, 2020 at 9:06 AM Rajagopal, Viswanathan <[email protected]> 
> wrote:
>
>>
>> Hi Team,
>>
>> I have a streaming pipeline (built using Apache Beam with the Spark 
>> Runner) which consumes events tagged with timestamps from an unbounded 
>> source (a Kinesis stream), batches them into FixedWindows of 5 minutes 
>> each, and then writes all events in a window into a single file or 
>> multiple files, based on shards.
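For context on the windowing step described above: a 5-minute FixedWindows fn partitions event time into aligned five-minute intervals. A minimal sketch using only the Beam Java SDK core (no runner needed; the class name is made up for illustration):

```java
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class FixedWindowDemo {
    public static void main(String[] args) {
        // A 5-minute fixed-window fn, as used in the pipeline above.
        FixedWindows windowFn = FixedWindows.of(Duration.standardMinutes(5));

        // An event timestamped 12:07:30 lands in the [12:05, 12:10) window.
        IntervalWindow w = windowFn.assignWindow(Instant.parse("2020-09-29T12:07:30Z"));
        System.out.println(w.start() + " .. " + w.end());
    }
}
```

Everything consumed between 12:05:00 (inclusive) and 12:10:00 (exclusive) would then be written out together once that window closes.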
>>
>> We are trying to achieve the following through Apache Beam constructs:
>>
>> 1. Create a PCollectionView from the unbounded source and pass it as
>>    a side input to our main pipeline.
>>
>> 2. Have a hook method that is invoked per window, enabling us to
>>    do some operational activities per window.
>>
>> 3. Stop the stream processor (a graceful stop) from an external
>>    system.
>>
>> Approaches that we tried for 1):
>>
>> - Creating a PCollectionView from the unbounded source and passing it
>>   as a side input to our main pipeline.
>>
>> - The input PCollection goes through a FixedWindows transform.
>>
>> - Created a custom CombineFn that combines all inputs for a window and
>>   produces a single-value PCollection.
>>
>> - The output of the Window transform goes to the custom CombineFn, and
>>   a PCollectionView is created from it (using
>>   Combine.globally(...).asSingletonView()), as this output would be
>>   passed as a side input to our main pipeline.
>>
>>   - We get the following exception (while running with the streaming
>>     option set to true):
>>
>>     java.lang.IllegalStateException: No TransformEvaluator registered
>>     for UNBOUNDED transform View.CreatePCollectionView
>>
>> We noticed that SparkRunner doesn't support streaming side inputs
>> (View.CreatePCollectionView.class is not added to the EVALUATORS map):
>> https://www.javatips.net/api/beam-master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/streaming/StreamingTransformTranslator.java
>> https://issues.apache.org/jira/browse/BEAM-2112
>> https://issues.apache.org/jira/browse/BEAM-1564
>>
>> So we would like to understand the status of the BEAM-1564 ticket.
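The side-input construction described above can be sketched roughly as follows. The class and method names are hypothetical, and the combiner body (keeping the lexicographically latest string) merely stands in for whatever the real single-value CombineFn does:

```java
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Combine.CombineFn;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

public class SideInputSketch {
    // Placeholder combiner: keeps the lexicographically latest string seen
    // in the window (stands in for the real single-value CombineFn).
    public static class LatestFn extends CombineFn<String, String, String> {
        @Override public String createAccumulator() { return ""; }
        @Override public String addInput(String acc, String input) {
            return input.compareTo(acc) > 0 ? input : acc;
        }
        @Override public String mergeAccumulators(Iterable<String> accs) {
            String out = "";
            for (String a : accs) { out = a.compareTo(out) > 0 ? a : out; }
            return out;
        }
        @Override public String extractOutput(String acc) { return acc; }
    }

    // Window the unbounded stream, combine each window to one value, and
    // expose the result as a singleton side-input view.
    public static PCollectionView<String> asSideInput(PCollection<String> stream) {
        return stream
            .apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))))
            .apply(Combine.globally(new LatestFn()).asSingletonView());
    }
}
```

On SparkRunner in streaming mode this is exactly the shape that fails, since the `View.CreatePCollectionView` step produced by `asSingletonView()` has no registered `TransformEvaluator`.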
>>
>> Approaches that we tried for 2):
>>
>> We tried implementing the operational activities in extractOutput() of
>> the CombineFn, since extractOutput() is called once per window. We
>> haven't tested this, as it is blocked by issue 1).
>>
>> Are there any other recommended approaches to implementing this feature?
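One alternative worth considering (a sketch, not a confirmed recommendation): since a per-window combine emits exactly one element per window, a DoFn placed immediately downstream runs once per window and can host the operational hook. The class and helper names here are made up:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;

public class PerWindowHookFn extends DoFn<String, String> {

    // Pure helper so the hook's message can be checked without a runner.
    static String hookMessage(BoundedWindow window) {
        return "window closed: " + window.maxTimestamp();
    }

    @ProcessElement
    public void process(ProcessContext c, BoundedWindow window) {
        // Runs once per window because the upstream combine emits a single
        // element per window; the window is injected as a parameter.
        System.out.println(hookMessage(window));
        // ... operational activities would go here ...
        c.output(c.element());
    }
}
```

This keeps the CombineFn purely about combining values and moves side effects out of extractOutput(), which may be called more than once by some runners.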
>>
>> We are also looking for recommended approaches to implement feature 3).
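For 3), one common pattern is to keep the PipelineResult handle returned by Pipeline.run() and call cancel() when the external system signals a stop. This is a sketch under the assumption that the external signal can reach the driver process; a JVM shutdown hook stands in for the real trigger:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;

public class GracefulStop {
    public static void runUntilStopped(Pipeline pipeline) {
        // Keep the handle to the running job.
        PipelineResult result = pipeline.run();

        // Hypothetical trigger: a shutdown hook standing in for the external
        // system's stop request (it could equally be an HTTP endpoint, a
        // control topic, etc.).
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                result.cancel();  // ask the runner to stop the job
            } catch (java.io.IOException e) {
                throw new RuntimeException(e);
            }
        }));

        // Block until the job reaches a terminal state.
        result.waitUntilFinish();
    }
}
```

Note that cancel() is a runner-level cancel rather than a drain; how "graceful" the stop is (e.g. whether in-flight windows are flushed) depends on the runner's implementation.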
>>
>> Many Thanks,
>> Viswa.
