Hi,

my target is to have Spark 2.x support in Beam 2.3.0.

Regards
JB

On 11/13/2017 12:22 PM, Nishu wrote:
Hi Jean,

Thanks for your response. So when can we expect Spark 2.x support in the
Spark runner?

Thanks,
Nishu

On Mon, Nov 13, 2017 at 11:53 AM, Jean-Baptiste Onofré <[email protected]>
wrote:

Hi,

Regarding your question:

1. Not yet, but as you might have seen on the mailing list, we have a PR
open for Spark 2.x support.

2. Additional triggers are supported, with more in progress. GroupByKey and
accumulating panes are also supported.

3. No, I made a change that lets you define the default storage level via
the pipeline options. The runner also automatically determines when to
persist an RDD by analyzing the DAG. See the sketch after this list.

4. Yes, checkpointing is supported; the sketch below covers its option too.
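
To illustrate points 3 and 4, here is a minimal sketch of setting both
through SparkPipelineOptions. The storage level value and the checkpoint
path are illustrative only; check the Spark runner documentation for the
exact options available in your Beam version.

    import org.apache.beam.runners.spark.SparkPipelineOptions;
    import org.apache.beam.runners.spark.SparkRunner;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class SparkRunnerConfigSketch {
      public static void main(String[] args) {
        SparkPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
        options.setRunner(SparkRunner.class);

        // Default storage level for persisted RDDs; MEMORY_AND_DISK spills
        // to disk when an RDD does not fit in memory.
        options.setStorageLevel("MEMORY_AND_DISK");

        // Directory used for checkpointing, so a restarted streaming job
        // can recover its state (illustrative path).
        options.setCheckpointDir("hdfs:///tmp/beam-checkpoints");

        Pipeline pipeline = Pipeline.create(options);
        // ... build the pipeline here ...
        pipeline.run().waitUntilFinish();
      }
    }

The same options should also be accepted as command-line flags, e.g.
--storageLevel=MEMORY_AND_DISK and --checkpointDir=<path>.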

Regards
JB

On 11/13/2017 10:50 AM, Nishu wrote:

Hi Team,

I am writing a streaming pipeline in Apache Beam using the Spark runner.
Use case: joining multiple Kafka streams using windowed collections. I use
GroupByKey to group events on a common business key, and that output is
used as input for the join operation. The pipeline runs as expected on the
direct runner, but on a Spark cluster (v2.1) it throws the accumulator error:
*"Exception in thread "main" java.lang.AssertionError: assertion failed:
copyAndReset must return a zero value copy"*

I tried the same pipeline on a Spark cluster (v1.6); there it runs without
any error but doesn't perform the join operations on the streams. A sketch
of the join is below.
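
For reference, a minimal sketch of the shape of that join (the key/value
types, window size, and tag names are placeholders, not my actual code);
both inputs are streams already keyed by the common business key:

    import org.apache.beam.sdk.transforms.join.CoGbkResult;
    import org.apache.beam.sdk.transforms.join.CoGroupByKey;
    import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TupleTag;
    import org.joda.time.Duration;

    public class WindowedJoinSketch {

      // Join two streams keyed by the common business key.
      static PCollection<KV<String, CoGbkResult>> join(
          PCollection<KV<String, String>> left,
          PCollection<KV<String, String>> right) {

        TupleTag<String> leftTag = new TupleTag<>();
        TupleTag<String> rightTag = new TupleTag<>();

        // Both inputs get the same fixed window so CoGroupByKey can align
        // the elements per key and window.
        Window<KV<String, String>> window =
            Window.into(FixedWindows.of(Duration.standardMinutes(1)));

        return KeyedPCollectionTuple
            .of(leftTag, left.apply("WindowLeft", window))
            .and(rightTag, right.apply("WindowRight", window))
            .apply(CoGroupByKey.<String>create());
      }
    }

In the real pipeline the two inputs come from KafkaIO reads, keyed by the
business key via GroupByKey before this step.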

I have a couple of questions.

1. Does the Spark runner support Spark 2.x?

2. Regarding triggers, currently only ProcessingTimeTrigger is listed as
supported in the Capability Matrix
<https://beam.apache.org/documentation/runners/capability-matrix/#cap-summary-what>.
Can we expect support for more triggers in the near future? Also, are
GroupByKey and accumulating panes supported for Spark streaming pipelines?

3. According to the documentation, the storage level
<https://beam.apache.org/documentation/runners/spark/#pipeline-options-for-the-spark-runner>
is set to IN_MEMORY for streaming pipelines. Can we configure it to use
disk as well?

4. Is checkpointing supported for the Spark runner? If the Beam pipeline
fails unexpectedly, can we read the state from the last run?

It would be great if someone could help with the questions above.


--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
