> So you propose to have the Spark 2 branch a clone of the current one with
> adaptations around Context->Session, Accumulator->AccumulatorV2 etc. while
> still using the RDD API ?

Yes this is exactly what I have in mind.
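
For context, the adaptations are mostly mechanical API renames. A minimal sketch of the Spark 1 -> Spark 2 changes (illustrative only, not the actual runner code; assumes the usual Spark 2.x artifacts on the classpath):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.util.LongAccumulator;

public class Spark2Adaptations {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("beam-spark2").setMaster("local[1]");

    // Spark 1.x style: new JavaSparkContext(conf); jsc.accumulator(0)
    // Spark 2.x style: the entry point is SparkSession, and accumulators
    // implement AccumulatorV2 (LongAccumulator is a built-in one).
    SparkSession session = SparkSession.builder().config(conf).getOrCreate();
    JavaSparkContext jsc =
        JavaSparkContext.fromSparkContext(session.sparkContext());

    LongAccumulator records = session.sparkContext().longAccumulator("records");

    jsc.parallelize(java.util.Arrays.asList(1, 2, 3))
        .foreach(x -> records.add(1L));

    System.out.println(records.value()); // number of elements seen
    session.stop();
  }
}
```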

Yes, this is exactly what I have in mind: keep the RDD/DStream-based
translation, and only adapt the API surface that changed between Spark
1.x and 2.x.
> I think that having another Spark runner is great if it has value,
> otherwise, let's just bump the version.

There is value because most people are already starting to move to
Spark 2, all the Big Data distribution providers support it now, and so
do the cloud-based distributions (Dataproc and EMR), unlike the last
time we had this discussion.

> We could think of starting to migrate the Spark 1 runner to Spark 2 and
> follow with Dataset API support feature-by-feature as it advances, but I
> think most Spark installations today still run 1.X, or am I wrong ?

No, you are right, and that's why I didn't even mention removing the
Spark 1 runner. I know that supporting things for both versions adds
extra work for us, but maybe the best approach would be to continue the
work only in the Spark 2 runner (both refining the RDD-based translator
and starting to create the Dataset-based one there, so the two co-exist
until the Dataset API is mature enough), and keep the Spark 1 runner
only for bug fixes for the users who are still on it (that way we don't
have to keep backporting stuff). Do you see any other particular issue?
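
To make the "two translators co-existing" idea concrete, the rough shape could be something like this (all names here are hypothetical, not actual Beam classes; the real selection mechanism would go through pipeline options):

```java
// Hypothetical sketch of an RDD-based and a Dataset-based translator
// co-existing in the Spark 2 runner, selected by a pipeline option.
public class TranslatorSelection {

  interface SparkPipelineTranslator {
    String describe();
  }

  // Mature path: the translator ported from the Spark 1 runner.
  static class RddBasedTranslator implements SparkPipelineTranslator {
    public String describe() { return "rdd"; }
  }

  // Experimental path: built incrementally against the Dataset API.
  static class DatasetBasedTranslator implements SparkPipelineTranslator {
    public String describe() { return "dataset"; }
  }

  // Default to the RDD path; users opt in to the Dataset path until it
  // reaches feature parity.
  static SparkPipelineTranslator forOptions(boolean useDatasetApi) {
    return useDatasetApi ? new DatasetBasedTranslator() : new RddBasedTranslator();
  }

  public static void main(String[] args) {
    System.out.println(forOptions(false).describe()); // rdd
    System.out.println(forOptions(true).describe());  // dataset
  }
}
```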

Ismaël

On Wed, Mar 15, 2017 at 3:39 PM, Amit Sela <amitsel...@gmail.com> wrote:
> So you propose to have the Spark 2 branch a clone of the current one with
> adaptations around Context->Session, Accumulator->AccumulatorV2 etc. while
> still using the RDD API ?
>
> I think that having another Spark runner is great if it has value,
> otherwise, let's just bump the version.
> My idea of having another runner for Spark was not to support more versions
> - we should always support the most popular version in terms of
> compatibility - the idea was to try and make Beam work with Structured
> Streaming, which is still not fully mature, which is why we're not
> heavily investing there.
>
> We could think of starting to migrate the Spark 1 runner to Spark 2 and
> follow with Dataset API support feature-by-feature as it advances, but I
> think most Spark installations today still run 1.X, or am I wrong ?
>
> On Wed, Mar 15, 2017 at 4:26 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>
>> BIG +1 JB,
>>
>> If we can just bump the version number with minor changes, staying as
>> close as possible to the current Spark 1 implementation, we can go
>> faster and offer in principle the exact same support, but for version
>> 2.
>>
>> I know that the advanced streaming stuff based on the Dataset API
>> won't be there, but with this common canvas the community can iterate
>> on a Dataset-based translator at the same time. In particular, I
>> consider the most important thing to be that the Spark 2 branch should
>> not live for a long time; it should be merged into master quickly for
>> the benefit of everybody.
>>
>> Ismaël
>>
>>
>> On Wed, Mar 15, 2017 at 1:57 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>> > Hi Amit,
>> >
>> > What do you think of the following:
>> >
>> > - in the meantime, while you reintroduce the Spark 2 branch, what
>> > about "extending" the version in the current Spark runner? Still
>> > using RDD/DStream, I think we can support Spark 2.x even if we don't
>> > yet leverage the new provided features.
>> >
>> > Thoughts ?
>> >
>> > Regards
>> > JB
>> >
>> >
>> > On 03/15/2017 07:39 PM, Amit Sela wrote:
>> >>
>> >> Hi Cody,
>> >>
>> >> I will re-introduce this branch soon as part of the work on BEAM-913
>> >> <https://issues.apache.org/jira/browse/BEAM-913>.
>> >> For now, and from previous experience with the mentioned branch, the
>> >> batch implementation should be straightforward.
>> >> The only issue is streaming support: in the current runner (Spark
>> >> 1.x) we have experimental support for windows/triggers, and we're
>> >> working towards full streaming support.
>> >> With Spark 2.x, there is no "general-purpose" stateful operator for
>> >> the Dataset API, so I was waiting to see if the new operator
>> >> <https://github.com/apache/spark/pull/17179> planned for the next
>> >> version could help with that.
>> >>
>> >> To summarize, I will introduce a skeleton for the Spark 2 runner
>> >> with batch support as soon as I can, as a separate branch.
>> >>
>> >> Thanks,
>> >> Amit
>> >>
>> >> On Wed, Mar 15, 2017 at 9:07 AM Cody Innowhere <e.neve...@gmail.com>
>> >> wrote:
>> >>
>> >>> Hi guys,
>> >>> Is there anybody currently working on a Spark 2.x runner? An old
>> >>> PR for the Spark 2.x runner was closed a few days ago, so I wonder
>> >>> what the status is now, and is there a roadmap for this?
>> >>> Thanks~
>> >>>
>> >>
>> >
>> > --
>> > Jean-Baptiste Onofré
>> > jbono...@apache.org
>> > http://blog.nanthrax.net
>> > Talend - http://www.talend.com
>>
