> So you're suggesting we copy-paste the current runner and adapt whatever is
> necessary so it runs with Spark 2?

Yes

> This also means any bug-fix / improvement would have to be maintained in
> two runners, and I wouldn't wanna do that.

No, this is the reason I first proposed to deprecate the Spark 1
runner (doing eventual bug fixes but no new development on it) to keep
maintenance minimal, so the current line of development/maintenance
would move into the Spark 2 RDD version. Additionally, in parallel we
can progress on the Dataset-based translation, but this would be
considered experimental, with no maintenance commitments. Finally,
when the Dataset version is mature, we would get rid of the RDD one.

This has the additional advantage of not having a long-lived branch
or doing a full rewrite as a starting point.

> I don't like to think in terms of Spark 1/2 but in terms of RDD/Dataset API.
> Since the RDD API is mature, it should be the runner in master (not
> preventing another runner once Dataset API is mature enough) and the
> version (1.6.3 or 2.x) should be determined by the common installation.

I agree with you, but the reality is that people who run on clusters
need the specific versions of the libraries; this is independent of
the APIs.

Don’t you think that taking the approach I described at least reduces
the maintenance burden a little? Of course I understand your
hesitation, but if we decide on the exact set of features that won’t
be supported in the Spark 1 runner, we can branch out from it. Of
course, again, this decision is entirely up to you.

On Wed, Mar 15, 2017 at 5:57 PM, Amit Sela <amitsel...@gmail.com> wrote:
> So you're suggesting we copy-paste the current runner and adapt whatever is
> necessary so it runs with Spark 2?
> This also means any bug-fix / improvement would have to be maintained in
> two runners, and I wouldn't wanna do that.
>
> I don't like to think in terms of Spark 1/2 but in terms of RDD/Dataset API.
> Since the RDD API is mature, it should be the runner in master (not
> preventing another runner once Dataset API is mature enough) and the
> version (1.6.3 or 2.x) should be determined by the common installation.
>
> That's why I believe we still need to leave things as they are, but start
> working on the Dataset API runner.
> Otherwise, we'll have the current runner, another RDD API runner with Spark
> 2, and a third one for the Dataset API. I don't want to maintain all of
> them. It's a mess.
>
> On Wed, Mar 15, 2017 at 6:39 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>
>> > However, I do feel that we should use the Dataset API, starting with
>> > batch support first. WDYT?
>>
>> Well, this is exactly the current status quo, and it will take us some
>> time to have something for Spark 2 as complete as what we have with the
>> Spark 1 runner.
>>
>> The other proposal has two advantages:
>>
>> One is that we can leverage the existing implementation (with the
>> needed adjustments) to run Beam pipelines on Spark 2. In the end, final
>> users don’t care so much whether pipelines are translated via
>> RDD/DStream or Dataset; they just want to know that with Beam they can
>> run their code on their favorite data processing framework.
>>
>> The other advantage is that we can base the work on the latest Spark
>> version and advance simultaneously on translators for both APIs; once
>> we consider the Dataset translator mature enough, we can stop
>> maintaining the RDD one and make the Dataset translator the official
>> one.
>>
>> The only missing piece is backporting new developments on the
>> RDD-based translator from the Spark 2 version into the Spark 1 one,
>> but maybe this won’t be so hard if we consider what you said, that at
>> this point we are getting closer to having streaming right (of course,
>> you are the most appropriate person to decide if we are in good enough
>> shape for this, so that backporting things won’t be so hard).
>>
>> Finally, I agree with you: I would prefer a nice, full-featured
>> translator based on the Structured Streaming API, but the question is
>> how much time it will take to get into shape, and the impact on final
>> users who are already requesting this. This is why I think the more
>> conservative approach (keeping the RDD translator around) and moving
>> incrementally makes sense.
>>
>> On Wed, Mar 15, 2017 at 4:52 PM, Amit Sela <amitsel...@gmail.com> wrote:
>> > I feel that as we're getting closer to supporting streaming with the
>> > Spark 1 runner, and having Structured Streaming advance in Spark 2, we
>> > could start work on a Spark 2 runner in a separate branch.
>> >
>> > However, I do feel that we should use the Dataset API, starting with
>> > batch support first. WDYT?
>> >
>> > On Wed, Mar 15, 2017 at 5:47 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>> >
>> >> > So you propose to have the Spark 2 branch be a clone of the current
>> >> > one, with adaptations around Context->Session,
>> >> > Accumulator->AccumulatorV2, etc., while still using the RDD API?
>> >>
>> >> Yes, this is exactly what I have in mind.
>> >>
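>> >> To make it concrete, the kind of adaptation I have in mind is roughly
>> >> the following (just an untested sketch with made-up names, not the
>> >> actual runner code):
>> >>
>> >>   import org.apache.spark.Accumulator;
>> >>   import org.apache.spark.SparkConf;
>> >>   import org.apache.spark.api.java.JavaSparkContext;
>> >>   import org.apache.spark.sql.SparkSession;
>> >>   import org.apache.spark.util.LongAccumulator;
>> >>
>> >>   // Spark 1.x style: JavaSparkContext plus the old Accumulator API.
>> >>   SparkConf conf = new SparkConf().setAppName("beam-on-spark");
>> >>   JavaSparkContext jsc = new JavaSparkContext(conf);
>> >>   Accumulator<Integer> counter = jsc.accumulator(0);
>> >>
>> >>   // Spark 2.x style: get the context through a SparkSession and use
>> >>   // the built-in AccumulatorV2 implementations instead.
>> >>   SparkSession session =
>> >>       SparkSession.builder().config(conf).getOrCreate();
>> >>   JavaSparkContext jsc2 =
>> >>       JavaSparkContext.fromSparkContext(session.sparkContext());
>> >>   LongAccumulator counter2 =
>> >>       session.sparkContext().longAccumulator("counter");
>> >>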
>> >> > I think that having another Spark runner is great if it adds value;
>> >> > otherwise, let's just bump the version.
>> >>
>> >> There is value, because most people are already starting to move to
>> >> Spark 2 and all Big Data distribution providers support it now, as
>> >> well as the cloud-based distributions (Dataproc and EMR), unlike the
>> >> last time we had this discussion.
>> >>
>> >> > We could think of starting to migrate the Spark 1 runner to Spark 2
>> >> > and follow with Dataset API support feature-by-feature as it advances,
>> >> > but I think most Spark installations today still run 1.x, or am I
>> >> > wrong?
>> >>
>> >> No, you are right; that’s why I didn’t even mention removing the
>> >> Spark 1 runner. I know that having to support things for both versions
>> >> can add additional work for us, but maybe the best approach would be
>> >> to continue the work only in the Spark 2 runner (both refining the
>> >> RDD-based translator and starting to create the Dataset one there, so
>> >> that they co-exist until the Dataset API is mature enough) and keep
>> >> the Spark 1 runner only for bug fixes for the users who are still
>> >> using it (this way we don’t have to keep backporting stuff). Do you
>> >> see any other particular issue?
>> >>
>> >> Ismaël
>> >>
>> >> On Wed, Mar 15, 2017 at 3:39 PM, Amit Sela <amitsel...@gmail.com>
>> >> wrote:
>> >> > So you propose to have the Spark 2 branch be a clone of the current
>> >> > one, with adaptations around Context->Session,
>> >> > Accumulator->AccumulatorV2, etc., while still using the RDD API?
>> >> >
>> >> > I think that having another Spark runner is great if it adds value;
>> >> > otherwise, let's just bump the version.
>> >> > My idea of having another runner for Spark was not to support more
>> >> > versions - we should always support the most popular version in terms
>> >> > of compatibility - the idea was to try and make Beam work with
>> >> > Structured Streaming, which is still not fully mature; that's why
>> >> > we're not investing heavily there.
>> >> >
>> >> > We could think of starting to migrate the Spark 1 runner to Spark 2
>> >> > and follow with Dataset API support feature-by-feature as it advances,
>> >> > but I think most Spark installations today still run 1.x, or am I
>> >> > wrong?
>> >> >
>> >> > On Wed, Mar 15, 2017 at 4:26 PM Ismaël Mejía <ieme...@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> BIG +1 JB,
>> >> >>
>> >> >> If we can just bump the version number with minor changes, staying
>> >> >> as close as possible to the current Spark 1 implementation, we can
>> >> >> go faster and offer in principle the exact same support, but for
>> >> >> version 2.
>> >> >>
>> >> >> I know that the advanced streaming stuff based on the Dataset API
>> >> >> won't be there, but with this common canvas the community can
>> >> >> iterate to create a Dataset-based translator at the same time. In
>> >> >> particular, I consider the most important thing to be that the
>> >> >> Spark 2 branch should not live for a long time; it should be merged
>> >> >> into master really fast for the benefit of everybody.
>> >> >>
>> >> >> Ismaël
>> >> >>
>> >> >>
>> >> >> On Wed, Mar 15, 2017 at 1:57 PM, Jean-Baptiste Onofré
>> >> >> <j...@nanthrax.net> wrote:
>> >> >> > Hi Amit,
>> >> >> >
>> >> >> > What do you think of the following:
>> >> >> >
>> >> >> > - in the meantime, while you reintroduce the Spark 2 branch, what
>> >> >> > about "extending" the version in the current Spark runner? Still
>> >> >> > using RDD/DStream, I think we can support Spark 2.x even if we
>> >> >> > don't yet leverage the newly provided features.
>> >> >> >
>> >> >> > Thoughts?
>> >> >> >
>> >> >> > Regards
>> >> >> > JB
>> >> >> >
>> >> >> >
>> >> >> > On 03/15/2017 07:39 PM, Amit Sela wrote:
>> >> >> >>
>> >> >> >> Hi Cody,
>> >> >> >>
>> >> >> >> I will re-introduce this branch soon as part of the work on
>> >> >> >> BEAM-913 <https://issues.apache.org/jira/browse/BEAM-913>.
>> >> >> >> For now, and from previous experience with the mentioned branch,
>> >> >> >> the batch implementation should be straightforward.
>> >> >> >> The only issue is with streaming support - in the current runner
>> >> >> >> (Spark 1.x) we have experimental support for windows/triggers and
>> >> >> >> we're working towards full streaming support.
>> >> >> >> With Spark 2.x, there is no "general-purpose" stateful operator
>> >> >> >> for the Dataset API, so I was waiting to see if the new operator
>> >> >> >> <https://github.com/apache/spark/pull/17179> planned for the next
>> >> >> >> version could help with that.
>> >> >> >>
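>> >> >> >> For reference, going by the PR description, the new operator
>> >> >> >> would look roughly like this (a hypothetical sketch against the
>> >> >> >> in-progress API, with made-up Event/events types - the final
>> >> >> >> signatures may well differ):
>> >> >> >>
>> >> >> >>   import org.apache.spark.api.java.function.MapFunction;
>> >> >> >>   import org.apache.spark.api.java.function.MapGroupsWithStateFunction;
>> >> >> >>   import org.apache.spark.sql.Dataset;
>> >> >> >>   import org.apache.spark.sql.Encoders;
>> >> >> >>
>> >> >> >>   // Keep a running count per key across micro-batches - the
>> >> >> >>   // kind of keyed state a Beam translator would need in order
>> >> >> >>   // to implement windows/triggers on top of Datasets.
>> >> >> >>   Dataset<Long> counts = events
>> >> >> >>       .groupByKey((MapFunction<Event, String>) Event::getKey,
>> >> >> >>           Encoders.STRING())
>> >> >> >>       .mapGroupsWithState(
>> >> >> >>           (MapGroupsWithStateFunction<String, Event, Long, Long>)
>> >> >> >>               (key, values, state) -> {
>> >> >> >>                 long count = state.exists() ? state.get() : 0L;
>> >> >> >>                 while (values.hasNext()) { values.next(); count++; }
>> >> >> >>                 state.update(count); // kept for the next batch
>> >> >> >>                 return count;
>> >> >> >>               },
>> >> >> >>           Encoders.LONG(), Encoders.LONG());
>> >> >> >>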
>> >> >> >> To summarize, I will introduce a skeleton for the Spark 2 runner
>> >> >> >> with batch support, as a separate branch, as soon as I can.
>> >> >> >>
>> >> >> >> Thanks,
>> >> >> >> Amit
>> >> >> >>
>> >> >> >> On Wed, Mar 15, 2017 at 9:07 AM Cody Innowhere
>> >> >> >> <e.neve...@gmail.com> wrote:
>> >> >> >>
>> >> >> >>> Hi guys,
>> >> >> >>> Is there anybody currently working on a Spark 2.x runner? An
>> >> >> >>> old PR for a Spark 2.x runner was closed a few days ago, so I
>> >> >> >>> wonder what the status is now. Is there a roadmap for this?
>> >> >> >>> Thanks~
>> >> >> >>>
>> >> >> >>
>> >> >> >
>> >> >> > --
>> >> >> > Jean-Baptiste Onofré
>> >> >> > jbono...@apache.org
>> >> >> > http://blog.nanthrax.net
>> >> >> > Talend - http://www.talend.com
>> >> >>
>> >>
>>
