Re: Beam spark 2.x runner status

2017-08-21 Thread Holden Karau
I'd love to take a look at the PR when it comes in (<3 BEAM + SPARK :)).

On Mon, Aug 21, 2017 at 11:33 AM, Jean-Baptiste Onofré 
wrote:

> Hi
>
> I did a new runner supporting spark 2.1.x. I changed code for that.
>
> I'm still in vacation this week. I will send an update when back.
>
> Regards
> JB
>
> On Aug 21, 2017, 09:01, at 09:01, Pei HE  wrote:
> >Any updates for upgrading to spark 2.x?
> >
> >I tried to replace the dependency and found a compile error from
> >implementing a scala trait:
> >org.apache.beam.runners.spark.io.SourceRDD.SourcePartition is not
> >abstract
> >and does not override abstract method
> >org$apache$spark$Partition$$super$equals(java.lang.Object) in
> >org.apache.spark.Partition
> >
> >(The spark side change was introduced in
> >https://github.com/apache/spark/pull/12157.)
> >
> >Does anyone have ideas about this compile error?

Re: Beam spark 2.x runner status

2017-08-21 Thread Jean-Baptiste Onofré
Hi

I created a new runner supporting Spark 2.1.x; I changed the code for that.

I'm still on vacation this week. I will send an update when I'm back.

Regards
JB

On Aug 21, 2017, 09:01, at 09:01, Pei HE  wrote:
>Any updates for upgrading to spark 2.x?
>
>I tried to replace the dependency and found a compile error from
>implementing a scala trait:
>org.apache.beam.runners.spark.io.SourceRDD.SourcePartition is not
>abstract
>and does not override abstract method
>org$apache$spark$Partition$$super$equals(java.lang.Object) in
>org.apache.spark.Partition
>
>(The spark side change was introduced in
>https://github.com/apache/spark/pull/12157.)
>
>Does anyone have ideas about this compile error?

Re: Beam spark 2.x runner status

2017-08-21 Thread Pei HE
Any updates on upgrading to Spark 2.x?

I tried to replace the dependency and found a compile error from
implementing a Scala trait:
org.apache.beam.runners.spark.io.SourceRDD.SourcePartition is not abstract
and does not override abstract method
org$apache$spark$Partition$$super$equals(java.lang.Object) in
org.apache.spark.Partition

(The spark side change was introduced in
https://github.com/apache/spark/pull/12157.)

Does anyone have ideas about this compile error?


On Wed, May 3, 2017 at 1:32 PM, Jean-Baptiste Onofré 
wrote:

> Hi Ted,
>
> My branch used Spark 2.1.0 and I just updated to 2.1.1.
>
> As discussed with Aviem, I should be able to create the pull request later
> today.
>
> Regards
> JB
>
>
> On 05/03/2017 02:50 AM, Ted Yu wrote:
>
>> Spark 2.1.1 has been released.
>>
>> Consider using the new release in this work.
>>
>> Thanks

Re: Beam spark 2.x runner status

2017-03-29 Thread Jean-Baptiste Onofré

Cool for the PR merge, I will rebase my branch on it.

Thanks !
Regards
JB

On 03/29/2017 01:58 PM, Amit Sela wrote:

@Ted definitely makes sense.
@JB I'm merging https://github.com/apache/beam/pull/2354 soon so any
deprecated Spark API issues should be resolved.

On Wed, Mar 29, 2017 at 2:46 PM Ted Yu  wrote:


This is what I did over HBASE-16179:

-f.call((asJavaIterator(it), conn)).iterator()
+// the return type is different in spark 1.x & 2.x, we handle both
cases
+f.call(asJavaIterator(it), conn) match {
+  // spark 1.x
+  case iterable: Iterable[R] => iterable.iterator()
+  // spark 2.x
+  case iterator: Iterator[R] => iterator
+}
   )

FYI

On Wed, Mar 29, 2017 at 1:47 AM, Amit Sela  wrote:


Just tried to replace dependencies and see what happens:

Most required changes are about the runner using deprecated Spark APIs,

and

after fixing them the only real issue is with the Java API for
Pair/FlatMapFunction that changed return value to Iterator (in 1.6 its
Iterable).

So I'm not sure that a profile that simply sets dependency on 1.6.3/2.1.0
is feasible.

On Thu, Mar 23, 2017 at 10:22 AM Kobi Salant 
wrote:


So, if everything is in place in Spark 2.X and we use provided

dependencies

for Spark in Beam.
Theoretically, you can run the same code in 2.X without any need for a
branch?

2017-03-23 9:47 GMT+02:00 Amit Sela :


If StreamingContext is valid and we don't have to use SparkSession,

and

Accumulators are valid as well and we don't need AccumulatorsV2, I

don't

see a reason this shouldn't work (which means there are still tons of
reasons this could break, but I can't think of them off the top of my

head

right now).

@JB simply add a profile for the Spark dependencies and run the

tests -

you'll have a very definitive answer ;-) .
If this passes, try on a cluster running Spark 2 as well.

Let me know of I can assist.

On Thu, Mar 23, 2017 at 6:55 AM Jean-Baptiste Onofré <

j...@nanthrax.net>

wrote:


Hi guys,

Ismaël summarize well what I have in mind.

I'm a bit late on the PoC around that (I started a branch already).
I will move forward over the week end.

Regards
JB

On 03/22/2017 11:42 PM, Ismaël Mejía wrote:

Amit, I suppose JB is talking about the RDD based version, so no

need

to worry about SparkSession or different incompatible APIs.

Remember the idea we are discussing is to have in master both the
spark 1 and spark 2 runners using the RDD based translation. At

the

same time we can have a feature branch to evolve the DataSet

based

translator (this one will replace the RDD based translator for

spark

2

once it is mature).

The advantages have been already discussed as well as the

possible

issues so I think we have to see now if JB's idea is feasible and

how

hard would be to live with this while the DataSet version

evolves.


I think what we are trying to avoid is to have a long living

branch

for a spark 2 runner based on RDD  because the maintenance burden
would be even worse. We would have to fight not only with the

double

merge of fixes (in case the profile idea does not work), but also

with

the continue evolution of Beam and we would end up in the long

living

branch mess that others runners have dealt with (e.g. the Apex

runner)






https://lists.apache.org/thread.html/12cc086f5ffe331cc70b89322ce541

6c3112b87efc3393e3e16032a2@%3Cdev.beam.apache.org%3E


What do you think about this Amit ? Would you be ok to go with it

if

JB's profile idea proves to help with the msintenance issues ?

Ismaël



On Wed, Mar 22, 2017 at 5:53 PM, Ted Yu 

wrote:

hbase-spark module doesn't use SparkSession. So situation there

is

simpler

:-)

On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela <

amitsel...@gmail.com>

wrote:



I'm still wondering how we'll do this - it's not just different
implementations of the same Class, but a completely different

concepts

such

as using SparkSession in Spark 2 instead of

SparkContext/StreamingContext

in Spark 1.

On Tue, Mar 21, 2017 at 7:25 PM Ted Yu 

wrote:



I have done some work over in HBASE-16179 where compatibility

modules

are

created to isolate changes in Spark 2.x API so that code in

hbase-spark

module can be reused.

FYI





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Beam spark 2.x runner status

2017-03-29 Thread Amit Sela
@Ted definitely makes sense.
@JB I'm merging https://github.com/apache/beam/pull/2354 soon so any
deprecated Spark API issues should be resolved.
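
For illustration, one example of the kind of deprecated-API cleanup meant here (a sketch with made-up names, not necessarily what that PR touched): in the streaming path, the Spark 1.x foreachRDD overload taking a Function<..., Void> was deprecated in 1.6 and removed in 2.0 in favour of VoidFunction, so runner-style code has to use the newer shape to compile on both:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.api.java.JavaDStream;

class ForEachExample {

  // Deprecated in Spark 1.6 and removed in 2.x:
  //   stream.foreachRDD(new Function<JavaRDD<String>, Void>() { ... });
  // Replacement that compiles on both 1.6 and 2.x:
  static void printEach(JavaDStream<String> stream) {
    stream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
      @Override
      public void call(JavaRDD<String> rdd) {
        // print every element of each micro-batch RDD
        rdd.foreach(element -> System.out.println(element));
      }
    });
  }
}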

On Wed, Mar 29, 2017 at 2:46 PM Ted Yu  wrote:

> This is what I did over HBASE-16179:
>
> -f.call((asJavaIterator(it), conn)).iterator()
> +// the return type is different in spark 1.x & 2.x, we handle both
> cases
> +f.call(asJavaIterator(it), conn) match {
> +  // spark 1.x
> +  case iterable: Iterable[R] => iterable.iterator()
> +  // spark 2.x
> +  case iterator: Iterator[R] => iterator
> +}
>)
>
> FYI
>
> On Wed, Mar 29, 2017 at 1:47 AM, Amit Sela  wrote:
>
> > Just tried to replace dependencies and see what happens:
> >
> > Most required changes are about the runner using deprecated Spark APIs,
> and
> > after fixing them the only real issue is with the Java API for
> > Pair/FlatMapFunction that changed return value to Iterator (in 1.6 its
> > Iterable).
> >
> > So I'm not sure that a profile that simply sets dependency on 1.6.3/2.1.0
> > is feasible.
> >
> > On Thu, Mar 23, 2017 at 10:22 AM Kobi Salant 
> > wrote:
> >
> > > So, if everything is in place in Spark 2.X and we use provided
> > dependencies
> > > for Spark in Beam.
> > > Theoretically, you can run the same code in 2.X without any need for a
> > > branch?
> > >
> > > 2017-03-23 9:47 GMT+02:00 Amit Sela :
> > >
> > > > If StreamingContext is valid and we don't have to use SparkSession,
> and
> > > > Accumulators are valid as well and we don't need AccumulatorsV2, I
> > don't
> > > > see a reason this shouldn't work (which means there are still tons of
> > > > reasons this could break, but I can't think of them off the top of my
> > > head
> > > > right now).
> > > >
> > > > @JB simply add a profile for the Spark dependencies and run the
> tests -
> > > > you'll have a very definitive answer ;-) .
> > > > If this passes, try on a cluster running Spark 2 as well.
> > > >
> > > > Let me know of I can assist.
> > > >
> > > > On Thu, Mar 23, 2017 at 6:55 AM Jean-Baptiste Onofré <
> j...@nanthrax.net>
> > > > wrote:
> > > >
> > > > > Hi guys,
> > > > >
> > > > > Ismaël summarize well what I have in mind.
> > > > >
> > > > > I'm a bit late on the PoC around that (I started a branch already).
> > > > > I will move forward over the week end.
> > > > >
> > > > > Regards
> > > > > JB
> > > > >
> > > > > On 03/22/2017 11:42 PM, Ismaël Mejía wrote:
> > > > > > Amit, I suppose JB is talking about the RDD based version, so no
> > need
> > > > > > to worry about SparkSession or different incompatible APIs.
> > > > > >
> > > > > > Remember the idea we are discussing is to have in master both the
> > > > > > spark 1 and spark 2 runners using the RDD based translation. At
> the
> > > > > > same time we can have a feature branch to evolve the DataSet
> based
> > > > > > translator (this one will replace the RDD based translator for
> > spark
> > > 2
> > > > > > once it is mature).
> > > > > >
> > > > > > The advantages have been already discussed as well as the
> possible
> > > > > > issues so I think we have to see now if JB's idea is feasible and
> > how
> > > > > > hard would be to live with this while the DataSet version
> evolves.
> > > > > >
> > > > > > I think what we are trying to avoid is to have a long living
> branch
> > > > > > for a spark 2 runner based on RDD  because the maintenance burden
> > > > > > would be even worse. We would have to fight not only with the
> > double
> > > > > > merge of fixes (in case the profile idea does not work), but also
> > > with
> > > > > > the continue evolution of Beam and we would end up in the long
> > living
> > > > > > branch mess that others runners have dealt with (e.g. the Apex
> > > runner)
> > > > > >
> > > > > >
> > > > >
> https://lists.apache.org/thread.html/12cc086f5ffe331cc70b89322ce541
> > > > 6c3112b87efc3393e3e16032a2@%3Cdev.beam.apache.org%3E
> > > > > >
> > > > > > What do you think about this Amit ? Would you be ok to go with it
> > if
> > > > > > JB's profile idea proves to help with the msintenance issues ?
> > > > > >
> > > > > > Ismaël
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Mar 22, 2017 at 5:53 PM, Ted Yu 
> > wrote:
> > > > > >> hbase-spark module doesn't use SparkSession. So situation there
> is
> > > > > simpler
> > > > > >> :-)
> > > > > >>
> > > > > >> On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela <
> amitsel...@gmail.com>
> > > > > wrote:
> > > > > >>
> > > > > >>> I'm still wondering how we'll do this - it's not just different
> > > > > >>> implementations of the same Class, but a completely different
> > > > concepts
> > > > > such
> > > > > >>> as using SparkSession in Spark 2 instead of
> > > > > SparkContext/StreamingContext
> > > > > >>> in Spark 1.
> > > > > >>>
> > > > > >>> On Tue, Mar 21, 2017 at 7:25 PM Ted Yu 
> 

Re: Beam spark 2.x runner status

2017-03-29 Thread Ted Yu
This is what I did over HBASE-16179:

-f.call((asJavaIterator(it), conn)).iterator()
+// the return type is different in spark 1.x & 2.x, we handle both cases
+f.call(asJavaIterator(it), conn) match {
+  // spark 1.x
+  case iterable: Iterable[R] => iterable.iterator()
+  // spark 2.x
+  case iterator: Iterator[R] => iterator
+}
 )

FYI
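
For a Java code base, the same trick can be expressed as a small runtime check. The helper below is only an illustrative sketch (the class and method names are made up, not Beam or HBase code); it normalizes either return type to an Iterator so one call site can work against both Spark versions:

import java.util.Iterator;

final class ResultIterators {

  private ResultIterators() {}

  @SuppressWarnings("unchecked")
  static <T> Iterator<T> asIterator(Object result) {
    if (result instanceof Iterator) {
      // Spark 2.x flat-map style: the user function already returns an Iterator.
      return (Iterator<T>) result;
    }
    if (result instanceof Iterable) {
      // Spark 1.x flat-map style: the user function returns an Iterable.
      return ((Iterable<T>) result).iterator();
    }
    throw new IllegalArgumentException(
        "Expected Iterable or Iterator, got: " + result.getClass().getName());
  }
}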

On Wed, Mar 29, 2017 at 1:47 AM, Amit Sela  wrote:

> Just tried to replace dependencies and see what happens:
>
> Most required changes are about the runner using deprecated Spark APIs, and
> after fixing them the only real issue is with the Java API for
> Pair/FlatMapFunction that changed return value to Iterator (in 1.6 its
> Iterable).
>
> So I'm not sure that a profile that simply sets dependency on 1.6.3/2.1.0
> is feasible.
>
> On Thu, Mar 23, 2017 at 10:22 AM Kobi Salant 
> wrote:
>
> > So, if everything is in place in Spark 2.X and we use provided
> dependencies
> > for Spark in Beam.
> > Theoretically, you can run the same code in 2.X without any need for a
> > branch?
> >
> > 2017-03-23 9:47 GMT+02:00 Amit Sela :
> >
> > > If StreamingContext is valid and we don't have to use SparkSession, and
> > > Accumulators are valid as well and we don't need AccumulatorsV2, I
> don't
> > > see a reason this shouldn't work (which means there are still tons of
> > > reasons this could break, but I can't think of them off the top of my
> > head
> > > right now).
> > >
> > > @JB simply add a profile for the Spark dependencies and run the tests -
> > > you'll have a very definitive answer ;-) .
> > > If this passes, try on a cluster running Spark 2 as well.
> > >
> > > Let me know of I can assist.
> > >
> > > On Thu, Mar 23, 2017 at 6:55 AM Jean-Baptiste Onofré 
> > > wrote:
> > >
> > > > Hi guys,
> > > >
> > > > Ismaël summarize well what I have in mind.
> > > >
> > > > I'm a bit late on the PoC around that (I started a branch already).
> > > > I will move forward over the week end.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 03/22/2017 11:42 PM, Ismaël Mejía wrote:
> > > > > Amit, I suppose JB is talking about the RDD based version, so no
> need
> > > > > to worry about SparkSession or different incompatible APIs.
> > > > >
> > > > > Remember the idea we are discussing is to have in master both the
> > > > > spark 1 and spark 2 runners using the RDD based translation. At the
> > > > > same time we can have a feature branch to evolve the DataSet based
> > > > > translator (this one will replace the RDD based translator for
> spark
> > 2
> > > > > once it is mature).
> > > > >
> > > > > The advantages have been already discussed as well as the possible
> > > > > issues so I think we have to see now if JB's idea is feasible and
> how
> > > > > hard would be to live with this while the DataSet version evolves.
> > > > >
> > > > > I think what we are trying to avoid is to have a long living branch
> > > > > for a spark 2 runner based on RDD  because the maintenance burden
> > > > > would be even worse. We would have to fight not only with the
> double
> > > > > merge of fixes (in case the profile idea does not work), but also
> > with
> > > > > the continue evolution of Beam and we would end up in the long
> living
> > > > > branch mess that others runners have dealt with (e.g. the Apex
> > runner)
> > > > >
> > > > >
> > > > https://lists.apache.org/thread.html/12cc086f5ffe331cc70b89322ce541
> > > 6c3112b87efc3393e3e16032a2@%3Cdev.beam.apache.org%3E
> > > > >
> > > > > What do you think about this Amit ? Would you be ok to go with it
> if
> > > > > JB's profile idea proves to help with the msintenance issues ?
> > > > >
> > > > > Ismaël
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Mar 22, 2017 at 5:53 PM, Ted Yu 
> wrote:
> > > > >> hbase-spark module doesn't use SparkSession. So situation there is
> > > > simpler
> > > > >> :-)
> > > > >>
> > > > >> On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela 
> > > > wrote:
> > > > >>
> > > > >>> I'm still wondering how we'll do this - it's not just different
> > > > >>> implementations of the same Class, but a completely different
> > > concepts
> > > > such
> > > > >>> as using SparkSession in Spark 2 instead of
> > > > SparkContext/StreamingContext
> > > > >>> in Spark 1.
> > > > >>>
> > > > >>> On Tue, Mar 21, 2017 at 7:25 PM Ted Yu 
> > wrote:
> > > > >>>
> > > >  I have done some work over in HBASE-16179 where compatibility
> > > modules
> > > > are
> > > >  created to isolate changes in Spark 2.x API so that code in
> > > > hbase-spark
> > > >  module can be reused.
> > > > 
> > > >  FYI
> > > > 
> > > > >>>
> > > >
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbono...@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> > > >
> > >
> >
>


Re: Beam spark 2.x runner status

2017-03-29 Thread Amit Sela
I just tried to replace the dependencies to see what happens:

Most of the required changes are about the runner using deprecated Spark APIs,
and after fixing them the only real issue is with the Java API for
Pair/FlatMapFunction, which changed its return value to Iterator (in 1.6 it's
Iterable).

So I'm not sure that a profile that simply sets the dependency to 1.6.3/2.1.0
is feasible.
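
To make the breaking change concrete, here is a minimal sketch of the Spark 2.x signature (illustrative only; SplitWords is a made-up example class, not runner code). Under Spark 1.6 the same call() would have to return an Iterable<String>, so a single source tree cannot compile against both versions unchanged:

import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.api.java.function.FlatMapFunction;

// Compiles against Spark 2.x, where FlatMapFunction.call() returns an Iterator.
// Against Spark 1.6 the override would instead have to be
//   public Iterable<String> call(String line) { return Arrays.asList(line.split(" ")); }
// which is why switching only the dependency version in a profile is not enough.
class SplitWords implements FlatMapFunction<String, String> {

  @Override
  public Iterator<String> call(String line) {
    return Arrays.asList(line.split(" ")).iterator();
  }
}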

On Thu, Mar 23, 2017 at 10:22 AM Kobi Salant  wrote:

> So, if everything is in place in Spark 2.X and we use provided dependencies
> for Spark in Beam.
> Theoretically, you can run the same code in 2.X without any need for a
> branch?
>
> 2017-03-23 9:47 GMT+02:00 Amit Sela :
>
> > If StreamingContext is valid and we don't have to use SparkSession, and
> > Accumulators are valid as well and we don't need AccumulatorsV2, I don't
> > see a reason this shouldn't work (which means there are still tons of
> > reasons this could break, but I can't think of them off the top of my
> head
> > right now).
> >
> > @JB simply add a profile for the Spark dependencies and run the tests -
> > you'll have a very definitive answer ;-) .
> > If this passes, try on a cluster running Spark 2 as well.
> >
> > Let me know of I can assist.
> >
> > On Thu, Mar 23, 2017 at 6:55 AM Jean-Baptiste Onofré 
> > wrote:
> >
> > > Hi guys,
> > >
> > > Ismaël summarize well what I have in mind.
> > >
> > > I'm a bit late on the PoC around that (I started a branch already).
> > > I will move forward over the week end.
> > >
> > > Regards
> > > JB
> > >
> > > On 03/22/2017 11:42 PM, Ismaël Mejía wrote:
> > > > Amit, I suppose JB is talking about the RDD based version, so no need
> > > > to worry about SparkSession or different incompatible APIs.
> > > >
> > > > Remember the idea we are discussing is to have in master both the
> > > > spark 1 and spark 2 runners using the RDD based translation. At the
> > > > same time we can have a feature branch to evolve the DataSet based
> > > > translator (this one will replace the RDD based translator for spark
> 2
> > > > once it is mature).
> > > >
> > > > The advantages have been already discussed as well as the possible
> > > > issues so I think we have to see now if JB's idea is feasible and how
> > > > hard would be to live with this while the DataSet version evolves.
> > > >
> > > > I think what we are trying to avoid is to have a long living branch
> > > > for a spark 2 runner based on RDD  because the maintenance burden
> > > > would be even worse. We would have to fight not only with the double
> > > > merge of fixes (in case the profile idea does not work), but also
> with
> > > > the continue evolution of Beam and we would end up in the long living
> > > > branch mess that others runners have dealt with (e.g. the Apex
> runner)
> > > >
> > > >
> > > https://lists.apache.org/thread.html/12cc086f5ffe331cc70b89322ce541
> > 6c3112b87efc3393e3e16032a2@%3Cdev.beam.apache.org%3E
> > > >
> > > > What do you think about this Amit ? Would you be ok to go with it if
> > > > JB's profile idea proves to help with the msintenance issues ?
> > > >
> > > > Ismaël
> > > >
> > > >
> > > >
> > > > On Wed, Mar 22, 2017 at 5:53 PM, Ted Yu  wrote:
> > > >> hbase-spark module doesn't use SparkSession. So situation there is
> > > simpler
> > > >> :-)
> > > >>
> > > >> On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela 
> > > wrote:
> > > >>
> > > >>> I'm still wondering how we'll do this - it's not just different
> > > >>> implementations of the same Class, but a completely different
> > concepts
> > > such
> > > >>> as using SparkSession in Spark 2 instead of
> > > SparkContext/StreamingContext
> > > >>> in Spark 1.
> > > >>>
> > > >>> On Tue, Mar 21, 2017 at 7:25 PM Ted Yu 
> wrote:
> > > >>>
> > >  I have done some work over in HBASE-16179 where compatibility
> > modules
> > > are
> > >  created to isolate changes in Spark 2.x API so that code in
> > > hbase-spark
> > >  module can be reused.
> > > 
> > >  FYI
> > > 
> > > >>>
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
>


Re: Beam spark 2.x runner status

2017-03-23 Thread Amit Sela
If StreamingContext is valid and we don't have to use SparkSession, and
Accumulators are valid as well and we don't need AccumulatorV2, I don't
see a reason this shouldn't work (which means there are still tons of
reasons this could break, but I can't think of them off the top of my head
right now).

@JB simply add a profile for the Spark dependencies and run the tests -
you'll have a very definitive answer ;-) .
If this passes, try on a cluster running Spark 2 as well.

Let me know if I can assist.
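
As a rough sketch of what "still valid" means here (illustrative code with made-up names, not the runner's actual bootstrap): the Spark 1.x-style JavaSparkContext and accumulator APIs still compile and run on Spark 2.x, they are merely deprecated in favour of SparkSession and AccumulatorV2.

import org.apache.spark.Accumulator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.util.LongAccumulator;

class ContextBootstrap {

  // Spark 1.x style: SparkConf + JavaSparkContext; still present (if deprecated in parts) on 2.x.
  static JavaSparkContext sparkOneStyle() {
    SparkConf conf = new SparkConf().setAppName("beam-spark-runner").setMaster("local[2]");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    Accumulator<Integer> counter = jsc.accumulator(0);  // old accumulator API, deprecated in 2.x
    counter.add(1);
    return jsc;
  }

  // Spark 2.x style: SparkSession is the entry point, AccumulatorV2 replaces Accumulator.
  static JavaSparkContext sparkTwoStyle() {
    SparkSession session = SparkSession.builder()
        .appName("beam-spark-runner")
        .master("local[2]")
        .getOrCreate();
    LongAccumulator counter = session.sparkContext().longAccumulator("counter");
    counter.add(1L);
    // The underlying SparkContext is still available, so RDD-based code keeps working.
    return new JavaSparkContext(session.sparkContext());
  }
}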

On Thu, Mar 23, 2017 at 6:55 AM Jean-Baptiste Onofré 
wrote:

> Hi guys,
>
> Ismaël summarize well what I have in mind.
>
> I'm a bit late on the PoC around that (I started a branch already).
> I will move forward over the week end.
>
> Regards
> JB
>
> On 03/22/2017 11:42 PM, Ismaël Mejía wrote:
> > Amit, I suppose JB is talking about the RDD based version, so no need
> > to worry about SparkSession or different incompatible APIs.
> >
> > Remember the idea we are discussing is to have in master both the
> > spark 1 and spark 2 runners using the RDD based translation. At the
> > same time we can have a feature branch to evolve the DataSet based
> > translator (this one will replace the RDD based translator for spark 2
> > once it is mature).
> >
> > The advantages have been already discussed as well as the possible
> > issues so I think we have to see now if JB's idea is feasible and how
> > hard would be to live with this while the DataSet version evolves.
> >
> > I think what we are trying to avoid is to have a long living branch
> > for a spark 2 runner based on RDD  because the maintenance burden
> > would be even worse. We would have to fight not only with the double
> > merge of fixes (in case the profile idea does not work), but also with
> > the continue evolution of Beam and we would end up in the long living
> > branch mess that others runners have dealt with (e.g. the Apex runner)
> >
> >
> https://lists.apache.org/thread.html/12cc086f5ffe331cc70b89322ce5416c3112b87efc3393e3e16032a2@%3Cdev.beam.apache.org%3E
> >
> > What do you think about this Amit ? Would you be ok to go with it if
> > JB's profile idea proves to help with the msintenance issues ?
> >
> > Ismaël
> >
> >
> >
> > On Wed, Mar 22, 2017 at 5:53 PM, Ted Yu  wrote:
> >> hbase-spark module doesn't use SparkSession. So situation there is
> simpler
> >> :-)
> >>
> >> On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela 
> wrote:
> >>
> >>> I'm still wondering how we'll do this - it's not just different
> >>> implementations of the same Class, but a completely different concepts
> such
> >>> as using SparkSession in Spark 2 instead of
> SparkContext/StreamingContext
> >>> in Spark 1.
> >>>
> >>> On Tue, Mar 21, 2017 at 7:25 PM Ted Yu  wrote:
> >>>
>  I have done some work over in HBASE-16179 where compatibility modules
> are
>  created to isolate changes in Spark 2.x API so that code in
> hbase-spark
>  module can be reused.
> 
>  FYI
> 
> >>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Beam spark 2.x runner status

2017-03-23 Thread Kobi Salant
Hi,

We use SparkContext & StreamingContext extensively in the Spark runner to
create the DStreams & RDDs, so we will need to work on migrating from the 1.x
terms to the 2.x terms (we may find other incompatibilities during the work).

Regards
Kobi


2017-03-23 6:55 GMT+02:00 Jean-Baptiste Onofré :

> Hi guys,
>
> Ismaël summarize well what I have in mind.
>
> I'm a bit late on the PoC around that (I started a branch already).
> I will move forward over the week end.
>
> Regards
> JB
>
>
> On 03/22/2017 11:42 PM, Ismaël Mejía wrote:
>
>> Amit, I suppose JB is talking about the RDD based version, so no need
>> to worry about SparkSession or different incompatible APIs.
>>
>> Remember the idea we are discussing is to have in master both the
>> spark 1 and spark 2 runners using the RDD based translation. At the
>> same time we can have a feature branch to evolve the DataSet based
>> translator (this one will replace the RDD based translator for spark 2
>> once it is mature).
>>
>> The advantages have been already discussed as well as the possible
>> issues so I think we have to see now if JB's idea is feasible and how
>> hard would be to live with this while the DataSet version evolves.
>>
>> I think what we are trying to avoid is to have a long living branch
>> for a spark 2 runner based on RDD  because the maintenance burden
>> would be even worse. We would have to fight not only with the double
>> merge of fixes (in case the profile idea does not work), but also with
>> the continue evolution of Beam and we would end up in the long living
>> branch mess that others runners have dealt with (e.g. the Apex runner)
>>
>> https://lists.apache.org/thread.html/12cc086f5ffe331cc70b893
>> 22ce5416c3112b87efc3393e3e16032a2@%3Cdev.beam.apache.org%3E
>>
>> What do you think about this Amit ? Would you be ok to go with it if
>> JB's profile idea proves to help with the msintenance issues ?
>>
>> Ismaël
>>
>>
>>
>> On Wed, Mar 22, 2017 at 5:53 PM, Ted Yu  wrote:
>>
>>> hbase-spark module doesn't use SparkSession. So situation there is
>>> simpler
>>> :-)
>>>
>>> On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela  wrote:
>>>
>>> I'm still wondering how we'll do this - it's not just different
 implementations of the same Class, but a completely different concepts
 such
 as using SparkSession in Spark 2 instead of
 SparkContext/StreamingContext
 in Spark 1.

 On Tue, Mar 21, 2017 at 7:25 PM Ted Yu  wrote:

 I have done some work over in HBASE-16179 where compatibility modules
> are
> created to isolate changes in Spark 2.x API so that code in hbase-spark
> module can be reused.
>
> FYI
>
>

> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Beam spark 2.x runner status

2017-03-22 Thread Jean-Baptiste Onofré

Hi guys,

Ismaël summarized well what I have in mind.

I'm a bit late on the PoC around that (I started a branch already).
I will move forward over the weekend.

Regards
JB

On 03/22/2017 11:42 PM, Ismaël Mejía wrote:

Amit, I suppose JB is talking about the RDD based version, so no need
to worry about SparkSession or different incompatible APIs.

Remember the idea we are discussing is to have in master both the
spark 1 and spark 2 runners using the RDD based translation. At the
same time we can have a feature branch to evolve the DataSet based
translator (this one will replace the RDD based translator for spark 2
once it is mature).

The advantages have already been discussed, as well as the possible
issues, so I think we have to see now whether JB's idea is feasible and how
hard it would be to live with this while the DataSet version evolves.

I think what we are trying to avoid is to have a long-living branch
for a Spark 2 runner based on RDD, because the maintenance burden
would be even worse. We would have to fight not only with the double
merge of fixes (in case the profile idea does not work), but also with
the continued evolution of Beam, and we would end up in the long-living
branch mess that other runners have dealt with (e.g. the Apex runner):

https://lists.apache.org/thread.html/12cc086f5ffe331cc70b89322ce5416c3112b87efc3393e3e16032a2@%3Cdev.beam.apache.org%3E

What do you think about this, Amit? Would you be OK to go with it if
JB's profile idea proves to help with the maintenance issues?

Ismaël



On Wed, Mar 22, 2017 at 5:53 PM, Ted Yu  wrote:

hbase-spark module doesn't use SparkSession. So situation there is simpler
:-)

On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela  wrote:


I'm still wondering how we'll do this - it's not just different
implementations of the same class, but completely different concepts such
as using SparkSession in Spark 2 instead of SparkContext/StreamingContext
in Spark 1.

On Tue, Mar 21, 2017 at 7:25 PM Ted Yu  wrote:


I have done some work over in HBASE-16179 where compatibility modules are
created to isolate changes in Spark 2.x API so that code in hbase-spark
module can be reused.

FYI





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Beam spark 2.x runner status

2017-03-22 Thread Ted Yu
The hbase-spark module doesn't use SparkSession, so the situation there is simpler
:-)

On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela  wrote:

> I'm still wondering how we'll do this - it's not just different
> implementations of the same Class, but a completely different concepts such
> as using SparkSession in Spark 2 instead of SparkContext/StreamingContext
> in Spark 1.
>
> On Tue, Mar 21, 2017 at 7:25 PM Ted Yu  wrote:
>
> > I have done some work over in HBASE-16179 where compatibility modules are
> > created to isolate changes in Spark 2.x API so that code in hbase-spark
> > module can be reused.
> >
> > FYI
> >
>


Re: Beam spark 2.x runner status

2017-03-21 Thread Ted Yu
I have done some work over in HBASE-16179, where compatibility modules are
created to isolate changes in the Spark 2.x API so that code in the hbase-spark
module can be reused.

FYI


Re: Beam spark 2.x runner status

2017-03-16 Thread Jean-Baptiste Onofré

Hi guys,

Yes, I started to experiment with the profiles a bit, and Amit and I plan to
discuss that over the weekend.


Give me some time to move forward a bit and I will get back to you with more 
details.


Regards
JB

On 03/16/2017 05:15 PM, amarouni wrote:

Yeah maintaining 2 RDD branches (master + 2.x branch) is doable but will
add more maintenance merge work.

The maven profiles solution is worth investigating, with Spark 1.6 RDD
as the default profile and an additional Spark 2.x profile.

As JBO mentioned carbondata I had a quick look and it looks like an good
solution :
https://github.com/apache/incubator-carbondata/blob/master/pom.xml#L347

What do you think ?

Abbass,

On 16/03/2017 07:00, Cody Innowhere wrote:

I'm personally in favor of maintaining one single branch, e.g.,
spark-runner, which supports both Spark 1.6 & 2.1.
Since there's currently no DataFrame support in spark 1.x runner, there
should be no conflicts if we put two versions of Spark into one runner.

I'm also +1 for adding adapters in the branch to support both Spark
versions.

Also, we can have two translators, say, 1.x translator which translates
into RDDs & DataStreams and 2.x translator which translates into DataSets.

On Thu, Mar 16, 2017 at 9:33 AM, Jean-Baptiste Onofré 
wrote:


Hi guys,

sorry, due to the time zone shift, I answer a bit late ;)

I think we can have the same runner dealing with the two major Spark
version, introducing some adapters. For instance, in CarbonData, we created
some adapters to work with Spark 1?5, Spark 1.6 and Spark 2.1. The
dependencies come from Maven profiles. Of course, it's easier there as it's
more "user" code.

My proposal is just it's worth to try ;)

I just created a branch to experiment a bit and have more details.

Regards
JB


On 03/16/2017 02:31 AM, Amit Sela wrote:


I answered inline to Abbass' comment, but I think he hit something - how
about we have a branch with those adaptations ? same RDD implementation,
but depending on the latest 2.x version with the minimal changes required.
I'd be happy to do that, or guide anyone who wants to (I did most of it on
my branch for Spark 2 anyway) but since it's a branch and not on master (I
don't believe it "deserves" a place on master), it would always be a bit
behind since we would have to rebase and merge once in a while.

How does that sound ?

On Wed, Mar 15, 2017 at 7:49 PM amarouni  wrote:

+1 for Spark runners based on different APIs RDD/Dataset and keeping the

Spark versions as a deployment dependency.

The RDD API is stable & mature enough so it makes sense to have it on
master, the Dataset API still have some work to do and from our own
experience it just reached a comparable RDD API performance. The
community is clearly heading in the Dataset API direction but the RDD
API is still a viable option for most use cases.

Just one quick question, today on master can we swap Spark 1.x by Spark
2.x  and compile and use the Spark Runner ?

Good question!

I think this is the root cause of this problem - Spark 2 not only
introduced a new API, but also broke a few such as: context is now
session,
Accumulators are AccumulatorV2, and this is what I recall right now.
I don't think it's to hard to adapt those, and anyone who wants to could
see how I did it on my branch:
https://github.com/amitsela/beam/commit/8a1cf889d14d2b47e9e3
5bae742d78a290cbbdc9




Thanks,

Abbass,


On 15/03/2017 17:57, Amit Sela wrote:


So you're suggesting we copy-paste the current runner and adapt whatever


is


necessary so it runs with Spark 2 ?
This also means any bug-fix / improvement would have to be maintained in
two runners, and I wouldn't wanna do that.

I don't like to think in terms of Spark1/2 but in terms of RDD/Dataset


API.


Since the RDD API is mature, it should be the runner in master (not
preventing another runner once Dataset API is mature enough) and the
version (1.6.3 or 2.x) should be determined by the common installation.

That's why I believe we still need to leave things as they are, but
start
working on the Dataset API runner.
Otherwise, we'll have the current runner, another RDD API runner with


Spark


2, and a third one for the Dataset API. I don't want to maintain all of
them. It's a mess.

On Wed, Mar 15, 2017 at 6:39 PM Ismaël Mejía  wrote:

However, I do feel that we should use the Dataset API, starting with

batch


support first. WDYT ?


Well, this is the exact current status quo, and it will take us some
time to have something as complete as what we have with the spark 1
runner for the spark 2.

The other proposal has two advantages:

One is that we can leverage on the existing implementation (with the
needed adjustments) to run Beam pipelines on Spark 2, in the end final
users don’t care so much if pipelines are translated via RDD/DStream
or Dataset, they just want to know that with Beam they can run their
code in their favorite data processing framework.


Re: Beam spark 2.x runner status

2017-03-16 Thread amarouni
Yeah, maintaining two RDD branches (master + a 2.x branch) is doable but will
add more maintenance/merge work.

The Maven profiles solution is worth investigating, with Spark 1.6 RDD
as the default profile and an additional Spark 2.x profile.

As JBO mentioned CarbonData, I had a quick look and it looks like a good
solution:
https://github.com/apache/incubator-carbondata/blob/master/pom.xml#L347

What do you think ?

Abbass,

On 16/03/2017 07:00, Cody Innowhere wrote:
> I'm personally in favor of maintaining one single branch, e.g.,
> spark-runner, which supports both Spark 1.6 & 2.1.
> Since there's currently no DataFrame support in spark 1.x runner, there
> should be no conflicts if we put two versions of Spark into one runner.
>
> I'm also +1 for adding adapters in the branch to support both Spark
> versions.
>
> Also, we can have two translators, say, 1.x translator which translates
> into RDDs & DataStreams and 2.x translator which translates into DataSets.
>
> On Thu, Mar 16, 2017 at 9:33 AM, Jean-Baptiste Onofré 
> wrote:
>
>> Hi guys,
>>
>> sorry, due to the time zone shift, I answer a bit late ;)
>>
>> I think we can have the same runner dealing with the two major Spark
>> version, introducing some adapters. For instance, in CarbonData, we created
>> some adapters to work with Spark 1?5, Spark 1.6 and Spark 2.1. The
>> dependencies come from Maven profiles. Of course, it's easier there as it's
>> more "user" code.
>>
>> My proposal is just it's worth to try ;)
>>
>> I just created a branch to experiment a bit and have more details.
>>
>> Regards
>> JB
>>
>>
>> On 03/16/2017 02:31 AM, Amit Sela wrote:
>>
>>> I answered inline to Abbass' comment, but I think he hit something - how
>>> about we have a branch with those adaptations ? same RDD implementation,
>>> but depending on the latest 2.x version with the minimal changes required.
>>> I'd be happy to do that, or guide anyone who wants to (I did most of it on
>>> my branch for Spark 2 anyway) but since it's a branch and not on master (I
>>> don't believe it "deserves" a place on master), it would always be a bit
>>> behind since we would have to rebase and merge once in a while.
>>>
>>> How does that sound ?
>>>
>>> On Wed, Mar 15, 2017 at 7:49 PM amarouni  wrote:
>>>
>>> +1 for Spark runners based on different APIs RDD/Dataset and keeping the
 Spark versions as a deployment dependency.

 The RDD API is stable & mature enough so it makes sense to have it on
 master, the Dataset API still have some work to do and from our own
 experience it just reached a comparable RDD API performance. The
 community is clearly heading in the Dataset API direction but the RDD
 API is still a viable option for most use cases.

 Just one quick question, today on master can we swap Spark 1.x by Spark
 2.x  and compile and use the Spark Runner ?

 Good question!
>>> I think this is the root cause of this problem - Spark 2 not only
>>> introduced a new API, but also broke a few such as: context is now
>>> session,
>>> Accumulators are AccumulatorV2, and this is what I recall right now.
>>> I don't think it's to hard to adapt those, and anyone who wants to could
>>> see how I did it on my branch:
>>> https://github.com/amitsela/beam/commit/8a1cf889d14d2b47e9e3
>>> 5bae742d78a290cbbdc9
>>>
>>>
>>>
 Thanks,

 Abbass,


 On 15/03/2017 17:57, Amit Sela wrote:

> So you're suggesting we copy-paste the current runner and adapt whatever
>
 is

> necessary so it runs with Spark 2 ?
> This also means any bug-fix / improvement would have to be maintained in
> two runners, and I wouldn't wanna do that.
>
> I don't like to think in terms of Spark1/2 but in terms of RDD/Dataset
>
 API.

> Since the RDD API is mature, it should be the runner in master (not
> preventing another runner once Dataset API is mature enough) and the
> version (1.6.3 or 2.x) should be determined by the common installation.
>
> That's why I believe we still need to leave things as they are, but
> start
> working on the Dataset API runner.
> Otherwise, we'll have the current runner, another RDD API runner with
>
 Spark

> 2, and a third one for the Dataset API. I don't want to maintain all of
> them. It's a mess.
>
> On Wed, Mar 15, 2017 at 6:39 PM Ismaël Mejía  wrote:
>
> However, I do feel that we should use the Dataset API, starting with
>> batch
>>
>>> support first. WDYT ?
>>>
>> Well, this is the exact current status quo, and it will take us some
>> time to have something as complete as what we have with the spark 1
>> runner for the spark 2.
>>
>> The other proposal has two advantages:
>>
>> One is that we can leverage on the existing implementation (with the
>> needed adjustments) to run Beam pipelines on Spark 

Re: Beam spark 2.x runner status

2017-03-16 Thread Cody Innowhere
I'm personally in favor of maintaining one single branch, e.g.,
spark-runner, which supports both Spark 1.6 & 2.1.
Since there's currently no DataFrame support in the Spark 1.x runner, there
should be no conflicts if we put two versions of Spark into one runner.

I'm also +1 for adding adapters in the branch to support both Spark
versions.

Also, we can have two translators, say, a 1.x translator which translates
into RDDs & DStreams and a 2.x translator which translates into DataSets.
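
A very rough sketch of that split, with made-up names (this is not the runner's actual translation layer): one translation seam with an RDD/DStream-based implementation and a Dataset-based one, chosen according to the Spark API being targeted.

import org.apache.beam.sdk.Pipeline;

/** Hypothetical seam between the runner and the Spark API it targets. */
interface TranslationLayer {
  void translate(Pipeline pipeline);
}

/** 1.x-style translation: batch to RDDs, streaming to DStreams. */
class RddDStreamTranslation implements TranslationLayer {
  @Override
  public void translate(Pipeline pipeline) {
    // walk the Beam pipeline and emit RDD / DStream transformations
    throw new UnsupportedOperationException("sketch only");
  }
}

/** 2.x-style translation: batch and streaming via Datasets / Structured Streaming. */
class DatasetTranslation implements TranslationLayer {
  @Override
  public void translate(Pipeline pipeline) {
    // walk the Beam pipeline and emit Dataset transformations
    throw new UnsupportedOperationException("sketch only");
  }
}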

On Thu, Mar 16, 2017 at 9:33 AM, Jean-Baptiste Onofré 
wrote:

> Hi guys,
>
> sorry, due to the time zone shift, I answer a bit late ;)
>
> I think we can have the same runner dealing with the two major Spark
> version, introducing some adapters. For instance, in CarbonData, we created
> some adapters to work with Spark 1?5, Spark 1.6 and Spark 2.1. The
> dependencies come from Maven profiles. Of course, it's easier there as it's
> more "user" code.
>
> My proposal is just it's worth to try ;)
>
> I just created a branch to experiment a bit and have more details.
>
> Regards
> JB
>
>
> On 03/16/2017 02:31 AM, Amit Sela wrote:
>
>> I answered inline to Abbass' comment, but I think he hit something - how
>> about we have a branch with those adaptations ? same RDD implementation,
>> but depending on the latest 2.x version with the minimal changes required.
>> I'd be happy to do that, or guide anyone who wants to (I did most of it on
>> my branch for Spark 2 anyway) but since it's a branch and not on master (I
>> don't believe it "deserves" a place on master), it would always be a bit
>> behind since we would have to rebase and merge once in a while.
>>
>> How does that sound ?
>>
>> On Wed, Mar 15, 2017 at 7:49 PM amarouni  wrote:
>>
>> +1 for Spark runners based on different APIs RDD/Dataset and keeping the
>>> Spark versions as a deployment dependency.
>>>
>>> The RDD API is stable & mature enough so it makes sense to have it on
>>> master, the Dataset API still have some work to do and from our own
>>> experience it just reached a comparable RDD API performance. The
>>> community is clearly heading in the Dataset API direction but the RDD
>>> API is still a viable option for most use cases.
>>>
>>> Just one quick question, today on master can we swap Spark 1.x by Spark
>>> 2.x  and compile and use the Spark Runner ?
>>>
>>> Good question!
>> I think this is the root cause of this problem - Spark 2 not only
>> introduced a new API, but also broke a few such as: context is now
>> session,
>> Accumulators are AccumulatorV2, and this is what I recall right now.
>> I don't think it's to hard to adapt those, and anyone who wants to could
>> see how I did it on my branch:
>> https://github.com/amitsela/beam/commit/8a1cf889d14d2b47e9e3
>> 5bae742d78a290cbbdc9
>>
>>
>>
>>> Thanks,
>>>
>>> Abbass,
>>>
>>>
>>> On 15/03/2017 17:57, Amit Sela wrote:
>>>
 So you're suggesting we copy-paste the current runner and adapt whatever

>>> is
>>>
 necessary so it runs with Spark 2 ?
 This also means any bug-fix / improvement would have to be maintained in
 two runners, and I wouldn't wanna do that.

 I don't like to think in terms of Spark1/2 but in terms of RDD/Dataset

>>> API.
>>>
 Since the RDD API is mature, it should be the runner in master (not
 preventing another runner once Dataset API is mature enough) and the
 version (1.6.3 or 2.x) should be determined by the common installation.

 That's why I believe we still need to leave things as they are, but
 start
 working on the Dataset API runner.
 Otherwise, we'll have the current runner, another RDD API runner with

>>> Spark
>>>
 2, and a third one for the Dataset API. I don't want to maintain all of
 them. It's a mess.

 On Wed, Mar 15, 2017 at 6:39 PM Ismaël Mejía  wrote:

 However, I do feel that we should use the Dataset API, starting with
>>
> batch
>
>> support first. WDYT ?
>>
> Well, this is the exact current status quo, and it will take us some
> time to have something as complete as what we have with the spark 1
> runner for the spark 2.
>
> The other proposal has two advantages:
>
> One is that we can leverage on the existing implementation (with the
> needed adjustments) to run Beam pipelines on Spark 2, in the end final
> users don’t care so much if pipelines are translated via RDD/DStream
> or Dataset, they just want to know that with Beam they can run their
> code in their favorite data processing framework.
>
> The other advantage is that we can base the work on the latest spark
> version and advance simultaneously in translators for both APIs, and
> once we consider that the DataSet is mature enough we can stop
> maintaining the RDD one and make it the official one.
>
> The only missing piece is backporting new developments on the RDD

Re: Beam spark 2.x runner status

2017-03-15 Thread Jean-Baptiste Onofré

Hi guys,

sorry, due to the time zone shift, I answer a bit late ;)

I think we can have the same runner dealing with the two major Spark versions,
introducing some adapters. For instance, in CarbonData, we created some adapters
to work with Spark 1.5, Spark 1.6 and Spark 2.1. The dependencies come from
Maven profiles. Of course, it's easier there as it's more "user" code.

My proposal is just that it's worth a try ;)

I just created a branch to experiment a bit and have more details.

Regards
JB

On 03/16/2017 02:31 AM, Amit Sela wrote:

I answered inline to Abbass' comment, but I think he hit something - how
about we have a branch with those adaptations ? same RDD implementation,
but depending on the latest 2.x version with the minimal changes required.
I'd be happy to do that, or guide anyone who wants to (I did most of it on
my branch for Spark 2 anyway) but since it's a branch and not on master (I
don't believe it "deserves" a place on master), it would always be a bit
behind since we would have to rebase and merge once in a while.

How does that sound ?

On Wed, Mar 15, 2017 at 7:49 PM amarouni  wrote:


+1 for Spark runners based on different APIs RDD/Dataset and keeping the
Spark versions as a deployment dependency.

The RDD API is stable & mature enough so it makes sense to have it on
master, the Dataset API still has some work to do and from our own
experience it just reached a comparable RDD API performance. The
community is clearly heading in the Dataset API direction but the RDD
API is still a viable option for most use cases.

Just one quick question, today on master can we swap Spark 1.x by Spark
2.x  and compile and use the Spark Runner ?


Good question!
I think this is the root cause of this problem - Spark 2 not only
introduced a new API, but also broke a few such as: context is now session,
Accumulators are AccumulatorV2, and this is what I recall right now.
I don't think it's to hard to adapt those, and anyone who wants to could
see how I did it on my branch:
https://github.com/amitsela/beam/commit/8a1cf889d14d2b47e9e35bae742d78a290cbbdc9




Thanks,

Abbass,


On 15/03/2017 17:57, Amit Sela wrote:

So you're suggesting we copy-paste the current runner and adapt whatever is
necessary so it runs with Spark 2 ?
This also means any bug-fix / improvement would have to be maintained in
two runners, and I wouldn't wanna do that.

I don't like to think in terms of Spark1/2 but in terms of RDD/Dataset API.
Since the RDD API is mature, it should be the runner in master (not
preventing another runner once Dataset API is mature enough) and the
version (1.6.3 or 2.x) should be determined by the common installation.

That's why I believe we still need to leave things as they are, but start
working on the Dataset API runner.
Otherwise, we'll have the current runner, another RDD API runner with Spark
2, and a third one for the Dataset API. I don't want to maintain all of
them. It's a mess.

On Wed, Mar 15, 2017 at 6:39 PM Ismaël Mejía  wrote:


However, I do feel that we should use the Dataset API, starting with
batch support first. WDYT ?

Well, this is the exact current status quo, and it will take us some
time to have something as complete as what we have with the spark 1
runner for the spark 2.

The other proposal has two advantages:

One is that we can leverage on the existing implementation (with the
needed adjustments) to run Beam pipelines on Spark 2, in the end final
users don’t care so much if pipelines are translated via RDD/DStream
or Dataset, they just want to know that with Beam they can run their
code in their favorite data processing framework.

The other advantage is that we can base the work on the latest spark
version and advance simultaneously in translators for both APIs, and
once we consider that the DataSet is mature enough we can stop
maintaining the RDD one and make it the official one.

The only missing piece is backporting new developments on the RDD
based translator from the spark 2 version into the spark 1, but maybe
this won’t be so hard if we consider what you said, that at this point
we are getting closer to have streaming right (of course you are the
most appropriate person to decide if we are in a sufficient good shape
to make this, so backporting things won’t be so hard).

Finally I agree with you, I would prefer a nice and full featured
translator based on the Structured Streaming API but the question is
how much time this will take to be in shape and the impact on final
users who are already requesting this. This is the reason why I think
the more conservative approach (keeping around the RDD translator) and
moving incrementally makes sense.

On Wed, Mar 15, 2017 at 4:52 PM, Amit Sela  wrote:

I feel that as we're getting closer to supporting streaming with Spark 1
runner, and having Structured Streaming advance in Spark 2, we could start
work on Spark 2 runner in a separate branch.


Re: Beam spark 2.x runner status

2017-03-15 Thread Amit Sela
I answered inline to Abbass' comment, but I think he hit something - how
about we have a branch with those adaptations ? same RDD implementation,
but depending on the latest 2.x version with the minimal changes required.
I'd be happy to do that, or guide anyone who wants to (I did most of it on
my branch for Spark 2 anyway) but since it's a branch and not on master (I
don't believe it "deserves" a place on master), it would always be a bit
behind since we would have to rebase and merge once in a while.

How does that sound ?

On Wed, Mar 15, 2017 at 7:49 PM amarouni  wrote:

> +1 for Spark runners based on different APIs RDD/Dataset and keeping the
> Spark versions as a deployment dependency.
>
> The RDD API is stable & mature enough so it makes sense to have it on
> master; the Dataset API still has some work to do and, from our own
> experience, it has only just reached performance comparable to the RDD
> API. The community is clearly heading in the Dataset API direction but
> the RDD API is still a viable option for most use cases.
>
> Just one quick question: today on master, can we swap Spark 1.x for
> Spark 2.x and compile and use the Spark Runner ?
>
Good question!
I think this is the root cause of this problem - Spark 2 not only
introduced a new API, but also broke a few existing ones: the context is now a
session, Accumulators are now AccumulatorV2, and that is what I recall right now.
I don't think it's too hard to adapt those, and anyone who wants to could
see how I did it on my branch:
https://github.com/amitsela/beam/commit/8a1cf889d14d2b47e9e35bae742d78a290cbbdc9
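
As a rough, self-contained illustration of those two breaks (not code taken from the branch above), this is roughly what the Spark 2.x entry point and counters look like, with the Spark 1.x equivalents noted in comments; the application name and counter name are made up:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object Spark2ApiSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("spark2-api-sketch").setMaster("local[*]")

    // Spark 1.x: val sc = new SparkContext(conf)
    // Spark 2.x: the SparkSession is the entry point and the context hangs off it.
    val session = SparkSession.builder().config(conf).getOrCreate()
    val sc = session.sparkContext

    // Spark 1.x: val counter = sc.accumulator(0L, "records")  (Accumulator, deprecated in 2.x)
    // Spark 2.x: built-in AccumulatorV2 implementations such as LongAccumulator.
    val counter = sc.longAccumulator("records")

    sc.parallelize(1 to 100).foreach(_ => counter.add(1L))
    println(s"records seen: ${counter.value}")

    session.stop()
  }
}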


>
> Thanks,
>
> Abbass,
>
>
> On 15/03/2017 17:57, Amit Sela wrote:
> > So you're suggesting we copy-paste the current runner and adapt whatever
> is
> > necessary so it runs with Spark 2 ?
> > This also means any bug-fix / improvement would have to be maintained in
> > two runners, and I wouldn't wanna do that.
> >
> > I don't like to think in terms of Spark1/2 but in terms of RDD/Dataset
> API.
> > Since the RDD API is mature, it should be the runner in master (not
> > preventing another runner once Dataset API is mature enough) and the
> > version (1.6.3 or 2.x) should be determined by the common installation.
> >
> > That's why I believe we still need to leave things as they are, but start
> > working on the Dataset API runner.
> > Otherwise, we'll have the current runner, another RDD API runner with
> Spark
> > 2, and a third one for the Dataset API. I don't want to maintain all of
> > them. It's a mess.
> >
> > On Wed, Mar 15, 2017 at 6:39 PM Ismaël Mejía  wrote:
> >
> >>> However, I do feel that we should use the Dataset API, starting with
> >> batch
> >>> support first. WDYT ?
> >> Well, this is the exact current status quo, and it will take us some
> >> time to have something as complete as what we have with the spark 1
> >> runner for the spark 2.
> >>
> >> The other proposal has two advantages:
> >>
> >> One is that we can leverage on the existing implementation (with the
> >> needed adjustments) to run Beam pipelines on Spark 2, in the end final
> >> users don’t care so much if pipelines are translated via RDD/DStream
> >> or Dataset, they just want to know that with Beam they can run their
> >> code in their favorite data processing framework.
> >>
> >> The other advantage is that we can base the work on the latest spark
> >> version and advance simultaneously in translators for both APIs, and
> >> once we consider that the DataSet is mature enough we can stop
> >> maintaining the RDD one and make it the official one.
> >>
> >> The only missing piece is backporting new developments on the RDD
> >> based translator from the spark 2 version into the spark 1, but maybe
> >> this won’t be so hard if we consider what you said, that at this point
> >> we are getting closer to have streaming right (of course you are the
> >> most appropriate person to decide if we are in a sufficient good shape
> >> to make this, so backporting things won’t be so hard).
> >>
> >> Finally I agree with you, I would prefer a nice and full featured
> >> translator based on the Structured Streaming API but the question is
> >> how much time this will take to be in shape and the impact on final
> >> users who are already requesting this. This is the reason why I think
> >> the more conservative approach (keeping around the RDD translator) and
> >> moving incrementally makes sense.
> >>
> >> On Wed, Mar 15, 2017 at 4:52 PM, Amit Sela 
> wrote:
> >>> I feel that as we're getting closer to supporting streaming with Spark
> 1
> >>> runner, and having Structured Streaming advance in Spark 2, we could
> >> start
> >>> work on Spark 2 runner in a separate branch.
> >>>
> >>> However, I do feel that we should use the Dataset API, starting with
> >> batch
> >>> support first. WDYT ?
> >>>
> >>> On Wed, Mar 15, 2017 at 5:47 PM Ismaël Mejía 
> wrote:
> >>>
> > So you propose to have the Spark 2 branch a 

Re: Beam spark 2.x runner status

2017-03-15 Thread amarouni
+1 for Spark runners based on different APIs RDD/Dataset and keeping the
Spark versions as a deployment dependency.

The RDD API is stable & mature enough so it makes sense to have it on
master; the Dataset API still has some work to do and, from our own
experience, it has only just reached performance comparable to the RDD
API. The community is clearly heading in the Dataset API direction but
the RDD API is still a viable option for most use cases.

Just one quick question: today on master, can we swap Spark 1.x for
Spark 2.x and compile and use the Spark Runner ?

Thanks,

Abbass,


On 15/03/2017 17:57, Amit Sela wrote:
> So you're suggesting we copy-paste the current runner and adapt whatever is
> necessary so it runs with Spark 2 ?
> This also means any bug-fix / improvement would have to be maintained in
> two runners, and I wouldn't wanna do that.
>
> I don't like to think in terms of Spark1/2 but in terms of RDD/Dataset API.
> Since the RDD API is mature, it should be the runner in master (not
> preventing another runner once Dataset API is mature enough) and the
> version (1.6.3 or 2.x) should be determined by the common installation.
>
> That's why I believe we still need to leave things as they are, but start
> working on the Dataset API runner.
> Otherwise, we'll have the current runner, another RDD API runner with Spark
> 2, and a third one for the Dataset API. I don't want to maintain all of
> them. It's a mess.
>
> On Wed, Mar 15, 2017 at 6:39 PM Ismaël Mejía  wrote:
>
>>> However, I do feel that we should use the Dataset API, starting with
>> batch
>>> support first. WDYT ?
>> Well, this is the exact current status quo, and it will take us some
>> time to have something as complete as what we have with the spark 1
>> runner for the spark 2.
>>
>> The other proposal has two advantages:
>>
>> One is that we can leverage on the existing implementation (with the
>> needed adjustments) to run Beam pipelines on Spark 2, in the end final
>> users don’t care so much if pipelines are translated via RDD/DStream
>> or Dataset, they just want to know that with Beam they can run their
>> code in their favorite data processing framework.
>>
>> The other advantage is that we can base the work on the latest spark
>> version and advance simultaneously in translators for both APIs, and
>> once we consider that the DataSet is mature enough we can stop
>> maintaining the RDD one and make it the official one.
>>
>> The only missing piece is backporting new developments on the RDD
>> based translator from the spark 2 version into the spark 1, but maybe
>> this won’t be so hard if we consider what you said, that at this point
>> we are getting closer to have streaming right (of course you are the
>> most appropriate person to decide if we are in a sufficient good shape
>> to make this, so backporting things won’t be so hard).
>>
>> Finally I agree with you, I would prefer a nice and full featured
>> translator based on the Structured Streaming API but the question is
>> how much time this will take to be in shape and the impact on final
>> users who are already requesting this. This is the reason why I think
>> the more conservative approach (keeping around the RDD translator) and
>> moving incrementally makes sense.
>>
>> On Wed, Mar 15, 2017 at 4:52 PM, Amit Sela  wrote:
>>> I feel that as we're getting closer to supporting streaming with Spark 1
>>> runner, and having Structured Streaming advance in Spark 2, we could
>> start
>>> work on Spark 2 runner in a separate branch.
>>>
>>> However, I do feel that we should use the Dataset API, starting with
>> batch
>>> support first. WDYT ?
>>>
>>> On Wed, Mar 15, 2017 at 5:47 PM Ismaël Mejía  wrote:
>>>
> So you propose to have the Spark 2 branch a clone of the current one
>> with
> adaptations around Context->Session, Accumulator->AccumulatorV2 etc.
 while
> still using the RDD API ?
 Yes this is exactly what I have in mind.

> I think that having another Spark runner is great if it has value,
> otherwise, let's just bump the version.
 There is value because most people are already starting to move to
 spark 2 and all Big Data distribution providers support it now, as
 well as the Cloud-based distributions (Dataproc and EMR) not like the
 last time we had this discussion.

> We could think of starting to migrate the Spark 1 runner to Spark 2
>> and
> follow with Dataset API support feature-by-feature as ot advances,
>> but I
> think most Spark installations today still run 1.X, or am I wrong ?
 No, you are right, that’s why I didn’t even mentioned removing the
 spark 1 runner, I know that having to support things for both versions
 can add additional work for us, but maybe the best approach would be
 to continue the work only in the spark 2 runner (both refining the RDD
 based translator and starting to create the Dataset one there that

Re: Beam spark 2.x runner status

2017-03-15 Thread Amit Sela
So you're suggesting we copy-paste the current runner and adapt whatever is
necessary so it runs with Spark 2 ?
This also means any bug-fix / improvement would have to be maintained in
two runners, and I wouldn't wanna do that.

I don't like to think in terms of Spark1/2 but in terms of RDD/Dataset API.
Since the RDD API is mature, it should be the runner in master (not
preventing another runner once Dataset API is mature enough) and the
version (1.6.3 or 2.x) should be determined by the common installation.

That's why I believe we still need to leave things as they are, but start
working on the Dataset API runner.
Otherwise, we'll have the current runner, another RDD API runner with Spark
2, and a third one for the Dataset API. I don't want to maintain all of
them. It's a mess.

On Wed, Mar 15, 2017 at 6:39 PM Ismaël Mejía  wrote:

> > However, I do feel that we should use the Dataset API, starting with
> batch
> > support first. WDYT ?
>
> Well, this is the exact current status quo, and it will take us some
> time to have something as complete as what we have with the spark 1
> runner for the spark 2.
>
> The other proposal has two advantages:
>
> One is that we can leverage on the existing implementation (with the
> needed adjustments) to run Beam pipelines on Spark 2, in the end final
> users don’t care so much if pipelines are translated via RDD/DStream
> or Dataset, they just want to know that with Beam they can run their
> code in their favorite data processing framework.
>
> The other advantage is that we can base the work on the latest spark
> version and advance simultaneously in translators for both APIs, and
> once we consider that the DataSet is mature enough we can stop
> maintaining the RDD one and make it the official one.
>
> The only missing piece is backporting new developments on the RDD
> based translator from the spark 2 version into the spark 1, but maybe
> this won’t be so hard if we consider what you said, that at this point
> we are getting closer to have streaming right (of course you are the
> most appropriate person to decide if we are in a sufficient good shape
> to make this, so backporting things won’t be so hard).
>
> Finally I agree with you, I would prefer a nice and full featured
> translator based on the Structured Streaming API but the question is
> how much time this will take to be in shape and the impact on final
> users who are already requesting this. This is the reason why I think
> the more conservative approach (keeping around the RDD translator) and
> moving incrementally makes sense.
>
> On Wed, Mar 15, 2017 at 4:52 PM, Amit Sela  wrote:
> > I feel that as we're getting closer to supporting streaming with Spark 1
> > runner, and having Structured Streaming advance in Spark 2, we could
> start
> > work on Spark 2 runner in a separate branch.
> >
> > However, I do feel that we should use the Dataset API, starting with
> batch
> > support first. WDYT ?
> >
> > On Wed, Mar 15, 2017 at 5:47 PM Ismaël Mejía  wrote:
> >
> >> > So you propose to have the Spark 2 branch a clone of the current one
> with
> >> > adaptations around Context->Session, Accumulator->AccumulatorV2 etc.
> >> while
> >> > still using the RDD API ?
> >>
> >> Yes this is exactly what I have in mind.
> >>
> >> > I think that having another Spark runner is great if it has value,
> >> > otherwise, let's just bump the version.
> >>
> >> There is value because most people are already starting to move to
> >> spark 2 and all Big Data distribution providers support it now, as
> >> well as the Cloud-based distributions (Dataproc and EMR) not like the
> >> last time we had this discussion.
> >>
> >> > We could think of starting to migrate the Spark 1 runner to Spark 2
> and
> >> > follow with Dataset API support feature-by-feature as ot advances,
> but I
> >> > think most Spark installations today still run 1.X, or am I wrong ?
> >>
> >> No, you are right, that’s why I didn’t even mentioned removing the
> >> spark 1 runner, I know that having to support things for both versions
> >> can add additional work for us, but maybe the best approach would be
> >> to continue the work only in the spark 2 runner (both refining the RDD
> >> based translator and starting to create the Dataset one there that
> >> co-exist until the DataSet API is mature enough) and keep the spark 1
> >> runner only for bug-fixes for the users who are still using it (like
> >> this we don’t have to keep backporting stuff). Do you see any other
> >> particular issue?
> >>
> >> Ismaël
> >>
> >> On Wed, Mar 15, 2017 at 3:39 PM, Amit Sela 
> wrote:
> >> > So you propose to have the Spark 2 branch a clone of the current one
> with
> >> > adaptations around Context->Session, Accumulator->AccumulatorV2 etc.
> >> while
> >> > still using the RDD API ?
> >> >
> >> > I think that having another Spark runner is great if it has value,
> >> > otherwise, let's just 

Re: Beam spark 2.x runner status

2017-03-15 Thread Ismaël Mejía
> However, I do feel that we should use the Dataset API, starting with batch
> support first. WDYT ?

Well, this is the exact current status quo, and it will take us some
time to have something for spark 2 as complete as what we have with the
spark 1 runner.

The other proposal has two advantages:

One is that we can leverage the existing implementation (with the
needed adjustments) to run Beam pipelines on Spark 2; in the end, final
users don’t care so much whether pipelines are translated via RDD/DStream
or Dataset, they just want to know that with Beam they can run their
code in their favorite data processing framework.

The other advantage is that we can base the work on the latest spark
version and advance simultaneously in translators for both APIs, and
once we consider that the DataSet is mature enough we can stop
maintaining the RDD one and make it the official one.

The only missing piece is backporting new developments on the RDD
based translator from the spark 2 version into the spark 1, but maybe
this won’t be so hard if we consider what you said, that at this point
we are getting closer to having streaming right (of course you are the
most appropriate person to decide if we are in sufficiently good shape
to make this, so backporting things won’t be so hard).

Finally, I agree with you: I would prefer a nice and full-featured
translator based on the Structured Streaming API, but the question is
how much time this will take to be in shape and the impact on final
users who are already requesting this. This is the reason why I think
the more conservative approach (keeping the RDD translator around) and
moving incrementally makes sense.

On Wed, Mar 15, 2017 at 4:52 PM, Amit Sela  wrote:
> I feel that as we're getting closer to supporting streaming with Spark 1
> runner, and having Structured Streaming advance in Spark 2, we could start
> work on Spark 2 runner in a separate branch.
>
> However, I do feel that we should use the Dataset API, starting with batch
> support first. WDYT ?
>
> On Wed, Mar 15, 2017 at 5:47 PM Ismaël Mejía  wrote:
>
>> > So you propose to have the Spark 2 branch a clone of the current one with
>> > adaptations around Context->Session, Accumulator->AccumulatorV2 etc.
>> while
>> > still using the RDD API ?
>>
>> Yes this is exactly what I have in mind.
>>
>> > I think that having another Spark runner is great if it has value,
>> > otherwise, let's just bump the version.
>>
>> There is value because most people are already starting to move to
>> spark 2 and all Big Data distribution providers support it now, as
>> well as the Cloud-based distributions (Dataproc and EMR) not like the
>> last time we had this discussion.
>>
>> > We could think of starting to migrate the Spark 1 runner to Spark 2 and
>> > follow with Dataset API support feature-by-feature as ot advances, but I
>> > think most Spark installations today still run 1.X, or am I wrong ?
>>
>> No, you are right, that’s why I didn’t even mentioned removing the
>> spark 1 runner, I know that having to support things for both versions
>> can add additional work for us, but maybe the best approach would be
>> to continue the work only in the spark 2 runner (both refining the RDD
>> based translator and starting to create the Dataset one there that
>> co-exist until the DataSet API is mature enough) and keep the spark 1
>> runner only for bug-fixes for the users who are still using it (like
>> this we don’t have to keep backporting stuff). Do you see any other
>> particular issue?
>>
>> Ismaël
>>
>> On Wed, Mar 15, 2017 at 3:39 PM, Amit Sela  wrote:
>> > So you propose to have the Spark 2 branch a clone of the current one with
>> > adaptations around Context->Session, Accumulator->AccumulatorV2 etc.
>> while
>> > still using the RDD API ?
>> >
>> > I think that having another Spark runner is great if it has value,
>> > otherwise, let's just bump the version.
>> > My idea of having another runner for Spark was not to support more
>> versions
>> > - we should always support the most popular version in terms of
>> > compatibility - the idea was to try and make Beam work with Structured
>> > Streaming, which is still not fully mature so that's why we're not
>> heavily
>> > investing there.
>> >
>> > We could think of starting to migrate the Spark 1 runner to Spark 2 and
>> > follow with Dataset API support feature-by-feature as ot advances, but I
>> > think most Spark installations today still run 1.X, or am I wrong ?
>> >
>> > On Wed, Mar 15, 2017 at 4:26 PM Ismaël Mejía  wrote:
>> >
>> >> BIG +1 JB,
>> >>
>> >> If we can just jump the version number with minor changes staying as
>> >> close as possible to the current implementation for spark 1 we can go
>> >> faster and offer in principle the exact same support but for version
>> >> 2.
>> >>
>> >> I know that the advanced streaming stuff based on the DataSet API
>> >> won't be there but with 

Re: Beam spark 2.x runner status

2017-03-15 Thread Amit Sela
I feel that as we're getting closer to supporting streaming with Spark 1
runner, and having Structured Streaming advance in Spark 2, we could start
work on Spark 2 runner in a separate branch.

However, I do feel that we should use the Dataset API, starting with batch
support first. WDYT ?
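
To make concrete what "the Dataset API, starting with batch" could mean, here is a purely illustrative word count written against both APIs. This is not runner or translator code, and the object and variable names are made up; it only shows, side by side, the two target shapes a batch translator could emit:

import org.apache.spark.sql.SparkSession

object DatasetVsRddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-vs-rdd").master("local[*]").getOrCreate()
    import spark.implicits._

    val lines = Seq("the quick brown fox", "jumps over the lazy dog")

    // RDD shape: what the current runner's translation targets.
    val rddCounts = spark.sparkContext
      .parallelize(lines)
      .flatMap(_.split(" "))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    // Dataset shape: what a Dataset-based batch translator could target instead.
    val dsCounts = spark.createDataset(lines)
      .flatMap(_.split(" "))
      .groupByKey(identity)
      .count()

    rddCounts.collect().foreach(println)
    dsCounts.collect().foreach(println)

    spark.stop()
  }
}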

On Wed, Mar 15, 2017 at 5:47 PM Ismaël Mejía  wrote:

> > So you propose to have the Spark 2 branch a clone of the current one with
> > adaptations around Context->Session, Accumulator->AccumulatorV2 etc.
> while
> > still using the RDD API ?
>
> Yes this is exactly what I have in mind.
>
> > I think that having another Spark runner is great if it has value,
> > otherwise, let's just bump the version.
>
> There is value because most people are already starting to move to
> spark 2 and all Big Data distribution providers support it now, as
> well as the Cloud-based distributions (Dataproc and EMR) not like the
> last time we had this discussion.
>
> > We could think of starting to migrate the Spark 1 runner to Spark 2 and
> > follow with Dataset API support feature-by-feature as ot advances, but I
> > think most Spark installations today still run 1.X, or am I wrong ?
>
> No, you are right, that’s why I didn’t even mentioned removing the
> spark 1 runner, I know that having to support things for both versions
> can add additional work for us, but maybe the best approach would be
> to continue the work only in the spark 2 runner (both refining the RDD
> based translator and starting to create the Dataset one there that
> co-exist until the DataSet API is mature enough) and keep the spark 1
> runner only for bug-fixes for the users who are still using it (like
> this we don’t have to keep backporting stuff). Do you see any other
> particular issue?
>
> Ismaël
>
> On Wed, Mar 15, 2017 at 3:39 PM, Amit Sela  wrote:
> > So you propose to have the Spark 2 branch a clone of the current one with
> > adaptations around Context->Session, Accumulator->AccumulatorV2 etc.
> while
> > still using the RDD API ?
> >
> > I think that having another Spark runner is great if it has value,
> > otherwise, let's just bump the version.
> > My idea of having another runner for Spark was not to support more
> versions
> > - we should always support the most popular version in terms of
> > compatibility - the idea was to try and make Beam work with Structured
> > Streaming, which is still not fully mature so that's why we're not
> heavily
> > investing there.
> >
> > We could think of starting to migrate the Spark 1 runner to Spark 2 and
> > follow with Dataset API support feature-by-feature as ot advances, but I
> > think most Spark installations today still run 1.X, or am I wrong ?
> >
> > On Wed, Mar 15, 2017 at 4:26 PM Ismaël Mejía  wrote:
> >
> >> BIG +1 JB,
> >>
> >> If we can just jump the version number with minor changes staying as
> >> close as possible to the current implementation for spark 1 we can go
> >> faster and offer in principle the exact same support but for version
> >> 2.
> >>
> >> I know that the advanced streaming stuff based on the DataSet API
> >> won't be there but with this common canvas the community can iterate
> >> to create a DataSet based translator at the same time. In particular I
> >> consider the most important thing is that the spark 2 branch should
> >> not live for long time, this should be merged into master really fast
> >> for the benefit of everybody.
> >>
> >> Ismaël
> >>
> >>
> >> On Wed, Mar 15, 2017 at 1:57 PM, Jean-Baptiste Onofré 
> >> wrote:
> >> > Hi Amit,
> >> >
> >> > What do you think of the following:
> >> >
> >> > - in the mean time that you reintroduce the Spark 2 branch, what about
> >> > "extending" the version in the current Spark runner ? Still using
> >> > RDD/DStream, I think we can support Spark 2.x even if we don't yet
> >> leverage
> >> > the new provided features.
> >> >
> >> > Thoughts ?
> >> >
> >> > Regards
> >> > JB
> >> >
> >> >
> >> > On 03/15/2017 07:39 PM, Amit Sela wrote:
> >> >>
> >> >> Hi Cody,
> >> >>
> >> >> I will re-introduce this branch soon as part of the work on BEAM-913
> >> >> .
> >> >> For now, and from previous experience with the mentioned branch,
> batch
> >> >> implementation should be straight-forward.
> >> >> Only issue is with streaming support - in the current runner (Spark
> 1.x)
> >> >> we
> >> >> have experimental support for windows/triggers and we're working
> towards
> >> >> full streaming support.
> >> >> With Spark 2.x, there is no "general-purpose" stateful operator for
> the
> >> >> Dataset API, so I was waiting to see if the new operator
> >> >>  planned for next
> version
> >> >> could
> >> >> help with that.
> >> >>
> >> >> To summarize, I will introduce a skeleton for the Spark 2 runner with
> >> >> batch
> >> >> support as soon as I can as a separate branch.
> >> >>

Re: Beam spark 2.x runner status

2017-03-15 Thread Ismaël Mejía
> So you propose to have the Spark 2 branch a clone of the current one with
> adaptations around Context->Session, Accumulator->AccumulatorV2 etc. while
> still using the RDD API ?

Yes this is exactly what I have in mind.

> I think that having another Spark runner is great if it has value,
> otherwise, let's just bump the version.

There is value because most people are already starting to move to
spark 2 and all Big Data distribution providers support it now, as
well as the Cloud-based distributions (Dataproc and EMR), unlike the
last time we had this discussion.

> We could think of starting to migrate the Spark 1 runner to Spark 2 and
> follow with Dataset API support feature-by-feature as ot advances, but I
> think most Spark installations today still run 1.X, or am I wrong ?

No, you are right, that’s why I didn’t even mention removing the
spark 1 runner. I know that having to support things for both versions
can add additional work for us, but maybe the best approach would be
to continue the work only in the spark 2 runner (both refining the RDD
based translator and starting to create the Dataset one there, so they
co-exist until the DataSet API is mature enough) and keep the spark 1
runner only for bug-fixes for the users who are still using it (that way
we don’t have to keep backporting stuff). Do you see any other
particular issue?

Ismaël

On Wed, Mar 15, 2017 at 3:39 PM, Amit Sela  wrote:
> So you propose to have the Spark 2 branch a clone of the current one with
> adaptations around Context->Session, Accumulator->AccumulatorV2 etc. while
> still using the RDD API ?
>
> I think that having another Spark runner is great if it has value,
> otherwise, let's just bump the version.
> My idea of having another runner for Spark was not to support more versions
> - we should always support the most popular version in terms of
> compatibility - the idea was to try and make Beam work with Structured
> Streaming, which is still not fully mature so that's why we're not heavily
> investing there.
>
> We could think of starting to migrate the Spark 1 runner to Spark 2 and
> follow with Dataset API support feature-by-feature as ot advances, but I
> think most Spark installations today still run 1.X, or am I wrong ?
>
> On Wed, Mar 15, 2017 at 4:26 PM Ismaël Mejía  wrote:
>
>> BIG +1 JB,
>>
>> If we can just jump the version number with minor changes staying as
>> close as possible to the current implementation for spark 1 we can go
>> faster and offer in principle the exact same support but for version
>> 2.
>>
>> I know that the advanced streaming stuff based on the DataSet API
>> won't be there but with this common canvas the community can iterate
>> to create a DataSet based translator at the same time. In particular I
>> consider the most important thing is that the spark 2 branch should
>> not live for long time, this should be merged into master really fast
>> for the benefit of everybody.
>>
>> Ismaël
>>
>>
>> On Wed, Mar 15, 2017 at 1:57 PM, Jean-Baptiste Onofré 
>> wrote:
>> > Hi Amit,
>> >
>> > What do you think of the following:
>> >
>> > - in the mean time that you reintroduce the Spark 2 branch, what about
>> > "extending" the version in the current Spark runner ? Still using
>> > RDD/DStream, I think we can support Spark 2.x even if we don't yet
>> leverage
>> > the new provided features.
>> >
>> > Thoughts ?
>> >
>> > Regards
>> > JB
>> >
>> >
>> > On 03/15/2017 07:39 PM, Amit Sela wrote:
>> >>
>> >> Hi Cody,
>> >>
>> >> I will re-introduce this branch soon as part of the work on BEAM-913
>> >> .
>> >> For now, and from previous experience with the mentioned branch, batch
>> >> implementation should be straight-forward.
>> >> Only issue is with streaming support - in the current runner (Spark 1.x)
>> >> we
>> >> have experimental support for windows/triggers and we're working towards
>> >> full streaming support.
>> >> With Spark 2.x, there is no "general-purpose" stateful operator for the
>> >> Dataset API, so I was waiting to see if the new operator
>> >>  planned for next version
>> >> could
>> >> help with that.
>> >>
>> >> To summarize, I will introduce a skeleton for the Spark 2 runner with
>> >> batch
>> >> support as soon as I can as a separate branch.
>> >>
>> >> Thanks,
>> >> Amit
>> >>
>> >> On Wed, Mar 15, 2017 at 9:07 AM Cody Innowhere 
>> >> wrote:
>> >>
>> >>> Hi guys,
>> >>> Is there anybody who's currently working on Spark 2.x runner? A old PR
>> >>> for
>> >>> spark 2.x runner was closed a few days ago, so I wonder what's the
>> status
>> >>> now, and is there a roadmap for this?
>> >>> Thanks~
>> >>>
>> >>
>> >
>> > --
>> > Jean-Baptiste Onofré
>> > jbono...@apache.org
>> > http://blog.nanthrax.net
>> > Talend - http://www.talend.com
>>


Re: Beam spark 2.x runner status

2017-03-15 Thread Amit Sela
So you propose to have the Spark 2 branch a clone of the current one with
adaptations around Context->Session, Accumulator->AccumulatorV2 etc. while
still using the RDD API ?

I think that having another Spark runner is great if it has value,
otherwise, let's just bump the version.
My idea of having another runner for Spark was not to support more versions
- we should always support the most popular version in terms of
compatibility - the idea was to try and make Beam work with Structured
Streaming, which is still not fully mature so that's why we're not heavily
investing there.

We could think of starting to migrate the Spark 1 runner to Spark 2 and
follow with Dataset API support feature-by-feature as it advances, but I
think most Spark installations today still run 1.X, or am I wrong ?

On Wed, Mar 15, 2017 at 4:26 PM Ismaël Mejía  wrote:

> BIG +1 JB,
>
> If we can just jump the version number with minor changes staying as
> close as possible to the current implementation for spark 1 we can go
> faster and offer in principle the exact same support but for version
> 2.
>
> I know that the advanced streaming stuff based on the DataSet API
> won't be there but with this common canvas the community can iterate
> to create a DataSet based translator at the same time. In particular I
> consider the most important thing is that the spark 2 branch should
> not live for long time, this should be merged into master really fast
> for the benefit of everybody.
>
> Ismaël
>
>
> On Wed, Mar 15, 2017 at 1:57 PM, Jean-Baptiste Onofré 
> wrote:
> > Hi Amit,
> >
> > What do you think of the following:
> >
> > - in the mean time that you reintroduce the Spark 2 branch, what about
> > "extending" the version in the current Spark runner ? Still using
> > RDD/DStream, I think we can support Spark 2.x even if we don't yet
> leverage
> > the new provided features.
> >
> > Thoughts ?
> >
> > Regards
> > JB
> >
> >
> > On 03/15/2017 07:39 PM, Amit Sela wrote:
> >>
> >> Hi Cody,
> >>
> >> I will re-introduce this branch soon as part of the work on BEAM-913
> >> .
> >> For now, and from previous experience with the mentioned branch, batch
> >> implementation should be straight-forward.
> >> Only issue is with streaming support - in the current runner (Spark 1.x)
> >> we
> >> have experimental support for windows/triggers and we're working towards
> >> full streaming support.
> >> With Spark 2.x, there is no "general-purpose" stateful operator for the
> >> Dataset API, so I was waiting to see if the new operator
> >>  planned for next version
> >> could
> >> help with that.
> >>
> >> To summarize, I will introduce a skeleton for the Spark 2 runner with
> >> batch
> >> support as soon as I can as a separate branch.
> >>
> >> Thanks,
> >> Amit
> >>
> >> On Wed, Mar 15, 2017 at 9:07 AM Cody Innowhere 
> >> wrote:
> >>
> >>> Hi guys,
> >>> Is there anybody who's currently working on Spark 2.x runner? A old PR
> >>> for
> >>> spark 2.x runner was closed a few days ago, so I wonder what's the
> status
> >>> now, and is there a roadmap for this?
> >>> Thanks~
> >>>
> >>
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
>


Re: Beam spark 2.x runner status

2017-03-15 Thread Ismaël Mejía
BIG +1 JB,

If we can just bump the version number with minor changes, staying as
close as possible to the current implementation for spark 1, we can go
faster and offer in principle the exact same support, but for version
2.

I know that the advanced streaming stuff based on the DataSet API
won't be there, but with this common canvas the community can iterate
to create a DataSet based translator at the same time. In particular, I
consider the most important thing to be that the spark 2 branch should
not live for a long time; it should be merged into master really fast
for the benefit of everybody.

Ismaël


On Wed, Mar 15, 2017 at 1:57 PM, Jean-Baptiste Onofré  wrote:
> Hi Amit,
>
> What do you think of the following:
>
> - in the mean time that you reintroduce the Spark 2 branch, what about
> "extending" the version in the current Spark runner ? Still using
> RDD/DStream, I think we can support Spark 2.x even if we don't yet leverage
> the new provided features.
>
> Thoughts ?
>
> Regards
> JB
>
>
> On 03/15/2017 07:39 PM, Amit Sela wrote:
>>
>> Hi Cody,
>>
>> I will re-introduce this branch soon as part of the work on BEAM-913
>> .
>> For now, and from previous experience with the mentioned branch, batch
>> implementation should be straight-forward.
>> Only issue is with streaming support - in the current runner (Spark 1.x)
>> we
>> have experimental support for windows/triggers and we're working towards
>> full streaming support.
>> With Spark 2.x, there is no "general-purpose" stateful operator for the
>> Dataset API, so I was waiting to see if the new operator
>>  planned for next version
>> could
>> help with that.
>>
>> To summarize, I will introduce a skeleton for the Spark 2 runner with
>> batch
>> support as soon as I can as a separate branch.
>>
>> Thanks,
>> Amit
>>
>> On Wed, Mar 15, 2017 at 9:07 AM Cody Innowhere 
>> wrote:
>>
>>> Hi guys,
>>> Is there anybody who's currently working on Spark 2.x runner? A old PR
>>> for
>>> spark 2.x runner was closed a few days ago, so I wonder what's the status
>>> now, and is there a roadmap for this?
>>> Thanks~
>>>
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: Beam spark 2.x runner status

2017-03-15 Thread Jean-Baptiste Onofré

Hi Amit,

What do you think of the following:

- in the meantime, while you reintroduce the Spark 2 branch, what about
"extending" the version supported by the current Spark runner ? Still using RDD/DStream, I
think we can support Spark 2.x even if we don't yet leverage the newly provided
features.


Thoughts ?

Regards
JB

On 03/15/2017 07:39 PM, Amit Sela wrote:

Hi Cody,

I will re-introduce this branch soon as part of the work on BEAM-913
.
For now, and from previous experience with the mentioned branch, batch
implementation should be straight-forward.
Only issue is with streaming support - in the current runner (Spark 1.x) we
have experimental support for windows/triggers and we're working towards
full streaming support.
With Spark 2.x, there is no "general-purpose" stateful operator for the
Dataset API, so I was waiting to see if the new operator
 planned for next version could
help with that.

To summarize, I will introduce a skeleton for the Spark 2 runner with batch
support as soon as I can as a separate branch.

Thanks,
Amit

On Wed, Mar 15, 2017 at 9:07 AM Cody Innowhere  wrote:


Hi guys,
Is there anybody who's currently working on the Spark 2.x runner? An old PR for
spark 2.x runner was closed a few days ago, so I wonder what's the status
now, and is there a roadmap for this?
Thanks~





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Beam spark 2.x runner status

2017-03-15 Thread Amit Sela
Hi Cody,

I will re-introduce this branch soon as part of the work on BEAM-913
.
For now, and from previous experience with the mentioned branch, batch
implementation should be straightforward.
The only issue is with streaming support - in the current runner (Spark 1.x) we
have experimental support for windows/triggers and we're working towards
full streaming support.
With Spark 2.x, there is no "general-purpose" stateful operator for the
Dataset API, so I was waiting to see if the new operator
 planned for next version could
help with that.
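
For context, here is a toy sketch of the kind of per-key state such an operator enables, assuming the operator referred to above is the one that later shipped as mapGroupsWithState in Spark 2.2; the Event type and the running count are made up for the example and this is nothing like the runner's actual state handling:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.GroupState

object StatefulOperatorSketch {
  // Made-up event type for the example.
  case class Event(key: String, value: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stateful-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val events = spark.createDataset(Seq(Event("a", 1L), Event("a", 2L), Event("b", 5L)))

    // Keep an arbitrary per-key running count in user-managed state (requires Spark 2.2+).
    val runningCounts = events
      .groupByKey(_.key)
      .mapGroupsWithState[Long, (String, Long)] { (key, values, state: GroupState[Long]) =>
        val updated = state.getOption.getOrElse(0L) + values.size
        state.update(updated)
        (key, updated)
      }

    runningCounts.show()
    spark.stop()
  }
}

Whether that operator is general-purpose enough for Beam's state, timers and triggers is exactly the open question here.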

To summarize, I will introduce a skeleton for the Spark 2 runner with batch
support as soon as I can as a separate branch.

Thanks,
Amit

On Wed, Mar 15, 2017 at 9:07 AM Cody Innowhere  wrote:

> Hi guys,
> Is there anybody who's currently working on the Spark 2.x runner? An old PR for
> spark 2.x runner was closed a few days ago, so I wonder what's the status
> now, and is there a roadmap for this?
> Thanks~
>