@Ted definitely makes sense.
@JB I'm merging https://github.com/apache/beam/pull/2354 soon so any
deprecated Spark API issues should be resolved.

On Wed, Mar 29, 2017 at 2:46 PM Ted Yu <yuzhih...@gmail.com> wrote:

> This is what I did in HBASE-16179:
>
> -        f.call((asJavaIterator(it), conn)).iterator()
> +        // the return type is different in spark 1.x & 2.x, we handle both cases
> +        f.call(asJavaIterator(it), conn) match {
> +          // spark 1.x
> +          case iterable: Iterable[R] => iterable.iterator()
> +          // spark 2.x
> +          case iterator: Iterator[R] => iterator
> +        }
>        )
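>
> A standalone sketch of the same compatibility trick, in Java (illustrative
> only - the helper name is made up and this is not the actual HBASE-16179
> code): normalize whatever the user function returns into an Iterator, so
> the caller compiles against both the Spark 1.x (Iterable) and the Spark
> 2.x (Iterator) contracts:
>
>     import java.util.Iterator;
>
>     public final class IteratorCompat {
>       private IteratorCompat() {}
>
>       // Takes the raw result of f.call(...) and returns an Iterator<R>,
>       // whichever Spark version's interface it was compiled against.
>       @SuppressWarnings("unchecked")
>       public static <R> Iterator<R> toIterator(Object result) {
>         if (result instanceof Iterable) {        // Spark 1.x: Iterable<R>
>           return ((Iterable<R>) result).iterator();
>         }
>         return (Iterator<R>) result;             // Spark 2.x: Iterator<R>
>       }
>     }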
>
> FYI
>
> On Wed, Mar 29, 2017 at 1:47 AM, Amit Sela <amitsel...@gmail.com> wrote:
>
> > Just tried to replace dependencies and see what happens:
> >
> > Most required changes are about the runner using deprecated Spark APIs,
> > and after fixing them the only real issue is with the Java API for
> > Pair/FlatMapFunction, which changed its return value to Iterator (in 1.6
> > it's Iterable).
> >
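> > For reference, a minimal illustration of that signature change (the
> > method names here are made up; the interface is
> > org.apache.spark.api.java.function.FlatMapFunction in both versions):
> >
> >     import java.util.Arrays;
> >     import java.util.Iterator;
> >
> >     public class FlatMapSignatures {
> >       // Spark 1.6.x contract: call(T) returns Iterable<R>
> >       static Iterable<String> callV1(String line) {
> >         return Arrays.asList(line.split(" "));
> >       }
> >
> >       // Spark 2.x contract: call(T) returns Iterator<R>
> >       static Iterator<String> callV2(String line) {
> >         return Arrays.asList(line.split(" ")).iterator();
> >       }
> >     }
> >
> > The same body cannot satisfy both signatures, which is why swapping the
> > dependency alone does not compile.
> >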
> > So I'm not sure that a profile that simply sets the dependency to
> > 1.6.3/2.1.0 is feasible.
> >
> > On Thu, Mar 23, 2017 at 10:22 AM Kobi Salant <kobi.sal...@gmail.com>
> > wrote:
> >
> > > So, if everything is in place in Spark 2.X and we use provided
> > > dependencies for Spark in Beam, then theoretically you could run the
> > > same code on 2.X without any need for a branch?
> > >
> > > 2017-03-23 9:47 GMT+02:00 Amit Sela <amitsel...@gmail.com>:
> > >
> > > > If StreamingContext is valid and we don't have to use SparkSession,
> > > > and Accumulators are valid as well and we don't need AccumulatorV2, I
> > > > don't see a reason this shouldn't work (which means there are still
> > > > tons of reasons this could break, but I can't think of them off the
> > > > top of my head right now).
> > > >
> > > > @JB simply add a profile for the Spark dependencies and run the
> > > > tests - you'll have a very definitive answer ;-).
> > > > If this passes, try on a cluster running Spark 2 as well.
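> > > >
> > > > Something along these lines in the runner pom might do it (a rough
> > > > sketch with made-up property names, not taken from the actual Beam
> > > > pom):
> > > >
> > > >     <properties>
> > > >       <spark.version>1.6.3</spark.version>
> > > >     </properties>
> > > >     <profiles>
> > > >       <profile>
> > > >         <id>spark2</id>
> > > >         <properties>
> > > >           <spark.version>2.1.0</spark.version>
> > > >         </properties>
> > > >       </profile>
> > > >     </profiles>
> > > >
> > > > and then run the tests with "mvn test -Pspark2".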
> > > >
> > > > Let me know if I can assist.
> > > >
> > > > On Thu, Mar 23, 2017 at 6:55 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> > > > wrote:
> > > >
> > > > > Hi guys,
> > > > >
> > > > > Ismaël summarized well what I have in mind.
> > > > >
> > > > > I'm a bit late on the PoC around that (I started a branch already).
> > > > > I will move forward over the weekend.
> > > > >
> > > > > Regards
> > > > > JB
> > > > >
> > > > > On 03/22/2017 11:42 PM, Ismaël Mejía wrote:
> > > > > > Amit, I suppose JB is talking about the RDD based version, so no
> > > > > > need to worry about SparkSession or different incompatible APIs.
> > > > > >
> > > > > > Remember, the idea we are discussing is to have in master both the
> > > > > > spark 1 and spark 2 runners using the RDD based translation. At the
> > > > > > same time we can have a feature branch to evolve the DataSet based
> > > > > > translator (this one will replace the RDD based translator for
> > > > > > spark 2 once it is mature).
> > > > > >
> > > > > > The advantages have already been discussed, as well as the
> > > > > > possible issues, so I think we have to see now whether JB's idea is
> > > > > > feasible and how hard it would be to live with this while the
> > > > > > DataSet version evolves.
> > > > > >
> > > > > > I think what we are trying to avoid is having a long living branch
> > > > > > for a spark 2 runner based on RDD, because the maintenance burden
> > > > > > would be even worse. We would have to fight not only with the
> > > > > > double merge of fixes (in case the profile idea does not work),
> > > > > > but also with the continued evolution of Beam, and we would end up
> > > > > > in the long living branch mess that other runners have dealt with
> > > > > > (e.g. the Apex runner):
> > > > > >
> > > > > > https://lists.apache.org/thread.html/12cc086f5ffe331cc70b89322ce5416c3112b87efc3393e3e16032a2@%3Cdev.beam.apache.org%3E
> > > > > >
> > > > > > What do you think about this, Amit? Would you be ok to go with it
> > > > > > if JB's profile idea proves to help with the maintenance issues?
> > > > > >
> > > > > > Ismaël
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Mar 22, 2017 at 5:53 PM, Ted Yu <yuzhih...@gmail.com>
> > > > > > wrote:
> > > > > >> The hbase-spark module doesn't use SparkSession, so the situation
> > > > > >> there is simpler :-)
> > > > > >>
> > > > > >> On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela <amitsel...@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> I'm still wondering how we'll do this - it's not just different
> > > > > >>> implementations of the same Class, but completely different
> > > > > >>> concepts, such as using SparkSession in Spark 2 instead of
> > > > > >>> SparkContext/StreamingContext in Spark 1.
> > > > > >>>
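> > > > > >>> A quick sketch of that conceptual difference (standard Spark
> > > > > >>> APIs, not Beam runner code):
> > > > > >>>
> > > > > >>>     // Spark 1.x: separate entry points per execution mode
> > > > > >>>     SparkConf conf = new SparkConf().setAppName("app");
> > > > > >>>     JavaSparkContext jsc = new JavaSparkContext(conf);
> > > > > >>>     JavaStreamingContext ssc =
> > > > > >>>         new JavaStreamingContext(jsc, Durations.seconds(1));
> > > > > >>>
> > > > > >>>     // Spark 2.x: SparkSession as the unified entry point
> > > > > >>>     SparkSession spark =
> > > > > >>>         SparkSession.builder().appName("app").getOrCreate();
> > > > > >>>     JavaSparkContext fromSession =
> > > > > >>>         new JavaSparkContext(spark.sparkContext());
> > > > > >>>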
> > > > > >>> On Tue, Mar 21, 2017 at 7:25 PM Ted Yu <yuzhih...@gmail.com>
> > > > > >>> wrote:
> > > > > >>>
> > > > > >>>> I have done some work over in HBASE-16179, where compatibility
> > > > > >>>> modules are created to isolate changes in the Spark 2.x API so
> > > > > >>>> that code in the hbase-spark module can be reused.
> > > > > >>>>
> > > > > >>>> FYI
> > > > > >>>>
> > > > > >>>
> > > > >
> > > > > --
> > > > > Jean-Baptiste Onofré
> > > > > jbono...@apache.org
> > > > > http://blog.nanthrax.net
> > > > > Talend - http://www.talend.com
> > > > >
> > > >
> > >
> >
>
