I'd love to take a look at the PR when it comes in (<3 BEAM + SPARK :)).
On Mon, Aug 21, 2017 at 11:33 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi
>
> I did a new runner supporting Spark 2.1.x. I changed the code for that.
>
> I'm still on vacation this week. I will send an update when I'm back.
>
> Regards
> JB
>
> On Aug 21, 2017, at 09:01, Pei HE <pei...@gmail.com> wrote:
>
>> Any updates for upgrading to Spark 2.x?
>>
>> I tried to replace the dependency and found a compile error from
>> implementing a Scala trait:
>>
>>   org.apache.beam.runners.spark.io.SourceRDD.SourcePartition is not abstract
>>   and does not override abstract method
>>   org$apache$spark$Partition$$super$equals(java.lang.Object) in
>>   org.apache.spark.Partition
>>
>> (The Spark-side change was introduced in
>> https://github.com/apache/spark/pull/12157.)
>>
>> Does anyone have ideas about this compile error?
>>
>> On Wed, May 3, 2017 at 1:32 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>>> Hi Ted,
>>>
>>> My branch used Spark 2.1.0 and I just updated to 2.1.1.
>>>
>>> As discussed with Aviem, I should be able to create the pull request
>>> later today.
>>>
>>> Regards
>>> JB
>>>
>>> On 05/03/2017 02:50 AM, Ted Yu wrote:
>>>
>>>> Spark 2.1.1 has been released.
>>>>
>>>> Consider using the new release in this work.
>>>>
>>>> Thanks
>>>>
>>>> On Wed, Mar 29, 2017 at 5:43 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>
>>>>> Cool for the PR merge, I will rebase my branch on it.
>>>>>
>>>>> Thanks!
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 03/29/2017 01:58 PM, Amit Sela wrote:
>>>>>
>>>>>> @Ted definitely makes sense.
>>>>>> @JB I'm merging https://github.com/apache/beam/pull/2354 soon, so any
>>>>>> deprecated Spark API issues should be resolved.
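[Editor's note: the error Pei quotes arises because a concrete Scala trait method that calls super is exposed to implementing classes as an abstract method with a mangled name. Since `$` is a legal character in Java identifiers, one possible workaround is to declare that mangled method verbatim in the Java class. The sketch below is Spark-free and illustrative only; the class names are stand-ins, not the actual Beam or Spark types, and it is not necessarily the fix Beam adopted.]

```java
// Spark-free sketch of the mangled-name workaround; all class names here
// are illustrative stand-ins, not the actual Beam or Spark types.
class BasePartitionLike {
    final int index;

    BasePartitionLike(int index) { this.index = index; }

    @Override
    public boolean equals(Object other) {
        return other instanceof BasePartitionLike
                && ((BasePartitionLike) other).index == index;
    }

    @Override
    public int hashCode() { return index; }
}

public class SourcePartitionSketch extends BasePartitionLike {
    SourcePartitionSketch(int index) { super(index); }

    // '$' is legal in Java identifiers, so the mangled abstract method named
    // in the compile error can be declared verbatim and delegated to the
    // JVM-level super implementation.
    public boolean org$apache$spark$Partition$$super$equals(Object other) {
        return super.equals(other);
    }

    public static void main(String[] args) {
        SourcePartitionSketch a = new SourcePartitionSketch(3);
        SourcePartitionSketch b = new SourcePartitionSketch(3);
        System.out.println(a.org$apache$spark$Partition$$super$equals(b)); // prints "true"
    }
}
```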
>>>>>> On Wed, Mar 29, 2017 at 2:46 PM Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>
>>>>>>> This is what I did over HBASE-16179:
>>>>>>>
>>>>>>>   -        f.call((asJavaIterator(it), conn)).iterator()
>>>>>>>   +        // the return type is different in Spark 1.x & 2.x, we handle both cases
>>>>>>>   +        f.call(asJavaIterator(it), conn) match {
>>>>>>>   +          // Spark 1.x
>>>>>>>   +          case iterable: Iterable[R] => iterable.iterator()
>>>>>>>   +          // Spark 2.x
>>>>>>>   +          case iterator: Iterator[R] => iterator
>>>>>>>   +        }
>>>>>>>            )
>>>>>>>
>>>>>>> FYI
>>>>>>>
>>>>>>> On Wed, Mar 29, 2017 at 1:47 AM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I just tried to replace the dependencies and see what happens:
>>>>>>>>
>>>>>>>> Most of the required changes are about the runner using deprecated
>>>>>>>> Spark APIs, and after fixing them the only real issue is with the Java
>>>>>>>> API for Pair/FlatMapFunction, which changed its return value to
>>>>>>>> Iterator (in 1.6 it's Iterable).
>>>>>>>>
>>>>>>>> So I'm not sure that a profile that simply sets the dependency on
>>>>>>>> 1.6.3/2.1.0 is feasible.
>>>>>>>>
>>>>>>>> On Thu, Mar 23, 2017 at 10:22 AM Kobi Salant <kobi.sal...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> So, if everything is in place in Spark 2.X and we use provided
>>>>>>>>> dependencies for Spark in Beam, theoretically you can run the same
>>>>>>>>> code on 2.X without any need for a branch?
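[Editor's note: Ted's Scala pattern match above has a straightforward Java analogue. The sketch below is hypothetical glue code, not an actual Beam or HBase API; it only illustrates normalizing the Iterable-vs-Iterator return-type difference Amit describes.]

```java
import java.util.Arrays;
import java.util.Iterator;

public class IterCompat {
    // Normalize a version-dependent result: Spark 1.x's FlatMapFunction
    // returned Iterable<R>, Spark 2.x returns Iterator<R>. Accept either
    // shape and hand back an Iterator in both cases.
    @SuppressWarnings("unchecked")
    static <R> Iterator<R> toIterator(Object result) {
        if (result instanceof Iterator) {        // Spark 2.x shape
            return (Iterator<R>) result;
        }
        if (result instanceof Iterable) {        // Spark 1.x shape
            return ((Iterable<R>) result).iterator();
        }
        throw new IllegalArgumentException("unexpected type: " + result.getClass());
    }

    public static void main(String[] args) {
        Iterator<Integer> fromIterable = toIterator(Arrays.asList(1, 2));
        Iterator<Integer> fromIterator = toIterator(Arrays.asList(3).iterator());
        System.out.println(fromIterable.next() + "," + fromIterator.next()); // prints "1,3"
    }
}
```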
>>>>>>>>> 2017-03-23 9:47 GMT+02:00 Amit Sela <amitsel...@gmail.com>:
>>>>>>>>>
>>>>>>>>>> If StreamingContext is valid and we don't have to use SparkSession,
>>>>>>>>>> and Accumulators are valid as well and we don't need AccumulatorV2,
>>>>>>>>>> I don't see a reason this shouldn't work (which means there are
>>>>>>>>>> still tons of reasons this could break, but I can't think of them
>>>>>>>>>> off the top of my head right now).
>>>>>>>>>>
>>>>>>>>>> @JB simply add a profile for the Spark dependencies and run the
>>>>>>>>>> tests - you'll have a very definitive answer ;-). If this passes,
>>>>>>>>>> try on a cluster running Spark 2 as well.
>>>>>>>>>>
>>>>>>>>>> Let me know if I can assist.
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 23, 2017 at 6:55 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi guys,
>>>>>>>>>>>
>>>>>>>>>>> Ismaël summarized well what I have in mind.
>>>>>>>>>>>
>>>>>>>>>>> I'm a bit late on the PoC around that (I started a branch already).
>>>>>>>>>>> I will move forward over the weekend.
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> JB
>>>>>>>>>>>
>>>>>>>>>>> On 03/22/2017 11:42 PM, Ismaël Mejía wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Amit, I suppose JB is talking about the RDD-based version, so no
>>>>>>>>>>>> need to worry about SparkSession or different incompatible APIs.
>>>>>>>>>>>>
>>>>>>>>>>>> Remember, the idea we are discussing is to have in master both the
>>>>>>>>>>>> Spark 1 and Spark 2 runners using the RDD-based translation.
>>>>>>>>>>>> At the same time we can have a feature branch to evolve the
>>>>>>>>>>>> Dataset-based translator (this one will replace the RDD-based
>>>>>>>>>>>> translator for Spark 2 once it is mature).
>>>>>>>>>>>>
>>>>>>>>>>>> The advantages have already been discussed, as well as the
>>>>>>>>>>>> possible issues, so I think we now have to see if JB's idea is
>>>>>>>>>>>> feasible and how hard it would be to live with this while the
>>>>>>>>>>>> Dataset version evolves.
>>>>>>>>>>>>
>>>>>>>>>>>> I think what we are trying to avoid is to have a long-living
>>>>>>>>>>>> branch for a Spark 2 runner based on RDDs, because the maintenance
>>>>>>>>>>>> burden would be even worse. We would have to fight not only with
>>>>>>>>>>>> the double merge of fixes (in case the profile idea does not
>>>>>>>>>>>> work), but also with the continued evolution of Beam, and we would
>>>>>>>>>>>> end up in the long-living-branch mess that other runners have
>>>>>>>>>>>> dealt with (e.g. the Apex runner):
>>>>>>>>>>>>
>>>>>>>>>>>> https://lists.apache.org/thread.html/12cc086f5ffe331cc70b89322ce5416c3112b87efc3393e3e16032a2@%3Cdev.beam.apache.org%3E
>>>>>>>>>>>>
>>>>>>>>>>>> What do you think about this, Amit?
>>>>>>>>>>>> Would you be ok to go with it if JB's profile idea proves to help
>>>>>>>>>>>> with the maintenance issues?
>>>>>>>>>>>>
>>>>>>>>>>>> Ismaël
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 22, 2017 at 5:53 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The hbase-spark module doesn't use SparkSession, so the situation
>>>>>>>>>>>>> there is simpler :-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm still wondering how we'll do this - it's not just different
>>>>>>>>>>>>>> implementations of the same class, but completely different
>>>>>>>>>>>>>> concepts, such as using SparkSession in Spark 2 instead of
>>>>>>>>>>>>>> SparkContext/StreamingContext in Spark 1.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Mar 21, 2017 at 7:25 PM Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have done some work over in HBASE-16179, where compatibility
>>>>>>>>>>>>>>> modules are created to isolate changes in the Spark 2.x API so
>>>>>>>>>>>>>>> that code in the hbase-spark module can be reused.
>>>>>>>>>>>>>>> FYI
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>>>> jbono...@apache.org
>>>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>>>> Talend - http://www.talend.com

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
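[Editor's note: the profile idea Amit and JB discuss above might look roughly like the following in the runner's pom.xml. This is an illustrative sketch only; the profile ids, the `spark.version` property, and the Scala suffix are assumptions, and the actual Beam build may organize its Spark dependency differently.]

```xml
<!-- Hypothetical sketch: select the provided Spark dependency per profile. -->
<profiles>
  <profile>
    <id>spark1</id>
    <activation><activeByDefault>true</activeByDefault></activation>
    <properties>
      <spark.version>1.6.3</spark.version>
    </properties>
  </profile>
  <profile>
    <id>spark2</id>
    <properties>
      <spark.version>2.1.0</spark.version>
    </properties>
  </profile>
</profiles>

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
```

A build run with `mvn test -Pspark2` would then compile and test against the Spark 2 line; as Amit notes above, this only works if the source compiles unchanged against both APIs.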