I'd love to take a look at the PR when it comes in (<3 BEAM + SPARK :)).
On Mon, Aug 21, 2017 at 11:33 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi
>
> I did a new runner supporting Spark 2.1.x. I changed the code for that.
>
> I'm still on vacation this week. I will send an update when I'm back.
>
> Regards
> JB
>
> On Aug 21, 2017, at 09:01, Pei HE <pei...@gmail.com> wrote:
>
>> Any updates for upgrading to Spark 2.x?
>>
>> I tried to replace the dependency and found a compile error from
>> implementing a Scala trait:
>>
>>   org.apache.beam.runners.spark.io.SourceRDD.SourcePartition is not abstract
>>   and does not override abstract method
>>   org$apache$spark$Partition$$super$equals(java.lang.Object) in
>>   org.apache.spark.Partition
>>
>> (The Spark-side change was introduced in
>> https://github.com/apache/spark/pull/12157.)
>>
>> Does anyone have ideas about this compile error?
>>
>> On Wed, May 3, 2017 at 1:32 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>>> Hi Ted,
>>>
>>> My branch used Spark 2.1.0 and I just updated to 2.1.1.
>>>
>>> As discussed with Aviem, I should be able to create the pull request
>>> later today.
>>>
>>> Regards
>>> JB
>>>
>>> On 05/03/2017 02:50 AM, Ted Yu wrote:
>>>
>>>> Spark 2.1.1 has been released.
>>>>
>>>> Consider using the new release in this work.
>>>>
>>>> Thanks
>>>>
>>>> On Wed, Mar 29, 2017 at 5:43 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>
>>>>> Cool for the PR merge, I will rebase my branch on it.
>>>>>
>>>>> Thanks!
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 03/29/2017 01:58 PM, Amit Sela wrote:
>>>>>
>>>>>> @Ted definitely makes sense.
>>>>>> @JB I'm merging https://github.com/apache/beam/pull/2354 soon, so any
>>>>>> deprecated Spark API issues should be resolved.
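[Editor's note: the error Pei quotes arises because a concrete Scala trait method that calls super is exposed to implementing classes as an abstract method with a mangled name. Since `$` is a legal character in Java identifiers, one possible workaround is to declare that mangled method verbatim in the Java class. The sketch below is Spark-free and illustrative only; the class names are stand-ins, not the actual Beam or Spark types, and it is not necessarily the fix Beam adopted.]

```java
// Spark-free sketch of the mangled-name workaround; all class names here
// are illustrative stand-ins, not the actual Beam or Spark types.
class BasePartitionLike {
    final int index;

    BasePartitionLike(int index) { this.index = index; }

    @Override
    public boolean equals(Object other) {
        return other instanceof BasePartitionLike
                && ((BasePartitionLike) other).index == index;
    }

    @Override
    public int hashCode() { return index; }
}

public class SourcePartitionSketch extends BasePartitionLike {
    SourcePartitionSketch(int index) { super(index); }

    // '$' is legal in Java identifiers, so the mangled abstract method named
    // in the compile error can be declared verbatim and delegated to the
    // JVM-level super implementation.
    public boolean org$apache$spark$Partition$$super$equals(Object other) {
        return super.equals(other);
    }

    public static void main(String[] args) {
        SourcePartitionSketch a = new SourcePartitionSketch(3);
        SourcePartitionSketch b = new SourcePartitionSketch(3);
        System.out.println(a.org$apache$spark$Partition$$super$equals(b)); // prints "true"
    }
}
```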
>>>>>> On Wed, Mar 29, 2017 at 2:46 PM Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>
>>>>>>> This is what I did over HBASE-16179:
>>>>>>>
>>>>>>>   -        f.call((asJavaIterator(it), conn)).iterator()
>>>>>>>   +        // the return type is different in Spark 1.x & 2.x, we handle both cases
>>>>>>>   +        f.call(asJavaIterator(it), conn) match {
>>>>>>>   +          // Spark 1.x
>>>>>>>   +          case iterable: Iterable[R] => iterable.iterator()
>>>>>>>   +          // Spark 2.x
>>>>>>>   +          case iterator: Iterator[R] => iterator
>>>>>>>   +        }
>>>>>>>            )
>>>>>>>
>>>>>>> FYI
>>>>>>>
>>>>>>> On Wed, Mar 29, 2017 at 1:47 AM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I just tried to replace the dependencies and see what happens:
>>>>>>>>
>>>>>>>> Most of the required changes are about the runner using deprecated
>>>>>>>> Spark APIs, and after fixing them the only real issue is with the Java
>>>>>>>> API for Pair/FlatMapFunction, which changed its return value to
>>>>>>>> Iterator (in 1.6 it's Iterable).
>>>>>>>>
>>>>>>>> So I'm not sure that a profile that simply sets the dependency on
>>>>>>>> 1.6.3/2.1.0 is feasible.
>>>>>>>>
>>>>>>>> On Thu, Mar 23, 2017 at 10:22 AM Kobi Salant <kobi.sal...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> So, if everything is in place in Spark 2.X and we use provided
>>>>>>>>> dependencies for Spark in Beam, theoretically you can run the same
>>>>>>>>> code on 2.X without any need for a branch?
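[Editor's note: Ted's Scala pattern match above has a straightforward Java analogue. The sketch below is hypothetical glue code, not an actual Beam or HBase API; it only illustrates normalizing the Iterable-vs-Iterator return-type difference Amit describes.]

```java
import java.util.Arrays;
import java.util.Iterator;

public class IterCompat {
    // Normalize a version-dependent result: Spark 1.x's FlatMapFunction
    // returned Iterable<R>, Spark 2.x returns Iterator<R>. Accept either
    // shape and hand back an Iterator in both cases.
    @SuppressWarnings("unchecked")
    static <R> Iterator<R> toIterator(Object result) {
        if (result instanceof Iterator) {        // Spark 2.x shape
            return (Iterator<R>) result;
        }
        if (result instanceof Iterable) {        // Spark 1.x shape
            return ((Iterable<R>) result).iterator();
        }
        throw new IllegalArgumentException("unexpected type: " + result.getClass());
    }

    public static void main(String[] args) {
        Iterator<Integer> fromIterable = toIterator(Arrays.asList(1, 2));
        Iterator<Integer> fromIterator = toIterator(Arrays.asList(3).iterator());
        System.out.println(fromIterable.next() + "," + fromIterator.next()); // prints "1,3"
    }
}
```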
>>>>>>>>> 2017-03-23 9:47 GMT+02:00 Amit Sela <amitsel...@gmail.com>:
>>>>>>>>>
>>>>>>>>>> If StreamingContext is valid and we don't have to use SparkSession,
>>>>>>>>>> and Accumulators are valid as well and we don't need AccumulatorV2,
>>>>>>>>>> I don't see a reason this shouldn't work (which means there are
>>>>>>>>>> still tons of reasons this could break, but I can't think of them
>>>>>>>>>> off the top of my head right now).
>>>>>>>>>>
>>>>>>>>>> @JB simply add a profile for the Spark dependencies and run the
>>>>>>>>>> tests - you'll have a very definitive answer ;-). If this passes,
>>>>>>>>>> try on a cluster running Spark 2 as well.
>>>>>>>>>>
>>>>>>>>>> Let me know if I can assist.
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 23, 2017 at 6:55 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi guys,
>>>>>>>>>>>
>>>>>>>>>>> Ismaël summarized well what I have in mind.
>>>>>>>>>>>
>>>>>>>>>>> I'm a bit late on the PoC around that (I started a branch already).
>>>>>>>>>>> I will move forward over the weekend.
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> JB
>>>>>>>>>>>
>>>>>>>>>>> On 03/22/2017 11:42 PM, Ismaël Mejía wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Amit, I suppose JB is talking about the RDD-based version, so no
>>>>>>>>>>>> need to worry about SparkSession or different incompatible APIs.
>>>>>>>>>>>>
>>>>>>>>>>>> Remember, the idea we are discussing is to have in master both the
>>>>>>>>>>>> Spark 1 and Spark 2 runners using the RDD-based translation.
>>>>>>>>>>>> At the same time we can have a feature branch to evolve the
>>>>>>>>>>>> Dataset-based translator (this one will replace the RDD-based
>>>>>>>>>>>> translator for Spark 2 once it is mature).
>>>>>>>>>>>>
>>>>>>>>>>>> The advantages have already been discussed, as well as the
>>>>>>>>>>>> possible issues, so I think we now have to see if JB's idea is
>>>>>>>>>>>> feasible and how hard it would be to live with this while the
>>>>>>>>>>>> Dataset version evolves.
>>>>>>>>>>>>
>>>>>>>>>>>> I think what we are trying to avoid is to have a long-living
>>>>>>>>>>>> branch for a Spark 2 runner based on RDDs, because the maintenance
>>>>>>>>>>>> burden would be even worse. We would have to fight not only with
>>>>>>>>>>>> the double merge of fixes (in case the profile idea does not
>>>>>>>>>>>> work), but also with the continued evolution of Beam, and we would
>>>>>>>>>>>> end up in the long-living-branch mess that other runners have
>>>>>>>>>>>> dealt with (e.g. the Apex runner):
>>>>>>>>>>>>
>>>>>>>>>>>> https://lists.apache.org/thread.html/12cc086f5ffe331cc70b89322ce5416c3112b87efc3393e3e16032a2@%3Cdev.beam.apache.org%3E
>>>>>>>>>>>>
>>>>>>>>>>>> What do you think about this, Amit?
>>>>>>>>>>>> Would you be ok to go with it if JB's profile idea proves to help
>>>>>>>>>>>> with the maintenance issues?
>>>>>>>>>>>>
>>>>>>>>>>>> Ismaël
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 22, 2017 at 5:53 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The hbase-spark module doesn't use SparkSession, so the situation
>>>>>>>>>>>>> there is simpler :-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm still wondering how we'll do this - it's not just different
>>>>>>>>>>>>>> implementations of the same class, but completely different
>>>>>>>>>>>>>> concepts, such as using SparkSession in Spark 2 instead of
>>>>>>>>>>>>>> SparkContext/StreamingContext in Spark 1.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Mar 21, 2017 at 7:25 PM Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have done some work over in HBASE-16179, where compatibility
>>>>>>>>>>>>>>> modules are created to isolate changes in the Spark 2.x API so
>>>>>>>>>>>>>>> that code in the hbase-spark module can be reused.
>>>>>>>>>>>>>>> FYI
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>>>> jbono...@apache.org
>>>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>>>> Talend - http://www.talend.com

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
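[Editor's note: the profile idea Amit and JB discuss above might look roughly like the following in the runner's pom.xml. This is an illustrative sketch only; the profile ids, the `spark.version` property, and the Scala suffix are assumptions, and the actual Beam build may organize its Spark dependency differently.]

```xml
<!-- Hypothetical sketch: select the provided Spark dependency per profile. -->
<profiles>
  <profile>
    <id>spark1</id>
    <activation><activeByDefault>true</activeByDefault></activation>
    <properties>
      <spark.version>1.6.3</spark.version>
    </properties>
  </profile>
  <profile>
    <id>spark2</id>
    <properties>
      <spark.version>2.1.0</spark.version>
    </properties>
  </profile>
</profiles>

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
```

A build run with `mvn test -Pspark2` would then compile and test against the Spark 2 line; as Amit notes above, this only works if the source compiles unchanged against both APIs.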