Hi,

I wrote a new runner supporting Spark 2.1.x and changed the code accordingly. I'm still on vacation this week; I will send an update when I am back.

Regards
JB

On Aug 21, 2017, at 09:01, Pei HE <pei...@gmail.com> wrote:
> Any updates on upgrading to Spark 2.x?
>
> I tried to replace the dependency and found a compile error from implementing a Scala trait:
>
>     org.apache.beam.runners.spark.io.SourceRDD.SourcePartition is not abstract
>     and does not override abstract method
>     org$apache$spark$Partition$$super$equals(java.lang.Object) in
>     org.apache.spark.Partition
>
> (The Spark-side change was introduced in https://github.com/apache/spark/pull/12157.)
>
> Does anyone have ideas about this compile error?

On Wed, May 3, 2017 at 1:32 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi Ted,
>
> My branch used Spark 2.1.0 and I just updated it to 2.1.1.
>
> As discussed with Aviem, I should be able to create the pull request later today.
>
> Regards
> JB

On 05/03/2017 02:50 AM, Ted Yu wrote:
> Spark 2.1.1 has been released.
>
> Consider using the new release in this work.
>
> Thanks

On Wed, Mar 29, 2017 at 5:43 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Cool for the PR merge, I will rebase my branch on it.
>
> Thanks!
> Regards
> JB

On 03/29/2017 01:58 PM, Amit Sela wrote:
> @Ted definitely makes sense.
> @JB I'm merging https://github.com/apache/beam/pull/2354 soon, so any deprecated Spark API issues should be resolved.
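[Editor's note] For context on the compile error above: the Spark-side change (https://github.com/apache/spark/pull/12157) added an `equals` override to the `Partition` trait that calls `super.equals`, and scalac turns that call into an abstract "super accessor" method which every implementing class must provide. Since `$` is legal in Java identifiers, one possible workaround is to implement the mangled method directly. A minimal self-contained sketch, using a stand-in interface rather than the real `org.apache.spark.Partition`:

```java
// Stand-in for what scalac roughly generates from Spark 2's Partition
// trait (an assumption for illustration only; the real trait lives in
// spark-core).
interface PartitionLike {
    int index();

    // scalac emits an abstract "super accessor" for the trait body's
    // `super.equals(other)` call; implementing classes must provide it.
    boolean org$apache$spark$Partition$$super$equals(Object other);
}

public class SourcePartitionSketch implements PartitionLike {
    private final int idx;

    public SourcePartitionSketch(int idx) {
        this.idx = idx;
    }

    @Override
    public int index() {
        return idx;
    }

    // '$' is a legal character in Java identifiers, so the mangled
    // accessor can be implemented directly by delegating to
    // Object.equals (reference equality), which is what the trait's
    // `super.equals` call means.
    @Override
    public boolean org$apache$spark$Partition$$super$equals(Object other) {
        return super.equals(other);
    }

    public static void main(String[] args) {
        SourcePartitionSketch p = new SourcePartitionSketch(0);
        System.out.println(p.org$apache$spark$Partition$$super$equals(p)); // true
    }
}
```

Whether this is preferable to a small Scala shim class that extends `Partition` for Java code to subclass is a design choice; the mangled name is a scalac implementation detail, not a stable API.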
On Wed, Mar 29, 2017 at 2:46 PM Ted Yu <yuzhih...@gmail.com> wrote:
> This is what I did over in HBASE-16179:
>
>     - f.call((asJavaIterator(it), conn)).iterator()
>     + // the return type is different in spark 1.x & 2.x, we handle both cases
>     + f.call(asJavaIterator(it), conn) match {
>     +   // spark 1.x
>     +   case iterable: Iterable[R] => iterable.iterator()
>     +   // spark 2.x
>     +   case iterator: Iterator[R] => iterator
>     + }
>       )
>
> FYI

On Wed, Mar 29, 2017 at 1:47 AM, Amit Sela <amitsel...@gmail.com> wrote:
> I just tried to replace the dependencies to see what happens.
>
> Most of the required changes are about the runner using deprecated Spark APIs, and after fixing them the only real issue is with the Java API for Pair/FlatMapFunction, which changed its return value to Iterator (in 1.6 it is Iterable).
>
> So I'm not sure that a profile that simply sets the dependency to 1.6.3/2.1.0 is feasible.

On Thu, Mar 23, 2017 at 10:22 AM Kobi Salant <kobi.sal...@gmail.com> wrote:
> So, if everything is in place in Spark 2.X and we use provided dependencies for Spark in Beam, theoretically you can run the same code on 2.X without any need for a branch?
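[Editor's note] Ted's Scala `match` above can be mirrored on the Java side with `instanceof` checks, which is one way shared runner code could cope with the `FlatMapFunction` contract change Amit mentions (Iterable in 1.6, Iterator in 2.x). A hedged sketch with illustrative names, not Beam's actual code:

```java
import java.util.Arrays;
import java.util.Iterator;

// Illustrative adapter (not Beam's actual code): normalize the result of
// a user flat-map function so the runner can target both Spark contracts.
public class FlatMapResultAdapter {

    // Spark 1.6's FlatMapFunction returns Iterable; Spark 2.x's returns
    // Iterator. Accept either shape and hand back an Iterator.
    @SuppressWarnings("unchecked")
    public static <T> Iterator<T> toIterator(Object result) {
        if (result instanceof Iterable) {      // spark 1.x style
            return ((Iterable<T>) result).iterator();
        }
        if (result instanceof Iterator) {      // spark 2.x style
            return (Iterator<T>) result;
        }
        throw new IllegalArgumentException(
            "expected Iterable or Iterator, got: " + result);
    }

    public static void main(String[] args) {
        // Both shapes normalize to the same Iterator view.
        Iterator<String> fromIterable = toIterator(Arrays.asList("a", "b"));
        Iterator<String> fromIterator =
            toIterator(Arrays.asList("a", "b").iterator());
        while (fromIterable.hasNext()) {
            System.out.println(fromIterable.next());
        }
        while (fromIterator.hasNext()) {
            System.out.println(fromIterator.next());
        }
    }
}
```

Note this only papers over the return type; the method signatures of `FlatMapFunction` itself still differ between the two artifacts at compile time, which is the crux of Amit's doubt about a dependency-only profile.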
2017-03-23 9:47 GMT+02:00 Amit Sela <amitsel...@gmail.com>:
> If StreamingContext is valid and we don't have to use SparkSession, and Accumulators are valid as well and we don't need AccumulatorsV2, I don't see a reason this shouldn't work (which means there are still tons of reasons this could break, but I can't think of them off the top of my head right now).
>
> @JB simply add a profile for the Spark dependencies and run the tests - you'll have a very definitive answer ;-) . If this passes, try it on a cluster running Spark 2 as well.
>
> Let me know if I can assist.

On Thu, Mar 23, 2017 at 6:55 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi guys,
>
> Ismaël summarized well what I have in mind.
>
> I'm a bit late on the PoC around that (I already started a branch). I will move forward over the weekend.
>
> Regards
> JB

On 03/22/2017 11:42 PM, Ismaël Mejía wrote:
> Amit, I suppose JB is talking about the RDD-based version, so no need to worry about SparkSession or different incompatible APIs.
>
> Remember, the idea we are discussing is to have both the Spark 1 and Spark 2 runners in master using the RDD-based translation. At the same time we can have a feature branch to evolve the DataSet-based translator (which will replace the RDD-based translator for Spark 2 once it is mature).
>
> The advantages have already been discussed, as well as the possible issues, so I think we now have to see whether JB's idea is feasible and how hard it would be to live with this while the DataSet version evolves.
>
> I think what we are trying to avoid is a long-living branch for a Spark 2 runner based on RDDs, because the maintenance burden would be even worse. We would have to fight not only the double merge of fixes (in case the profile idea does not work), but also the continued evolution of Beam, and we would end up in the long-living-branch mess that other runners have dealt with (e.g. the Apex runner):
>
> https://lists.apache.org/thread.html/12cc086f5ffe331cc70b89322ce5416c3112b87efc3393e3e16032a2@%3Cdev.beam.apache.org%3E
>
> What do you think about this, Amit? Would you be OK to go with it if JB's profile idea proves to help with the maintenance issues?
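[Editor's note] The profile idea Amit and JB are discussing could look roughly like this in a runner pom.xml (a sketch only; version numbers, the property name, and the Scala binary suffix are illustrative assumptions, not the actual Beam build):

```xml
<!-- Sketch only: select the provided Spark version per build. -->
<profiles>
  <profile>
    <id>spark-1</id>
    <activation><activeByDefault>true</activeByDefault></activation>
    <properties>
      <spark.version>1.6.3</spark.version>
    </properties>
  </profile>
  <profile>
    <id>spark-2</id>
    <properties>
      <spark.version>2.1.1</spark.version>
    </properties>
  </profile>
</profiles>

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
```

As Amit notes, a profile like this only switches the provided dependency; it cannot hide source-incompatible API changes such as the Iterator/Iterable one, and the Scala binary suffix of the artifacts may also differ between the two dependency sets.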
On Wed, Mar 22, 2017 at 5:53 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> The hbase-spark module doesn't use SparkSession, so the situation there is simpler :-)

On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela <amitsel...@gmail.com> wrote:
> I'm still wondering how we'll do this - it's not just different implementations of the same class, but completely different concepts, such as using SparkSession in Spark 2 instead of SparkContext/StreamingContext in Spark 1.

On Tue, Mar 21, 2017 at 7:25 PM Ted Yu <yuzhih...@gmail.com> wrote:
> I have done some work over in HBASE-16179, where compatibility modules are created to isolate changes in the Spark 2.x API so that code in the hbase-spark module can be reused.
> FYI

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
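[Editor's note] The compatibility-module approach Ted describes for HBASE-16179 amounts to routing shared code to a per-version shim, with each shim compiled against its own Spark API. A trivial, self-contained sketch of the version-routing part (all names hypothetical, not HBase's or Beam's actual code):

```java
// Hypothetical version router (illustrative names only): shared code asks
// which compatibility module to bind, and each module isolates the
// version-specific Spark API calls behind a common interface.
public class SparkCompatLoader {

    // Map a Spark version string to the (hypothetical) compat module
    // that was compiled against that API.
    public static String compatModuleFor(String sparkVersion) {
        if (sparkVersion.startsWith("1.")) {
            // Iterable-returning FlatMapFunction, old Accumulator API, ...
            return "spark1-compat";
        }
        // Iterator-returning FlatMapFunction, AccumulatorV2, SparkSession, ...
        return "spark2-compat";
    }

    public static void main(String[] args) {
        System.out.println(compatModuleFor("1.6.3")); // spark1-compat
        System.out.println(compatModuleFor("2.1.1")); // spark2-compat
    }
}
```

In a real build the two modules would be separate Maven artifacts, and only the one matching the provided Spark version would be placed on the classpath.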