Spark 2.1.1 has been released. Consider using the new release in this work.
Thanks

On Wed, Mar 29, 2017 at 5:43 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Cool for the PR merge, I will rebase my branch on it.
>
> Thanks!
> Regards
> JB
>
> On 03/29/2017 01:58 PM, Amit Sela wrote:
>
>> @Ted definitely makes sense.
>> @JB I'm merging https://github.com/apache/beam/pull/2354 soon so any
>> deprecated Spark API issues should be resolved.
>>
>> On Wed, Mar 29, 2017 at 2:46 PM Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> This is what I did over HBASE-16179:
>>>
>>> -        f.call((asJavaIterator(it), conn)).iterator()
>>> +        // the return type is different in spark 1.x & 2.x, we handle both cases
>>> +        f.call(asJavaIterator(it), conn) match {
>>> +          // spark 1.x
>>> +          case iterable: Iterable[R] => iterable.iterator()
>>> +          // spark 2.x
>>> +          case iterator: Iterator[R] => iterator
>>> +        }
>>>        )
>>>
>>> FYI
>>>
>>> On Wed, Mar 29, 2017 at 1:47 AM, Amit Sela <amitsel...@gmail.com> wrote:
>>>
>>>> Just tried to replace dependencies and see what happens:
>>>>
>>>> Most required changes are about the runner using deprecated Spark APIs,
>>>> and after fixing them the only real issue is with the Java API for
>>>> Pair/FlatMapFunction that changed return value to Iterator (in 1.6 it's
>>>> Iterable).
>>>>
>>>> So I'm not sure that a profile that simply sets dependency on
>>>> 1.6.3/2.1.0 is feasible.
>>>>
>>>> On Thu, Mar 23, 2017 at 10:22 AM Kobi Salant <kobi.sal...@gmail.com> wrote:
>>>>
>>>>> So, if everything is in place in Spark 2.X and we use provided
>>>>> dependencies for Spark in Beam, theoretically you can run the same
>>>>> code in 2.X without any need for a branch?
>>>>>
>>>>> 2017-03-23 9:47 GMT+02:00 Amit Sela <amitsel...@gmail.com>:
>>>>>
>>>>>> If StreamingContext is valid and we don't have to use SparkSession,
>>>>>> and Accumulators are valid as well and we don't need AccumulatorV2,
>>>>>> I don't see a reason this shouldn't work (which means there are
>>>>>> still tons of reasons this could break, but I can't think of them
>>>>>> off the top of my head right now).
>>>>>>
>>>>>> @JB simply add a profile for the Spark dependencies and run the
>>>>>> tests - you'll have a very definitive answer ;-).
>>>>>> If this passes, try on a cluster running Spark 2 as well.
>>>>>>
>>>>>> Let me know if I can assist.
>>>>>>
>>>>>> On Thu, Mar 23, 2017 at 6:55 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> Ismaël summarized well what I have in mind.
>>>>>>>
>>>>>>> I'm a bit late on the PoC around that (I started a branch already).
>>>>>>> I will move forward over the weekend.
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On 03/22/2017 11:42 PM, Ismaël Mejía wrote:
>>>>>>>
>>>>>>>> Amit, I suppose JB is talking about the RDD based version, so no
>>>>>>>> need to worry about SparkSession or different incompatible APIs.
>>>>>>>>
>>>>>>>> Remember the idea we are discussing is to have in master both the
>>>>>>>> spark 1 and spark 2 runners using the RDD based translation. At
>>>>>>>> the same time we can have a feature branch to evolve the DataSet
>>>>>>>> based translator (this one will replace the RDD based translator
>>>>>>>> for spark 2 once it is mature).
>>>>>>>>
>>>>>>>> The advantages have already been discussed, as well as the
>>>>>>>> possible issues, so I think we have to see now if JB's idea is
>>>>>>>> feasible and how hard it would be to live with this while the
>>>>>>>> DataSet version evolves.
>>>>>>>>
>>>>>>>> I think what we are trying to avoid is to have a long-living
>>>>>>>> branch for a spark 2 runner based on RDD, because the maintenance
>>>>>>>> burden would be even worse. We would have to fight not only with
>>>>>>>> the double merge of fixes (in case the profile idea does not
>>>>>>>> work), but also with the continued evolution of Beam, and we
>>>>>>>> would end up in the long-living branch mess that other runners
>>>>>>>> have dealt with (e.g. the Apex runner):
>>>>>>>>
>>>>>>>> https://lists.apache.org/thread.html/12cc086f5ffe331cc70b89322ce5416c3112b87efc3393e3e16032a2@%3Cdev.beam.apache.org%3E
>>>>>>>>
>>>>>>>> What do you think about this, Amit? Would you be ok to go with it
>>>>>>>> if JB's profile idea proves to help with the maintenance issues?
>>>>>>>>
>>>>>>>> Ismaël
>>>>>>>>
>>>>>>>> On Wed, Mar 22, 2017 at 5:53 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> hbase-spark module doesn't use SparkSession. So situation there
>>>>>>>>> is simpler :-)
>>>>>>>>>
>>>>>>>>> On Wed, Mar 22, 2017 at 5:35 AM, Amit Sela <amitsel...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I'm still wondering how we'll do this - it's not just different
>>>>>>>>>> implementations of the same Class, but completely different
>>>>>>>>>> concepts, such as using SparkSession in Spark 2 instead of
>>>>>>>>>> SparkContext/StreamingContext in Spark 1.
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 21, 2017 at 7:25 PM Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I have done some work over in HBASE-16179 where compatibility
>>>>>>>>>>> modules are created to isolate changes in Spark 2.x API so
>>>>>>>>>>> that code in hbase-spark module can be reused.
>>>>>>>>>>>
>>>>>>>>>>> FYI
>>>>>>>
>>>>>>> --
>>>>>>> Jean-Baptiste Onofré
>>>>>>> jbono...@apache.org
>>>>>>> http://blog.nanthrax.net
>>>>>>> Talend - http://www.talend.com
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
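The core incompatibility Amit hit - Spark 1.x's Java `FlatMapFunction.call` returns an `Iterable<T>` while Spark 2.x's returns an `Iterator<T>` - can be bridged at runtime the same way Ted's Scala pattern match does. A minimal sketch of that idea in plain Java, with no Spark dependency; the `asIterator` helper and its call sites are hypothetical, not Beam or HBase code:

```java
import java.util.Arrays;
import java.util.Iterator;

public class IterAdapter {
    // Normalizes the result of a user function that may return either an
    // Iterable<T> (Spark 1.x FlatMapFunction contract) or an Iterator<T>
    // (Spark 2.x contract) - the Java analogue of the Scala match above.
    @SuppressWarnings("unchecked")
    static <T> Iterator<T> asIterator(Object result) {
        if (result instanceof Iterable) {        // Spark 1.x style
            return ((Iterable<T>) result).iterator();
        }
        if (result instanceof Iterator) {        // Spark 2.x style
            return (Iterator<T>) result;
        }
        throw new IllegalArgumentException("expected Iterable or Iterator");
    }

    public static void main(String[] args) {
        // 1.x-style result: an Iterable
        Iterator<Integer> fromIterable = asIterator(Arrays.asList(1, 2, 3));
        // 2.x-style result: already an Iterator, passed through unchanged
        Iterator<Integer> fromIterator = asIterator(Arrays.asList(4, 5).iterator());
        System.out.println(fromIterable.next() + " " + fromIterator.next()); // prints: 1 4
    }
}
```

The catch, as the thread implies, is that this only helps where the runner consumes the result; it does not change the `call` signature the user implements, which is why a single profile-switched source tree is harder than it first looks.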
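Amit's suggestion to "simply add a profile for the Spark dependencies", combined with Kobi's point about `provided` scope, would look roughly like this in a runner `pom.xml`. This is a sketch only: the profile ids are assumed, and only the versions mentioned in the thread (1.6.3/2.1.0) are taken from the discussion.

```xml
<!-- Sketch only: profile ids are illustrative, not the actual Beam build. -->
<profiles>
  <profile>
    <id>spark1</id>
    <activation><activeByDefault>true</activeByDefault></activation>
    <properties>
      <spark.version>1.6.3</spark.version>
    </properties>
  </profile>
  <profile>
    <id>spark2</id> <!-- activate with: mvn test -Pspark2 -->
    <properties>
      <spark.version>2.1.0</spark.version>
    </properties>
  </profile>
</profiles>

<!-- The dependency is declared once, provided-scoped as Kobi notes,
     so the cluster's Spark distribution supplies it at runtime. -->
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
```

Note that a profile like this only swaps versions; it cannot paper over source-incompatible signatures such as the `Iterable`/`Iterator` change, which is exactly Amit's reservation about feasibility.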
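The compatibility-module approach Ted describes for HBASE-16179 amounts to coding against a small version-neutral interface and shipping one implementation per Spark major version, selected at build time. A toy Java sketch of that pattern under stated assumptions: all names here are hypothetical, and no real Spark types are used so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Version-neutral facade the runner would code against. Hypothetical
// spark1-compat and spark2-compat modules would each implement it,
// isolating the incompatible API calls behind one seam.
interface SparkCompat {
    <T> List<T> parallelizeAndCollect(List<T> data);
}

// Stand-in for a spark1-compat implementation. A real one would go
// through JavaSparkContext; here we just copy the list so the sketch
// runs without any Spark dependency.
class Spark1Compat implements SparkCompat {
    @Override
    public <T> List<T> parallelizeAndCollect(List<T> data) {
        return new ArrayList<>(data);
    }
}

public class CompatDemo {
    public static void main(String[] args) {
        // Which implementation ends up on the classpath would be decided
        // by the active Maven profile, not by runtime logic.
        SparkCompat compat = new Spark1Compat();
        System.out.println(compat.parallelizeAndCollect(Arrays.asList("a", "b"))); // prints: [a, b]
    }
}
```

This is the same trade-off the thread circles around: the seam keeps one master branch buildable against both Spark lines, at the cost of maintaining a thin extra module per version.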