Yep, since we can pass lambdas I guess it is fine; otherwise we would have to use proxies to hide the mutation. But I don't think we need to be that purist to move to a more expressive DSL.
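The BiConsumer mutation discussed in the quoted message below can be seen concretely by expressing the four CombineFn parts as a plain `java.util.stream.Collector`. A minimal sketch; the `SumCount` accumulator type and the averaging computation are illustrative choices, not Beam code:

```java
import java.util.stream.Collector;
import java.util.stream.Stream;

public class CollectorVsCombineFn {
    // Illustrative accumulator type, playing the role of a CombineFn accumulator.
    static final class SumCount {
        long sum;
        long count;
    }

    public static void main(String[] args) {
        // The four Collector parts, each annotated with the CombineFn method
        // it approximately corresponds to.
        Collector<Integer, SumCount, Double> averaging = Collector.of(
                SumCount::new,                               // createAccumulator (supplier)
                (acc, x) -> { acc.sum += x; acc.count++; },  // addInput (BiConsumer: must mutate)
                (a, b) -> { a.sum += b.sum; a.count += b.count; return a; }, // mergeAccumulators (combiner)
                acc -> (double) acc.sum / acc.count);        // extractOutput (finisher)

        System.out.println(Stream.of(1, 2, 3, 4).collect(averaging)); // 2.5
    }
}
```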
On 13 March 2018 at 19:49, "Ben Chambers" <bjchamb...@gmail.com> wrote:

> The CombineFn API has three type parameters (input, accumulator, and
> output) and methods that approximately correspond to those parts of the
> collector:
>
> CombineFn.createAccumulator = supplier
> CombineFn.addInput = accumulator
> CombineFn.mergeAccumulator = combiner
> CombineFn.extractOutput = finisher
>
> That said, the Collector API has some minimal, cosmetic differences, such
> as that CombineFn.addInput may either mutate the accumulator or return it,
> while the Collector accumulator method is a BiConsumer, meaning it must mutate.
>
> On Tue, Mar 13, 2018 at 11:39 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
>> It misses the collect split in three (supplier, combiner, aggregator), but
>> globally I agree.
>>
>> I'd just take Java Stream, remove the "client" methods or make them big-data
>> friendly if possible, ensure all hooks are serializable to avoid hacks, and
>> add an unwrap to be able to access the pipeline in case we need a very
>> custom thing, and we are done as far as I'm concerned.
>>
>> On 13 March 2018 at 19:26, "Ben Chambers" <bchamb...@apache.org> wrote:
>>
>>> I think the existing rationale (not introducing lots of special fluent
>>> methods) makes sense. However, if we look at the Java Stream API, we
>>> probably wouldn't need to introduce *a lot* of fluent builders to get most
>>> of the functionality. Specifically, if we focus on map, flatMap, and
>>> collect from the Stream API, plus a few extensions, we get something like:
>>>
>>> * collection.map(DoFn) for applying a ParDo
>>> * collection.map(SerializableFn) as Java 8 lambda shorthand
>>> * collection.flatMap(SerializableFn) as Java 8 lambda shorthand
>>> * collection.collect(CombineFn) for applying a CombineFn
>>> * collection.apply(PTransform) for applying a composite transform; note
>>>   that PTransforms could also use serializable lambdas for their definition.
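The fluent methods listed above can be sketched as a thin layer that only delegates to apply(). This is a toy, eager, in-memory sketch; the `Coll` and `SerFn` names and the list-backed evaluation are assumptions for illustration, not the Beam API:

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class FluentWrapperSketch {
    // Hypothetical serializable lambda type (stand-in for a SerializableFn).
    @FunctionalInterface
    interface SerFn<T, R> extends Function<T, R>, Serializable {}

    // Toy "PCollection" holding values in memory; the real one is deferred.
    static final class Coll<T> {
        final List<T> values;
        Coll(List<T> values) { this.values = values; }

        // apply() stays the single generic entry point, as in Beam today.
        <R> Coll<R> apply(Function<Coll<T>, Coll<R>> transform) {
            return transform.apply(this);
        }

        // map() is fluent sugar that delegates to apply(): a Stream-like DSL
        // layered on the core API rather than replacing it.
        <R> Coll<R> map(SerFn<T, R> fn) {
            return apply(in -> {
                List<R> out = new ArrayList<>();
                for (T t : in.values) out.add(fn.apply(t));
                return new Coll<>(out);
            });
        }
    }

    public static void main(String[] args) {
        Coll<String> words = new Coll<>(Arrays.asList("a", "bb", "ccc"));
        Coll<Integer> lengths = words.map(String::length);
        System.out.println(lengths.values); // [1, 2, 3]
    }
}
```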
>>> (Note that GroupByKey doesn't even show up here -- it could, but it
>>> could also be a way of wrapping a collector, as in the Java 8
>>> Collectors.groupingBy [1].)
>>>
>>> With this, we could write code like:
>>>
>>> collection
>>>     .map(myDoFn)
>>>     .map((s) -> s.toString())
>>>     .collect(new IntegerCombineFn())
>>>     .apply(GroupByKey.of());
>>>
>>> That said, my two concerns are:
>>> (1) having two similar but different Java APIs. If we have a more
>>> idiomatic way of writing pipelines in Java, we should make that the
>>> standard. Otherwise, users will be confused by seeing "Beam" examples
>>> written in multiple, incompatible syntaxes.
>>> (2) making sure the above is truly idiomatic Java and that it doesn't
>>> have any conflicts with the cross-language Beam programming model. I don't
>>> think it does. We have (I believe) chosen to make the Python and Go SDKs
>>> idiomatic for those languages where possible.
>>>
>>> If this work is focused on making the Java SDK more idiomatic (and thus
>>> easier for Java users to learn), it seems like a good thing. We should just
>>> make sure it doesn't scope-creep into defining an entirely new DSL or SDK.
>>>
>>> [1] https://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#groupingBy-java.util.function.Function-
>>>
>>> On Tue, Mar 13, 2018 at 11:06 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>
>>>> Yep.
>>>>
>>>> I know the rationale and it makes sense, but it also raises the entry
>>>> barrier for users and is not that smooth in IDEs, in particular for
>>>> custom code.
>>>>
>>>> So I really think it makes sense to build a user-friendly API on top of
>>>> the Beam core developer one.
>>>>
>>>> On 13 March 2018 at 18:35, "Aljoscha Krettek" <aljos...@apache.org> wrote:
>>>>
>>>>> https://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html
>>>>>
>>>>> On 11. Mar 2018, at 22:21, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>>
>>>>> On 12 March 2018 at 00:16, "Reuven Lax" <re...@google.com> wrote:
>>>>>
>>>>> I think it would be interesting to see what a Java stream-based API
>>>>> would look like. As I mentioned elsewhere, we are not limited to having
>>>>> only one API for Beam.
>>>>>
>>>>> If I remember correctly, a Java stream API was considered for Dataflow
>>>>> back at the very beginning. I don't completely remember why it was
>>>>> rejected, but I suspect at least part of the reason might have been that
>>>>> Java streams were considered too new and untested back then.
>>>>>
>>>>> Coders are broken - type variables don't have bounds except Object - and
>>>>> reducers are not trivial to implement in general, I guess.
>>>>>
>>>>> However, being close to this API can help a lot, so +1 to try to have a
>>>>> Java DSL on top of the current API. It would also be neat to integrate it
>>>>> with CompletionStage :).
>>>>>
>>>>> Reuven
>>>>>
>>>>> On Sun, Mar 11, 2018 at 2:29 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>>
>>>>>> On 11 March 2018 at 21:18, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
>>>>>>
>>>>>> Hi Romain,
>>>>>>
>>>>>> I remember we discussed the way to express pipelines a while ago.
>>>>>>
>>>>>> I was a fan of a "DSL" compared to the one we have in Camel: instead of
>>>>>> using apply(), use a dedicated form (like .map(), .reduce(), etc.; AFAIR
>>>>>> that's the approach in Flume).
>>>>>> However, we agreed that the apply() syntax gives a more flexible approach.
>>>>>>
>>>>>> Using Java Stream is interesting, but I'm afraid we would have the same
>>>>>> issue as the one we identified when discussing the "fluent Java SDK".
>>>>>> However, we can have a Stream API DSL on top of the SDK, IMHO.
>>>>>>
>>>>>> Agreed, and a Beam stream interface (copying the JDK API but making
>>>>>> lambdas serializable to avoid the need for casts).
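The "avoid the cast need" point refers to the intersection cast that plain java.util.function types require to yield a serializable lambda; a dedicated functional interface extending Serializable (the approach Beam's SerializableFunction takes) removes it. A minimal standalone sketch, with `SerFunction` as an illustrative stand-in rather than the real Beam type:

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

public class SerializableLambdaDemo {
    // A functional interface that is also Serializable: lambdas targeting it
    // are serializable by construction, with no cast at the call site.
    @FunctionalInterface
    interface SerFunction<T, R> extends Function<T, R>, Serializable {}

    public static void main(String[] args) throws Exception {
        // Plain java.util.function types need an intersection cast at every
        // call site to produce a serializable lambda:
        Function<String, Integer> plain =
                (Function<String, Integer> & Serializable) String::length;

        // The dedicated interface makes the same method reference serializable
        // without any ceremony:
        SerFunction<String, Integer> ser = String::length;

        // Both round-trip through Java serialization.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(plain);
            out.writeObject(ser);
        }
        System.out.println(ser.apply("beam")); // 4
    }
}
```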
>>>>>> On my side I think it lets users discover the API. If you check my PoC
>>>>>> implementation you quickly see the steps needed to do simple things like
>>>>>> a map, which is a first-class citizen.
>>>>>>
>>>>>> I'm also curious whether we could implement reduce with "pipeline result
>>>>>> = get an output of a batch from the runner (client) JVM". I see how to do
>>>>>> it for longs - with metrics - but not for collect().
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On 11/03/2018 19:46, Romain Manni-Bucau wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> I don't know if you have already experimented with the Java Stream API
>>>>>>> as a replacement for the pipeline API, but I did some tests:
>>>>>>> https://github.com/rmannibucau/jbeam
>>>>>>>
>>>>>>> It is far from complete but already shows where it fails (Beam doesn't
>>>>>>> have a way to reduce in the caller machine for instance, coder handling
>>>>>>> is not that trivial, lambdas don't work well with the default Stream
>>>>>>> API, etc.).
>>>>>>>
>>>>>>> However, it is interesting to see that having such an API is pretty
>>>>>>> natural compared to the pipeline API, so I wonder whether Beam should
>>>>>>> work on its own Stream API (with surely another name, for obvious
>>>>>>> reasons ;)).
>>>>>>>
>>>>>>> Romain Manni-Bucau
>>>>>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog <https://rmannibucau.metawerx.net/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book <https://www.packtpub.com/application-development/java-ee-8-high-performance>
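As a closing illustration of the Collectors.groupingBy reference [1] made earlier in the thread, grouping itself can be expressed as a collector rather than a dedicated GroupByKey transform. A plain-JDK sketch; the word list and first-letter key are chosen purely for illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupingByDemo {
    public static void main(String[] args) {
        // Group words by an arbitrary key (their first letter): the Stream
        // analogue of a GroupByKey over KV<Character, String> pairs.
        Map<Character, List<String>> grouped =
                Arrays.asList("apple", "avocado", "banana").stream()
                        .collect(Collectors.groupingBy(w -> w.charAt(0)));

        System.out.println(grouped.get('a')); // [apple, avocado]
        System.out.println(grouped.get('b')); // [banana]
    }
}
```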