The CombineFn API has three type parameters (input, accumulator, and output) and methods that approximately correspond to those parts of the Collector:
CombineFn.createAccumulator = supplier
CombineFn.addInput = accumulator
CombineFn.mergeAccumulators = combiner
CombineFn.extractOutput = finisher

That said, the two APIs have some minimal, cosmetic differences. For example, CombineFn.addInput may either mutate the accumulator or return it, while the Collector accumulator method is a BiConsumer, meaning it must mutate its argument.

On Tue, Mar 13, 2018 at 11:39 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:

> Misses the collect split in 3 (supplier, combiner, aggregator) but
> globally agree.
>
> I'd just take Java Stream, remove the "client" methods or make them big data
> friendly if possible, ensure all hooks are serializable to avoid hacks, and add
> an unwrap to be able to access the pipeline in case we need a very custom
> thing, and we are done for me.
>
> On Mar 13, 2018 at 19:26, "Ben Chambers" <bchamb...@apache.org> wrote:
>
>> I think the existing rationale (not introducing lots of special fluent
>> methods) makes sense. However, if we look at the Java Stream API, we
>> probably wouldn't need to introduce *a lot* of fluent builders to get most
>> of the functionality. Specifically, if we focus on map, flatMap, and
>> collect from the Stream API, and a few extensions, we get something like:
>>
>> * collection.map(DoFn) for applying a ParDo
>> * collection.map(SerializableFn) for Java 8 lambda shorthand
>> * collection.flatMap(SerializableFn) for Java 8 lambda shorthand
>> * collection.collect(CombineFn) for applying a CombineFn
>> * collection.apply(PTransform) for applying a composite transform. Note
>> that PTransforms could also use serializable lambdas for definition.
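The correspondence listed at the top of the thread can be seen by writing one aggregation with java.util.stream.Collector.of, whose four arguments line up one-to-one with the CombineFn methods. This is a minimal sketch (the class name and the sum example are illustrative, not taken from Beam); the comments mark which CombineFn method each piece would correspond to:

```java
import java.util.stream.Collector;
import java.util.stream.Stream;

public class SumCollector {
    // Collector.of takes the same four pieces a CombineFn defines:
    //   supplier    ~ CombineFn.createAccumulator
    //   accumulator ~ CombineFn.addInput   (BiConsumer: must mutate)
    //   combiner    ~ CombineFn.mergeAccumulators
    //   finisher    ~ CombineFn.extractOutput
    public static Collector<Integer, long[], Long> sum() {
        return Collector.of(
            () -> new long[1],                     // createAccumulator
            (acc, x) -> acc[0] += x,               // addInput (mutates in place)
            (a, b) -> { a[0] += b[0]; return a; }, // mergeAccumulators
            acc -> acc[0]);                        // extractOutput
    }

    public static void main(String[] args) {
        long total = Stream.of(1, 2, 3, 4).collect(sum());
        System.out.println(total); // prints 10
    }
}
```

The mutating accumulator lambda is exactly the cosmetic difference noted above: Collector's accumulator is a BiConsumer, so it has no way to return a fresh accumulator the way CombineFn.addInput may.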
>>
>> (Note that GroupByKey doesn't even show up here -- it could, but it
>> could also be a way of wrapping a collector, as in the Java 8
>> Collectors.groupingBy [1].)
>>
>> With this, we could write code like:
>>
>> collection
>>     .map(myDoFn)
>>     .map((s) -> s.toString())
>>     .collect(new IntegerCombineFn())
>>     .apply(GroupByKey.of());
>>
>> That said, my two concerns are:
>> (1) having two similar but different Java APIs. If we have a more idiomatic
>> way of writing pipelines in Java, we should make that the standard.
>> Otherwise, users will be confused by seeing "Beam" examples written in
>> multiple, incompatible syntaxes.
>> (2) making sure the above is truly idiomatic Java and that it doesn't have any
>> conflicts with the cross-language Beam programming model. I don't think it
>> does. We have (I believe) chosen to make the Python and Go SDKs idiomatic
>> for those languages where possible.
>>
>> If this work is focused on making the Java SDK more idiomatic (and thus
>> easier for Java users to learn), it seems like a good thing. We should just
>> make sure it doesn't scope-creep into defining an entirely new DSL or SDK.
>>
>> [1]
>> https://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#groupingBy-java.util.function.Function-
>>
>> On Tue, Mar 13, 2018 at 11:06 AM Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> Yep
>>>
>>> I know the rationale and it makes sense, but it also raises the
>>> entry bar for users and is not that smooth in IDEs, in particular for
>>> custom code.
>>>
>>> So I really think it makes sense to build a user-friendly API on top of
>>> Beam's core dev one.
>>>
>>> On Mar 13, 2018 at 18:35, "Aljoscha Krettek" <aljos...@apache.org> wrote:
>>>
>>>> https://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html
>>>>
>>>> On 11.
Mar 2018, at 22:21, Romain Manni-Bucau <rmannibu...@gmail.com>
>>>> wrote:
>>>>
>>>> On Mar 12, 2018 at 00:16, "Reuven Lax" <re...@google.com> wrote:
>>>>
>>>> I think it would be interesting to see what a Java stream-based API
>>>> would look like. As I mentioned elsewhere, we are not limited to having
>>>> only one API for Beam.
>>>>
>>>> If I remember correctly, a Java stream API was considered for Dataflow
>>>> back at the very beginning. I don't completely remember why it was
>>>> rejected, but I suspect at least part of the reason might have been that
>>>> Java streams were considered too new and untested back then.
>>>>
>>>> Coders are broken - type variables don't have bounds except Object - and
>>>> reducers are not trivial to implement generally, I guess.
>>>>
>>>> However, being close to this API can help a lot, so +1 to try to have a
>>>> Java DSL on top of the current API. Would also be neat to integrate it with
>>>> CompletionStage :).
>>>>
>>>> Reuven
>>>>
>>>> On Sun, Mar 11, 2018 at 2:29 PM Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>>
>>>>> On Mar 11, 2018 at 21:18, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
>>>>>
>>>>> Hi Romain,
>>>>>
>>>>> I remember we discussed the way to express pipelines a while ago.
>>>>>
>>>>> I was a fan of a "DSL" compared to the one we have in Camel: instead of
>>>>> using apply(), use a dedicated form (like .map(), .reduce(), etc.; AFAIR,
>>>>> that's the approach in Flume).
>>>>> However, we agreed that the apply() syntax gives a more flexible approach.
>>>>>
>>>>> Using Java Stream is interesting, but I'm afraid we would have the same
>>>>> issue as the one we identified discussing a "fluent Java SDK". However, we
>>>>> can have a Stream API DSL on top of the SDK IMHO.
>>>>>
>>>>> Agreed, and a Beam stream interface (copying the JDK API but making lambdas
>>>>> serializable to avoid the need for casts).
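The "making lambdas serializable to avoid the need for casts" point can be made concrete with a small sketch. The interface and method names below are illustrative, not Beam's actual API: with a plain JDK functional type, a serializable lambda needs an intersection-type cast at every call site, while a dedicated functional interface that already extends Serializable removes the cast entirely:

```java
import java.io.Serializable;
import java.util.function.Function;

public class SerializableLambdas {
    // A dedicated functional interface that is Serializable by declaration.
    public interface SerializableFunction<I, O> extends Function<I, O>, Serializable {}

    static <I, O> O applyPlain(Function<I, O> fn, I in) { return fn.apply(in); }
    static <I, O> O applySer(SerializableFunction<I, O> fn, I in) { return fn.apply(in); }

    public static void main(String[] args) {
        // With the plain JDK type, the caller must spell out an
        // intersection-type cast to get a serializable lambda:
        String a = applyPlain((Function<Integer, String> & Serializable) i -> "v" + i, 1);

        // With the dedicated interface, the lambda is serializable
        // with no cast, because the target type already is:
        String b = applySer(i -> "v" + i, 2);

        System.out.println(a + " " + b); // prints v1 v2
    }
}
```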
>>>>>
>>>>> On my side, I think it enables users to discover the API. If you check
>>>>> my PoC impl you quickly see the steps needed to do simple things like a map,
>>>>> which is a first-class citizen.
>>>>>
>>>>> Also curious whether we could implement reduce with pipeline result = get an
>>>>> output of a batch from the runner (client) JVM. I see how to do it for
>>>>> longs - with metrics - but not for collect().
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 11/03/2018 19:46, Romain Manni-Bucau wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I don't know if you already experimented with the Java Stream API as a
>>>>>> replacement for the pipeline API, but I did some tests:
>>>>>> https://github.com/rmannibucau/jbeam
>>>>>>
>>>>>> It is far from complete but already shows where it fails (Beam
>>>>>> doesn't have a way to reduce on the caller machine for instance, coder
>>>>>> handling is not that trivial, lambdas don't work well with the default
>>>>>> Stream API, etc.).
>>>>>>
>>>>>> However, it is interesting to see that having such an API is pretty
>>>>>> natural compared to the pipeline API,
>>>>>> so I wonder if Beam should work on its own Stream API (with surely
>>>>>> another name, for obvious reasons ;)).
>>>>>>
>>>>>> Romain Manni-Bucau
>>>>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog <https://rmannibucau.metawerx.net/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book <https://www.packtpub.com/application-development/java-ee-8-high-performance>
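The fluent shape discussed throughout the thread (collection.map(...).collect(...)) can be prototyped without Beam at all. The sketch below is purely hypothetical - a wrapper class backed by an in-memory List rather than a real PCollection, with made-up names (FluentCollection, SerializableFn) - but it shows how a serializable lambda type lets call sites stay cast-free, which is the ergonomic point Romain raises:

```java
import java.io.Serializable;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical fluent wrapper standing in for a PCollection. A real Beam
// version would translate map to a ParDo and run on a pipeline; here a
// List backs it so the API shape can be tried in isolation.
public class FluentCollection<T> {
    // Serializable lambda type, so the function could be shipped to workers.
    public interface SerializableFn<I, O> extends Function<I, O>, Serializable {}

    private final List<T> elements;

    public FluentCollection(List<T> elements) {
        this.elements = elements;
    }

    // Because the parameter type is already serializable, callers can pass
    // a bare lambda - no intersection-type cast needed.
    public <O> FluentCollection<O> map(SerializableFn<T, O> fn) {
        return new FluentCollection<>(
            elements.stream().map(fn).collect(Collectors.toList()));
    }

    public List<T> get() {
        return elements;
    }

    public static void main(String[] args) {
        List<String> out = new FluentCollection<>(List.of(1, 2, 3))
            .map(i -> "#" + i)
            .get();
        System.out.println(out); // prints [#1, #2, #3]
    }
}
```

A real implementation would also have to solve the problems the thread identifies - coder inference for the lambda's output type, and the lack of a client-side reduce - which this in-memory stand-in sidesteps entirely.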