The CombineFn API has three type parameters (input, accumulator, and output) and methods that approximately correspond to those parts of the Collector:
CombineFn.createAccumulator = supplier
CombineFn.addInput = accumulator
CombineFn.mergeAccumulators = combiner
CombineFn.extractOutput = finisher

That said, the two APIs have some minimal, cosmetic differences. For example, CombineFn.addInput may either mutate the accumulator or return it, while the Collector accumulator method is a BiConsumer, meaning it must mutate its argument.

On Tue, Mar 13, 2018 at 11:39 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:

> Misses the collect split in 3 (supplier, combiner, aggregator) but
> globally agree.
>
> I'd just take Java Stream, remove the "client" methods or make them big data
> friendly if possible, ensure all hooks are serializable to avoid hacks, and add
> an unwrap to be able to access the pipeline in case we need a very custom
> thing, and we are done for me.
>
> On Mar 13, 2018 at 19:26, "Ben Chambers" <bchamb...@apache.org> wrote:
>
>> I think the existing rationale (not introducing lots of special fluent
>> methods) makes sense. However, if we look at the Java Stream API, we
>> probably wouldn't need to introduce *a lot* of fluent builders to get most
>> of the functionality. Specifically, if we focus on map, flatMap, and
>> collect from the Stream API, and a few extensions, we get something like:
>>
>> * collection.map(DoFn) for applying a ParDo
>> * collection.map(SerializableFn) for Java 8 lambda shorthand
>> * collection.flatMap(SerializableFn) for Java 8 lambda shorthand
>> * collection.collect(CombineFn) for applying a CombineFn
>> * collection.apply(PTransform) for applying a composite transform. Note
>> that PTransforms could also use serializable lambdas for definition.
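The correspondence listed at the top of the thread can be seen by writing one aggregation with java.util.stream.Collector.of, whose four arguments line up one-to-one with the CombineFn methods. This is a minimal sketch (the class name and the sum example are illustrative, not taken from Beam); the comments mark which CombineFn method each piece would correspond to:

```java
import java.util.stream.Collector;
import java.util.stream.Stream;

public class SumCollector {
    // Collector.of takes the same four pieces a CombineFn defines:
    //   supplier    ~ CombineFn.createAccumulator
    //   accumulator ~ CombineFn.addInput   (BiConsumer: must mutate)
    //   combiner    ~ CombineFn.mergeAccumulators
    //   finisher    ~ CombineFn.extractOutput
    public static Collector<Integer, long[], Long> sum() {
        return Collector.of(
            () -> new long[1],                     // createAccumulator
            (acc, x) -> acc[0] += x,               // addInput (mutates in place)
            (a, b) -> { a[0] += b[0]; return a; }, // mergeAccumulators
            acc -> acc[0]);                        // extractOutput
    }

    public static void main(String[] args) {
        long total = Stream.of(1, 2, 3, 4).collect(sum());
        System.out.println(total); // prints 10
    }
}
```

The mutating accumulator lambda is exactly the cosmetic difference noted above: Collector's accumulator is a BiConsumer, so it has no way to return a fresh accumulator the way CombineFn.addInput may.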
>>
>> (Note that GroupByKey doesn't even show up here -- it could, but it
>> could also be a way of wrapping a collector, as in the Java 8
>> Collectors.groupingBy [1].)
>>
>> With this, we could write code like:
>>
>> collection
>>     .map(myDoFn)
>>     .map((s) -> s.toString())
>>     .collect(new IntegerCombineFn())
>>     .apply(GroupByKey.of());
>>
>> That said, my two concerns are:
>> (1) having two similar but different Java APIs. If we have a more idiomatic
>> way of writing pipelines in Java, we should make that the standard.
>> Otherwise, users will be confused by seeing "Beam" examples written in
>> multiple, incompatible syntaxes.
>> (2) making sure the above is truly idiomatic Java and that it doesn't have any
>> conflicts with the cross-language Beam programming model. I don't think it
>> does. We have (I believe) chosen to make the Python and Go SDKs idiomatic
>> for those languages where possible.
>>
>> If this work is focused on making the Java SDK more idiomatic (and thus
>> easier for Java users to learn), it seems like a good thing. We should just
>> make sure it doesn't scope-creep into defining an entirely new DSL or SDK.
>>
>> [1]
>> https://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#groupingBy-java.util.function.Function-
>>
>> On Tue, Mar 13, 2018 at 11:06 AM Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> Yep
>>>
>>> I know the rationale and it makes sense, but it also raises the
>>> entry bar for users and is not that smooth in IDEs, in particular for
>>> custom code.
>>>
>>> So I really think it makes sense to build a user-friendly API on top of
>>> Beam's core dev one.
>>>
>>> On Mar 13, 2018 at 18:35, "Aljoscha Krettek" <aljos...@apache.org> wrote:
>>>
>>>> https://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html
>>>>
>>>> On 11.
Mar 2018, at 22:21, Romain Manni-Bucau <rmannibu...@gmail.com>
>>>> wrote:
>>>>
>>>> On Mar 12, 2018 at 00:16, "Reuven Lax" <re...@google.com> wrote:
>>>>
>>>> I think it would be interesting to see what a Java stream-based API
>>>> would look like. As I mentioned elsewhere, we are not limited to having
>>>> only one API for Beam.
>>>>
>>>> If I remember correctly, a Java stream API was considered for Dataflow
>>>> back at the very beginning. I don't completely remember why it was
>>>> rejected, but I suspect at least part of the reason might have been that
>>>> Java streams were considered too new and untested back then.
>>>>
>>>> Coders are broken - type variables don't have bounds except Object - and
>>>> reducers are not trivial to implement generally, I guess.
>>>>
>>>> However, being close to this API can help a lot, so +1 to try to have a
>>>> Java DSL on top of the current API. Would also be neat to integrate it with
>>>> CompletionStage :).
>>>>
>>>> Reuven
>>>>
>>>> On Sun, Mar 11, 2018 at 2:29 PM Romain Manni-Bucau <
>>>> rmannibu...@gmail.com> wrote:
>>>>>
>>>>> On Mar 11, 2018 at 21:18, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
>>>>>
>>>>> Hi Romain,
>>>>>
>>>>> I remember we discussed the way to express pipelines a while ago.
>>>>>
>>>>> I was a fan of a "DSL" compared to the one we have in Camel: instead of
>>>>> using apply(), use a dedicated form (like .map(), .reduce(), etc.; AFAIR,
>>>>> that's the approach in Flume).
>>>>> However, we agreed that the apply() syntax gives a more flexible approach.
>>>>>
>>>>> Using Java Stream is interesting, but I'm afraid we would have the same
>>>>> issue as the one we identified discussing a "fluent Java SDK". However, we
>>>>> can have a Stream API DSL on top of the SDK IMHO.
>>>>>
>>>>> Agreed, and a Beam stream interface (copying the JDK API but making lambdas
>>>>> serializable to avoid the need for casts).
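The "making lambdas serializable to avoid the need for casts" point can be made concrete with a small sketch. The interface and method names below are illustrative, not Beam's actual API: with a plain JDK functional type, a serializable lambda needs an intersection-type cast at every call site, while a dedicated functional interface that already extends Serializable removes the cast entirely:

```java
import java.io.Serializable;
import java.util.function.Function;

public class SerializableLambdas {
    // A dedicated functional interface that is Serializable by declaration.
    public interface SerializableFunction<I, O> extends Function<I, O>, Serializable {}

    static <I, O> O applyPlain(Function<I, O> fn, I in) { return fn.apply(in); }
    static <I, O> O applySer(SerializableFunction<I, O> fn, I in) { return fn.apply(in); }

    public static void main(String[] args) {
        // With the plain JDK type, the caller must spell out an
        // intersection-type cast to get a serializable lambda:
        String a = applyPlain((Function<Integer, String> & Serializable) i -> "v" + i, 1);

        // With the dedicated interface, the lambda is serializable
        // with no cast, because the target type already is:
        String b = applySer(i -> "v" + i, 2);

        System.out.println(a + " " + b); // prints v1 v2
    }
}
```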
>>>>>
>>>>> On my side, I think it enables users to discover the API. If you check
>>>>> my PoC impl you quickly see the steps needed to do simple things like a map,
>>>>> which is a first-class citizen.
>>>>>
>>>>> Also curious whether we could implement reduce with pipeline result = get an
>>>>> output of a batch from the runner (client) JVM. I see how to do it for
>>>>> longs - with metrics - but not for collect().
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 11/03/2018 19:46, Romain Manni-Bucau wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I don't know if you already experimented with the Java Stream API as a
>>>>>> replacement for the pipeline API, but I did some tests:
>>>>>> https://github.com/rmannibucau/jbeam
>>>>>>
>>>>>> It is far from complete but already shows where it fails (Beam
>>>>>> doesn't have a way to reduce on the caller machine for instance, coder
>>>>>> handling is not that trivial, lambdas don't work well with the default
>>>>>> Stream API, etc.).
>>>>>>
>>>>>> However, it is interesting to see that having such an API is pretty
>>>>>> natural compared to the pipeline API,
>>>>>> so I wonder if Beam should work on its own Stream API (with surely
>>>>>> another name, for obvious reasons ;)).
>>>>>>
>>>>>> Romain Manni-Bucau
>>>>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog <https://rmannibucau.metawerx.net/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book <https://www.packtpub.com/application-development/java-ee-8-high-performance>
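The fluent shape discussed throughout the thread (collection.map(...).collect(...)) can be prototyped without Beam at all. The sketch below is purely hypothetical - a wrapper class backed by an in-memory List rather than a real PCollection, with made-up names (FluentCollection, SerializableFn) - but it shows how a serializable lambda type lets call sites stay cast-free, which is the ergonomic point Romain raises:

```java
import java.io.Serializable;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical fluent wrapper standing in for a PCollection. A real Beam
// version would translate map to a ParDo and run on a pipeline; here a
// List backs it so the API shape can be tried in isolation.
public class FluentCollection<T> {
    // Serializable lambda type, so the function could be shipped to workers.
    public interface SerializableFn<I, O> extends Function<I, O>, Serializable {}

    private final List<T> elements;

    public FluentCollection(List<T> elements) {
        this.elements = elements;
    }

    // Because the parameter type is already serializable, callers can pass
    // a bare lambda - no intersection-type cast needed.
    public <O> FluentCollection<O> map(SerializableFn<T, O> fn) {
        return new FluentCollection<>(
            elements.stream().map(fn).collect(Collectors.toList()));
    }

    public List<T> get() {
        return elements;
    }

    public static void main(String[] args) {
        List<String> out = new FluentCollection<>(List.of(1, 2, 3))
            .map(i -> "#" + i)
            .get();
        System.out.println(out); // prints [#1, #2, #3]
    }
}
```

A real implementation would also have to solve the problems the thread identifies - coder inference for the lambda's output type, and the lack of a client-side reduce - which this in-memory stand-in sidesteps entirely.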