I think the existing rationale (not introducing lots of special fluent
methods) makes sense. However, if we look at the Java Stream API, we
probably wouldn't need to introduce *a lot* of fluent builders to get most
of the functionality. Specifically, if we focus on map, flatMap, and
collect from the Stream API, and a few extensions, we get something like:
* collection.map(DoFn) for applying a ParDo
* collection.map(SerializableFn) for Java 8 lambda shorthand
* collection.flatMap(SerializableFn) for Java 8 lambda shorthand
* collection.collect(CombineFn) for applying a CombineFn
* collection.apply(PTransform) for applying a composite transform. Note
that PTransforms could also use serializable lambdas for definition.
(Note that GroupByKey doesn't even show up here -- it could, but it could
also be a way of wrapping a collector, as in Java 8's Collectors.groupingBy.)
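For reference, the Java 8 collector that note alludes to looks like this in plain, non-Beam code; the `byFirstLetter` helper here is just an illustration of how grouping reads as a collector rather than a separate GroupByKey step:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupingDemo {
    // Groups words by their first letter using Collectors.groupingBy,
    // the stream-style analogue of a key-grouping step.
    static Map<Character, List<String>> byFirstLetter(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w.charAt(0)));
    }

    public static void main(String[] args) {
        System.out.println(byFirstLetter(Arrays.asList("apple", "avocado", "banana")));
    }
}
```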
With this, we could write code like:
.map((s) -> s.toString())
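To make the chaining shape concrete, here is a purely illustrative in-memory sketch -- not Beam API, and the names (FluentCollection, get) are invented here -- showing how map/flatMap would compose in that style:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class FluentSketch {
    // Hypothetical stand-in for a PCollection-like type with fluent methods.
    static class FluentCollection<T> {
        private final List<T> items;
        FluentCollection(List<T> items) { this.items = items; }

        // One-to-one transform, analogous to the proposed collection.map(fn).
        <R> FluentCollection<R> map(Function<T, R> fn) {
            List<R> out = new ArrayList<>();
            for (T t : items) out.add(fn.apply(t));
            return new FluentCollection<>(out);
        }

        // One-to-many transform, analogous to collection.flatMap(fn).
        <R> FluentCollection<R> flatMap(Function<T, List<R>> fn) {
            List<R> out = new ArrayList<>();
            for (T t : items) out.addAll(fn.apply(t));
            return new FluentCollection<>(out);
        }

        List<T> get() { return items; }
    }

    public static void main(String[] args) {
        List<String> result = new FluentCollection<>(Arrays.asList(1, 22, 333))
                .map(i -> i.toString())
                .get();
        System.out.println(result); // prints [1, 22, 333]
    }
}
```

The real work in Beam would of course be wiring such calls down to ParDo/Combine transforms and handling coders, which this toy version sidesteps entirely.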
That said, my two concerns are:
(1) having two similar but different Java APIs. If we have a more idiomatic
way of writing pipelines in Java, we should make that the standard.
Otherwise, users will be confused by seeing "Beam" examples written in
multiple, incompatible syntaxes.
(2) making sure the above is truly idiomatic Java and that it doesn't have
any conflicts with the cross-language Beam programming model. I don't think
it does. We have (I believe) chosen to make the Python and Go SDKs idiomatic
for those languages where possible.
If this work is focused on making the Java SDK more idiomatic (and thus
easier for Java users to learn), it seems like a good thing. We should just
make sure it doesn't scope-creep into defining an entirely new DSL or SDK.
On Tue, Mar 13, 2018 at 11:06 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> I know the rationale and it makes sense, but it also increases the barrier
> to entry for users and is not that smooth in IDEs, in particular for custom
> So I really think it makes sense to build a user-friendly API on top of
> the core Beam developer one.
> On 13 Mar 2018 at 18:35, "Aljoscha Krettek" <aljos...@apache.org> wrote:
>> On 11. Mar 2018, at 22:21, Romain Manni-Bucau <rmannibu...@gmail.com>
>> On 12 Mar 2018 at 00:16, "Reuven Lax" <re...@google.com> wrote:
>> I think it would be interesting to see what a Java stream-based API would
>> look like. As I mentioned elsewhere, we are not limited to having only one
>> API for Beam.
>> If I remember correctly, a Java stream API was considered for Dataflow
>> back at the very beginning. I don't completely remember why it was
>> rejected, but I suspect at least part of the reason might have been that
>> Java streams were considered too new and untested back then.
>> Coders are broken -- type variables don't have bounds except Object -- and
>> reducers are not trivial to implement generally, I guess.
>> However, being close to this API can help a lot, so +1 to try to have a
>> Java DSL on top of the current API. Would also be neat to integrate it with
>> CompletionStage :).
>> On Sun, Mar 11, 2018 at 2:29 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>> On 11 Mar 2018 at 21:18, "Jean-Baptiste Onofré" <j...@nanthrax.net>
>>> wrote:
>>> Hi Romain,
>>> I remember we discussed ways to express pipelines a while ago.
>>> I was fan of a "DSL" compared to the one we have in Camel: instead of
>>> using apply(), use a dedicated form (like .map(), .reduce(), etc, AFAIR,
>>> it's the approach in flume).
>>> However, we agreed that apply() syntax gives a more flexible approach.
>>> Using Java Stream is interesting but I'm afraid we would have the same
>>> issue as the one we identified discussing "fluent Java SDK". However, we
>>> can have a Stream API DSL on top of the SDK IMHO.
>>> Agreed, and a Beam stream interface (copying the JDK API but making
>>> lambdas serializable to avoid the need for a cast).
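The trick being referred to is usually a functional interface that extends both Function and Serializable, so a lambda assigned to it is serializable with no cast at the call site (Beam's Java SDK ships an interface of this shape under the name SerializableFunction); a minimal self-contained sketch:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

public class SerializableLambdaDemo {
    // A lambda whose target type extends Serializable is itself serializable,
    // so no "(Function & Serializable)" cast is needed at the call site.
    interface SerializableFunction<T, R> extends Function<T, R>, Serializable {}

    // Round-trips an object through Java serialization, returning the bytes.
    static byte[] serialize(Serializable s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(s);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SerializableFunction<String, Integer> length = String::length;
        byte[] payload = serialize(length); // works: no cast needed
        System.out.println(payload.length > 0 && length.apply("beam") == 4); // prints true
    }
}
```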
>>> On my side, I think it enables users to discover the API. If you check my
>>> PoC implementation you quickly see the steps needed to do simple things
>>> like a map, which is a first-class citizen.
>>> Also curious whether we could implement reduce with the pipeline result,
>>> i.e. get the output of a batch back in the runner (client) JVM. I see how
>>> to do it for longs -- with metrics -- but not for collect().
>>> On 11/03/2018 19:46, Romain Manni-Bucau wrote:
>>>> Hi guys,
>>>> I don't know if you have already experimented with the Java Stream API as
>>>> a replacement for the pipeline API, but I did some tests:
>>>> It is far from complete but already shows where it fails (Beam doesn't
>>>> have a way to reduce on the caller machine, for instance; coder handling
>>>> is not that trivial; lambdas do not work well with the default Stream API).
>>>> However, it is interesting to see that having such an API is pretty
>>>> natural compared to the pipeline API, so I wonder if Beam should work on
>>>> its own Stream API (with surely another name, for obvious reasons ;)).
>>>> Romain Manni-Bucau