Re: (java) stream & beam?

Jean-Baptiste Onofré Fri, 16 Mar 2018 16:01:05 -0700

Big +1

Regards
JB


On 16 March 2018 at 15:59, Reuven Lax <[email protected]> wrote:
>BTW, while it's true that raw GBK can't be fluent (due to the constraint
>on the element type), once we have schema support we can introduce
>groupByField, and that can be fluent.
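
To make the GBK versus groupByField contrast concrete, here is a minimal, self-contained sketch in plain Java with no Beam dependency. Coll, Pair, groupByKey and the groupByField signature are all hypothetical stand-ins, not Beam's API; in particular, passing a key-extracting function in place of a schema field name is an assumption made only to keep the example runnable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class GroupBySketch {

    // Hypothetical key/value pair, standing in for Beam's KV.
    static final class Pair<K, V> {
        final K key;
        final V value;

        Pair(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    // Hypothetical in-memory stand-in for a PCollection.
    static final class Coll<T> {
        final List<T> elements;

        Coll(List<T> elements) {
            this.elements = new ArrayList<>(elements);
        }

        // Can be an instance method: it works for any T, because the caller
        // says how to obtain the grouping key from each element.
        <K> Map<K, List<T>> groupByField(Function<? super T, ? extends K> field) {
            Map<K, List<T>> groups = new LinkedHashMap<>();
            for (T element : elements) {
                K key = field.apply(element);
                groups.computeIfAbsent(key, k -> new ArrayList<>()).add(element);
            }
            return groups;
        }
    }

    // Cannot be an instance method on Coll<T>: it is only meaningful when T is
    // Pair<K, V>, and an instance method cannot be restricted to that case.
    static <K, V> Map<K, List<V>> groupByKey(Coll<Pair<K, V>> input) {
        Map<K, List<V>> groups = new LinkedHashMap<>();
        for (Pair<K, V> pair : input.elements) {
            groups.computeIfAbsent(pair.key, k -> new ArrayList<>()).add(pair.value);
        }
        return groups;
    }

    public static void main(String[] args) {
        Coll<String> words = new Coll<>(Arrays.asList("a", "bb", "cc"));
        // Fluent: called on the collection itself.
        System.out.println(words.groupByField(String::length));

        List<Pair<String, Integer>> raw = Arrays.asList(
                new Pair<String, Integer>("x", 1),
                new Pair<String, Integer>("y", 2),
                new Pair<String, Integer>("x", 3));
        // Not fluent: the collection has to be passed in as an argument.
        System.out.println(groupByKey(new Coll<>(raw)));
    }
}
```

The design point is only about where the element-type constraint lives: groupByKey needs it in the static signature, while groupByField moves it into the argument.
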
>
>
>On Wed, Mar 14, 2018 at 11:50 PM Robert Bradshaw <[email protected]> wrote:
>
>> On Wed, Mar 14, 2018 at 11:04 PM Romain Manni-Bucau <[email protected]> wrote:
>>
>> > On 15 March 2018 at 06:52, "Robert Bradshaw" <[email protected]> wrote:
>>
>> >> The stream API was looked at way back when we were designing the API;
>> >> one of the primary reasons it was not pursued further at the time was
>> >> the demand for Java 7 compatibility. It is also much more natural with
>> >> lambdas, but unfortunately the Java compiler discards types in this
>> >> case, making coder inference impossible. It is still interesting to
>> >> explore, and I've been toying with using this wrapping method for
>> >> other applications (specifically, giving a Pandas DataFrame API to
>> >> PCollections in Python).
>>
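
The remark about lambdas losing type information can be observed with plain Java reflection. The sketch below is not Beam's coder inference code; LambdaErasureDemo and keepsTypeArguments are hypothetical names, and the lambda result reflects how current JDKs generate lambda classes rather than a guarantee of the language specification.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.function.Function;

class LambdaErasureDemo {

    // True if the object's class records generic type arguments for the
    // first interface it implements.
    static boolean keepsTypeArguments(Object fn) {
        Type[] interfaces = fn.getClass().getGenericInterfaces();
        return interfaces.length > 0 && interfaces[0] instanceof ParameterizedType;
    }

    public static void main(String[] args) {
        // Anonymous class: Function<String, Integer> is written into the
        // class file and can be read back reflectively.
        Function<String, Integer> anonymous = new Function<String, Integer>() {
            @Override
            public Integer apply(String s) {
                return s.length();
            }
        };

        // Lambda: the runtime-generated class implements only the raw
        // Function interface, so the type arguments are not recoverable.
        Function<String, Integer> lambda = s -> s.length();

        System.out.println("anonymous class: " + keepsTypeArguments(anonymous));
        System.out.println("lambda:          " + keepsTypeArguments(lambda));
    }
}
```

Any mechanism that picks a coder by inspecting the function's output type reflectively has something to work with in the first case and nothing in the second, which is why a lambda-based API would need the type to be supplied some other way.
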
>> >> There's a higher-level question lingering here about making things
>> >> more fluent by putting methods on PCollections in our primary API. It
>> >> was somewhat of an experiment to go the very pure approach of
>> >> *everything* being expressed as a PTransform, and this is not without
>> >> its disadvantages, and (gasp) may be worth revisiting. In particular,
>> >> some things that have changed in the meantime are:
>>
>> >> * The Java SDK is no longer *the* definition of the model. The model
>> >>   has been (mostly) formalized in the portability work, and the
>> >>   general Beam concepts and notion of PTransform are much more widely
>> >>   fleshed out and understood.
>>
>> > This is wrong for all Java users, who are still the mainstream. It is
>> > important to keep that in mind, and even if I know the portable API is
>> > something important for you,
>>
>> I think you misunderstood me. My point is that it is now much easier to
>> disentangle the essence of the Beam model (reified in part in the
>> portable API) from the Java API itself (which may evolve more
>> independently, whereas formerly syntactic sugar here would be conflated
>> with core concepts).
>>
>> > it is something which should stay on top of runners and their API,
>> > which means Java for all but one.
>>
>> > All that to say that the most common default is java.
>>
>> I don't think it'll be that way for long; Scala alone might give Java a
>> run for its money.
>>
>> > However, I agree each language should have its natural API and should
>> > absolutely not just port over the same API, the goal being indeed to
>> > respect its own philosophy.
>>
>> > Conclusion: Java needs a more expressive, stream-like API.
>>
>> > There is another way to see it: catching up on API debt compared to
>> > competing APIs.
>>
>>
>> >> * Java 8's lambdas, etc. allow for a much more succinct representation
>> >>   of operations, which makes the relative ratio of boilerplate from
>> >>   using apply that much higher. This is one of the struggles we had
>> >>   with the Python API: pcoll.apply(Map(lambda ...)) made the "apply"
>> >>   feel *very* redundant. pcoll | Map(...) is at least closer to
>> >>   pcoll.map(...).
>> >> * With over two years of experience with the 100% pure approach, we
>> >>   still haven't "gotten used to it" enough that adding such methods
>> >>   isn't appealing. (Note that by design adding such methods later is
>> >>   always easier than taking them away, which was one justification for
>> >>   starting at the extreme point.)
>>
>> >> Even if we go this route, there's no need to remove apply, and
>>
>> >> pcoll
>> >>      .map(...)
>> >>      .apply(...)
>> >>      .flatMap(...)
>>
>> >> flows fairly well (with map/flatMap being syntactic sugar for apply).
>>
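
The idea of map/flatMap as sugar over apply can be sketched without Beam at all. Everything below is a hypothetical toy: Coll stands in for a PCollection, Transform for a PTransform, and upperCase for some transform that has no dedicated method. It only shows that the sugar methods can delegate to apply so the two styles mix in one chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

class FluentSketch {

    // Hypothetical stand-in for a PTransform: turns one collection into another.
    interface Transform<A, B> {
        Coll<B> expand(Coll<A> input);
    }

    // Hypothetical in-memory stand-in for a PCollection.
    static final class Coll<T> {
        private final List<T> elements;

        Coll(List<T> elements) {
            this.elements = new ArrayList<>(elements);
        }

        List<T> elements() {
            return new ArrayList<>(elements);
        }

        // The one general entry point, as in the apply-only style.
        <R> Coll<R> apply(Transform<T, R> transform) {
            return transform.expand(this);
        }

        // Sugar: builds a Transform and routes it through apply.
        <R> Coll<R> map(Function<? super T, ? extends R> fn) {
            Transform<T, R> transform = input -> {
                List<R> out = new ArrayList<>();
                for (T element : input.elements) {
                    out.add(fn.apply(element));
                }
                return new Coll<>(out);
            };
            return apply(transform);
        }

        // Sugar: same pattern, with each element expanding to zero or more outputs.
        <R> Coll<R> flatMap(Function<? super T, ? extends Iterable<? extends R>> fn) {
            Transform<T, R> transform = input -> {
                List<R> out = new ArrayList<>();
                for (T element : input.elements) {
                    for (R result : fn.apply(element)) {
                        out.add(result);
                    }
                }
                return new Coll<>(out);
            };
            return apply(transform);
        }
    }

    // A transform with no dedicated method still goes through apply.
    static Transform<String, String> upperCase() {
        return input -> input.map(s -> s.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Coll<String> lines = new Coll<>(Arrays.asList("a bb", "ccc"));
        List<String> words = lines
                .flatMap(line -> Arrays.asList(line.split(" ")))
                .apply(upperCase())
                .map(word -> word + "!")
                .elements();
        System.out.println(words); // prints [A!, BB!, CCC!]
    }
}
```

Note that this toy sidesteps the coder question entirely, since nothing here needs to know element types at runtime.
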
>> >> Agree, but the issue with that is you lose the natural approach and
>> >> it is harder to rework, whereas having an API on top of "apply" lets
>> >> you keep both concerns split.
>>
>> Having multiple APIs is undesirable; best to have one unless there are
>> hard constraints that prevent it (e.g. if the two would be jarringly
>> inconsistent, or one is forced by an interface, etc.).
>>
>> >> Also, the PCollection API is what is complex (coders, side inputs,
>> >> ...) and what I hope we can hide behind another API.
>>
>> I'd like to simplify things as well.
>>
>> >> I think we would also have to still use apply for parameterless
>> >> operations like gbk that place constraints on the element types. I
>> >> don't see how to do combinePerKey either (though, asymmetrically,
>> >> globalCombine is fine).
>>
>> >> The largest fear I have is feature creep. There would have to be a
>> >> very clear line of what's in and what's not, likely with what's in
>> >> being a very short list (which is probably OK and would give the
>> >> biggest gain, but not much discoverability). The criteria can't be
>> >> primitives (gbk is problematic, and the most natural map isn't really
>> >> the full ParDo primitive--in fact the full ParDo might be "advanced"
>> >> enough to merit requiring apply).
>>
>> > Is the previous proposal an issue (jet api)?
>>
>> At first glance, StreamStage doesn't sound to me like a PCollection (it
>> mixes the notion of operations and values), and methods like
>> flatMapUsingContext and hashJoin2 seem far down the slippery slope. But
>> I haven't spent that much time looking at it.
>>
>> >> Who knows, though I still think we made the right decision to
>> >> attempt apply-only at the time, maybe I'll have to flesh this out
>> >> into a new blog post that is a rebuttal to my original one :).
>>
>> > Maybe for part of the users, clearly not for the ones I met in the
>> > last 3 months (what they said on opening their IDE is censored ;)).
>>
