GBK can be fluent if you pass a key extractor lambda ;)
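
Roughly like this, as a sketch (groupBy here is hypothetical, not an existing Beam method, and Order is a made-up example class):

    // Hypothetical: groupBy(keyFn) keys the collection itself, so the
    // caller never has to produce KV<K, V> elements up front.
    PCollection<Order> orders = ...;
    PCollection<KV<String, Iterable<Order>>> grouped =
        orders.groupBy(order -> order.getCustomerId());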

On Mar 17, 2018 00:00, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:

> Big +1
>
> Regards
> JB
> On Mar 16, 2018, at 15:59, Reuven Lax <re...@google.com> wrote:
>>
>> BTW, while it's true that raw GBK can't be fluent (due to the
>> constraint on element type), once we have schema support we can
>> introduce groupByField, and that can be fluent.
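>>
>> Something like this, for illustration (groupByField is hypothetical
>> and assumes the schema work; Purchase is a made-up example class):
>>
>>     // Hypothetical: the schema supplies field names and types, so
>>     // grouping by a named field needs no explicit KV or coder.
>>     PCollection<Purchase> purchases = ...;
>>     purchases.groupByField("userId");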
>>
>>
>> On Wed, Mar 14, 2018 at 11:50 PM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> On Wed, Mar 14, 2018 at 11:04 PM Romain Manni-Bucau <
>>> rmannibu...@gmail.com>
>>> wrote:
>>>
>>> > On Mar 15, 2018 06:52, "Robert Bradshaw" <rober...@google.com>
>>> wrote:
>>>
>>> >> The stream API was looked at way back when we were designing the
>>> API; one of the primary reasons it was not further pursued at the time
>>> was the demand for Java 7 compatibility. It is also much more natural
>>> with lambdas, but unfortunately the Java compiler discards types in
>>> this case, making coder inference impossible. It is still interesting
>>> to explore, and I've been toying with using this wrapping method for
>>> other applications (specifically, giving a Pandas DataFrame API to
>>> PCollections in Python).
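>>>
>>> To make the erasure point concrete with the existing MapElements API
>>> (a sketch; the surrounding pipeline is elided):
>>>
>>>     PCollection<String> words = ...;
>>>
>>>     // An anonymous SimpleFunction subclass preserves the output
>>>     // type, so Beam can infer a coder for the result:
>>>     words.apply(MapElements.via(
>>>         new SimpleFunction<String, Integer>() {
>>>           @Override public Integer apply(String s) { return s.length(); }
>>>         }));
>>>
>>>     // With a lambda, the compiler erases the types, so the output
>>>     // type (and hence the coder) must be supplied explicitly:
>>>     words.apply(MapElements
>>>         .into(TypeDescriptors.integers())
>>>         .via((String s) -> s.length()));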
>>>
>>> >> There's a higher-level question lingering here about making things
>>> more fluent by putting methods on PCollections in our primary API. It
>>> was somewhat of an experiment to take the very pure approach of
>>> *everything* being expressed as a PTransform, and this is not without
>>> its disadvantages, and (gasp) may be worth revisiting. In particular,
>>> some things that have changed in the meantime are:
>>>
>>> >> * The Java SDK is no longer *the* definition of the model. The model
>>> has
>>> been (mostly) formalized in the portability work, and the general Beam
>>> concepts and notion of PTransform are much more widely fleshed out and
>>> understood.
>>>
>>> > This is wrong for all Java users, who are still the mainstream. It
>>> is important to keep that in mind, even if I know the portable API is
>>> something important for you,
>>>
>>> I think you misunderstood me. My point is that it is now much easier
>>> to disentangle the essence of the Beam model (reified in part in the
>>> portable API) from the Java API itself (which may evolve more
>>> independently, whereas formerly syntactic sugar here would be
>>> conflated with core concepts).
>>>
>>> > it is something which should stay on top of runners and their API,
>>> which means Java for all but one.
>>>
>>> > All that to say that the most common default is Java.
>>>
>>> I don't think it'll be that way for long; Scala alone might give Java
>>> a run for its money.
>>>
>>> > However, I agree each language should have its natural API and
>>> should absolutely not just port over the same API, the goal being to
>>> respect each language's own philosophy.
>>>
>>> > Conclusion: Java needs a more expressive, stream-like API.
>>>
>>> > There is another way to see it: catching up on API debt relative to
>>> competing APIs.
>>>
>>>
>>> >> * Java 8's lambdas, etc. allow for a much more succinct
>>> representation of operations, which makes the relative ratio of
>>> boilerplate when using apply that much higher. This is one of the
>>> struggles we had with the Python API: pcoll.apply(Map(lambda ...))
>>> made the "apply" feel *very* redundant. pcoll | Map(...) is at least
>>> closer to pcoll.map(...).
>>> >> * With over two years of experience with the 100% pure approach, we
>>> still haven't "gotten used to it" enough that adding such methods isn't
>>> appealing. (Note that by design adding such methods later is always
>>> easier
>>> than taking them away, which was one justification for starting at the
>>> extreme point).
>>>
>>> >> Even if we go this route, there's no need to remove apply, and
>>>
>>> >> pcoll
>>> >>      .map(...)
>>> >>      .apply(...)
>>> >>      .flatMap(...)
>>>
>>> >> flows fairly well (with map/flatMap being syntactic sugar for apply).
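>>>
>>> As a sketch of what that sugar might look like (hypothetical methods,
>>> delegating to the existing apply/MapElements machinery):
>>>
>>>     // Hypothetical instance methods on PCollection<T>:
>>>     public <OutputT> PCollection<OutputT> map(
>>>         TypeDescriptor<OutputT> outputType,
>>>         SerializableFunction<T, OutputT> fn) {
>>>       return apply(MapElements.into(outputType).via(fn));
>>>     }
>>>
>>>     public <OutputT> PCollection<OutputT> flatMap(
>>>         TypeDescriptor<OutputT> outputType,
>>>         SerializableFunction<T, Iterable<OutputT>> fn) {
>>>       return apply(FlatMapElements.into(outputType).via(fn));
>>>     }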
>>>
>>> >> Agree, but the issue with that is you lose the natural approach
>>> and it is harder to rework it, whereas having an API on top of
>>> "apply" lets you keep both concerns split.
>>>
>>> Having multiple APIs is undesirable; best to have one unless there
>>> are hard constraints that prevent it (e.g. if the two would be
>>> jarringly inconsistent, or one is forced by an interface, etc.).
>>>
>>> >> Also, the PCollection API is what is complex (coders, side inputs,
>>> ...) and what I hope we can hide behind another API.
>>>
>>> I'd like to simplify things as well.
>>>
>>> >> I think we would also still have to use apply for parameterless
>>> operations like gbk that place constraints on the element types. I
>>> don't see how to do combinePerKey either (though, asymmetrically,
>>> globalCombine is fine).
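>>>
>>> Concretely, GroupByKey only applies to PCollection<KV<K, V>>, a
>>> constraint that apply() lets the compiler check but that an instance
>>> method on PCollection<T> could not express (sketch below; pairs is a
>>> made-up example):
>>>
>>>     // The compiler verifies the element type is KV<String, Long>:
>>>     PCollection<KV<String, Long>> pairs = ...;
>>>     pairs.apply(GroupByKey.create());
>>>
>>>     // A groupByKey() method on PCollection<T> has no way to require
>>>     // "T must be some KV<K, V>" in Java's type system, so it could
>>>     // only fail at runtime.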
>>>
>>> >> The largest fear I have is feature creep. There would have to be a
>>> very clear line of what's in and what's not, likely with what's in
>>> being a very short list (which is probably OK and would give the
>>> biggest gain, but not much discoverability). The criterion can't be
>>> "is a primitive" (gbk is problematic, and the most natural map isn't
>>> really the full ParDo primitive; in fact the full ParDo might be
>>> "advanced" enough to merit requiring apply).
>>>
>>> > Is the previous proposal an issue (the Jet API)?
>>>
>>> At first glance, StreamStage doesn't sound to me like a PCollection
>>> (it mixes the notions of operations and values), and methods like
>>> flatMapUsingContext and hashJoin2 seem far down the slippery slope.
>>> But I haven't spent that much time looking at it.
>>>
>>> >> Who knows, though I still think we made the right decision to attempt
>>> apply-only at the time, maybe I'll have to flesh this out into a new blog
>>> post that is a rebuttal to my original one :).
>>>
>>> > Maybe for some of the users, but clearly not for the ones I met
>>> over the last 3 months (what they said on opening their IDE is
>>> censored ;)).
>>>
>>
