Yes. The very original Python API didn't have GBK, just a
lambda-parameterized groupBy.
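
To make the difference concrete, here is a minimal Java sketch. It is not Beam code: Coll, groupBy and groupByKey are hypothetical names invented for illustration. A parameterless groupByKey() cannot be an instance method on Coll<T>, because an instance method has no way to require that T is a key/value pair, so it ends up as a static helper. A groupBy that takes a key-extractor lambda works for any T and chains fluently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a PCollection; not part of any Beam SDK.
final class Coll<T> {
    private final List<T> elements;

    Coll(List<T> elements) {
        this.elements = new ArrayList<>(elements);
    }

    List<T> elements() {
        return elements;
    }

    // Fluent for every T: the lambda supplies the key, so T needs no constraint.
    <K> Coll<Map.Entry<K, List<T>>> groupBy(Function<? super T, ? extends K> keyFn) {
        Map<K, List<T>> groups = new LinkedHashMap<>();
        for (T element : elements) {
            K key = keyFn.apply(element);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(element);
        }
        List<Map.Entry<K, List<T>>> result = new ArrayList<>(groups.entrySet());
        return new Coll<>(result);
    }

    // "Raw" group-by-key only makes sense when the elements are key/value pairs.
    // An instance method cannot demand that of T, so this has to be static.
    static <K, V> Coll<Map.Entry<K, List<V>>> groupByKey(Coll<Map.Entry<K, V>> input) {
        Map<K, List<V>> groups = new LinkedHashMap<>();
        for (Map.Entry<K, V> kv : input.elements()) {
            groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        List<Map.Entry<K, List<V>>> result = new ArrayList<>(groups.entrySet());
        return new Coll<>(result);
    }
}
```

With this shape, words.groupBy(w -> w.substring(0, 1)) chains directly off any Coll<String>, while Coll.groupByKey(pairs) only compiles when the elements are already Map.Entry pairs.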

On Sat, Mar 17, 2018, 12:21 AM Romain Manni-Bucau wrote:

> Gbk can be fluent if you pass a key extractor lambda ;)
> On Mar 17, 2018, at 00:00, "Jean-Baptiste Onofré" wrote:
>> Big +1
>> Regards
>> JB
>> On Mar 16, 2018, at 15:59, Reuven Lax wrote:
>>> BTW, while it's true that raw GBK can't be fluent (due to the constraint on
>>> the element type), once we have schema support we can introduce groupByField,
>>> and that can be fluent.
>>> On Wed, Mar 14, 2018 at 11:50 PM Robert Bradshaw wrote:
>>>> On Wed, Mar 14, 2018 at 11:04 PM Romain Manni-Bucau wrote:
>>>> > On Mar 15, 2018, at 06:52, "Robert Bradshaw" wrote:
>>>> >> The stream API was looked at way back when we were designing the API;
>>>> >> one of the primary reasons it was not further pursued at the time was the
>>>> >> demand for Java 7 compatibility. It is also much more natural with lambdas,
>>>> >> but unfortunately the Java compiler discards types in this case, making
>>>> >> coder inference impossible. It is still interesting to explore, and I've
>>>> >> been toying with using this wrapping method for other applications
>>>> >> (specifically, giving a Pandas DataFrame API to PCollections in Python).
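
The lambda limitation can be seen with plain reflection. The sketch below is not Beam's coder inference; outputTypeOf is a hypothetical helper that only mimics the idea of reading a function's output type from its class. An anonymous class records its generic interface in the class file, whereas the class synthesized for a lambda (on the JVMs I am aware of) exposes only the raw interface, so there is nothing left to infer a coder from.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.function.Function;

final class LambdaTypeErasure {
    // Hypothetical helper: returns the declared output type of fn,
    // or null when the generic signature is not available.
    static Type outputTypeOf(Function<?, ?> fn) {
        for (Type iface : fn.getClass().getGenericInterfaces()) {
            if (iface instanceof ParameterizedType) {
                ParameterizedType p = (ParameterizedType) iface;
                if (p.getRawType() == Function.class) {
                    // Function<T, R>: index 1 is the output type R.
                    return p.getActualTypeArguments()[1];
                }
            }
        }
        return null;
    }

    // Anonymous class: Function<String, Integer> is kept in the class signature.
    static Function<String, Integer> anonymousClass() {
        return new Function<String, Integer>() {
            @Override
            public Integer apply(String s) {
                return s.length();
            }
        };
    }

    // Lambda: the runtime class only implements the raw Function interface.
    static Function<String, Integer> lambda() {
        return s -> s.length();
    }
}
```

Calling outputTypeOf on the anonymous class yields Integer.class, while the same call on the lambda finds no parameterized interface, which is why lambda-based APIs typically need the output type or coder to be supplied explicitly.
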
>>>> >> There's a higher-level question lingering here about making things more
>>>> >> fluent by putting methods on PCollections in our primary API. It was
>>>> >> somewhat of an experiment to go with the very pure approach of *everything*
>>>> >> being expressed as a PTransform, and this is not without its disadvantages,
>>>> >> and (gasp) may be worth revisiting. In particular, some things that have
>>>> >> changed in the meantime are:
>>>> >> * The Java SDK is no longer *the* definition of the model. The model has
>>>> >> been (mostly) formalized in the portability work, and the general Beam
>>>> >> concepts and notion of PTransform are much more widely fleshed out and
>>>> >> understood.
>>>> > This is wrong for all Java users, who are still the mainstream. It is
>>>> > important to keep that in mind, and even if I know the portable API is
>>>> > something important for you,
>>>> I think you misunderstood me. My point is that it is now much easier to
>>>> disentangle the essence of the Beam model (reified in part in the portable
>>>> API) from the Java API itself (which may evolve more independently, whereas
>>>> formerly syntactic sugar here would be conflated with core concepts).
>>>> > it is something which should stay on top of runners and their API, which
>>>> > means Java for all but one.
>>>> > All that to say that the most common default is Java.
>>>> I don't think it'll be that way for long; Scala alone might give Java a run
>>>> for its money.
>>>> > However, I agree each language should have its natural API and should
>>>> > absolutely not just port over the same API, the goal indeed being to respect
>>>> > its own philosophy.
>>>> > Conclusion: Java needs a more expressive, stream-like API.
>>>> > There is another way to see it: catching up on API debt compared to
>>>> > competing APIs.
>>>> >> * Java 8's lambdas, etc. allow for a much more succinct representation of
>>>> >> operations, which makes the relative ratio of boilerplate from using apply
>>>> >> that much higher. This is one of the struggles we had with the Python API:
>>>> >> pcoll.apply(Map(lambda ...)) made the "apply" feel *very* redundant.
>>>> >> pcoll | Map(...) is at least closer to
>>>> >> * With over two years of experience with the 100% pure approach, we
>>>> >> still haven't "gotten used to it" enough that adding such methods isn't
>>>> >> appealing. (Note that by design adding such methods later is always easier
>>>> >> than taking them away, which was one justification for starting at the
>>>> >> extreme point).
>>>> >> Even if we go this route, there's no need to remove apply, and
>>>> >> pcoll
>>>> >>      .map(...)
>>>> >>      .apply(...)
>>>> >>      .flatMap(...)
>>>> >> flows fairly well (with map/flatMap being syntactic sugar to apply).
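
A minimal sketch of how such sugar could sit on top of apply, assuming hypothetical Transform and FluentColl types that merely stand in for PTransform and PCollection (nothing here is Beam API). map and flatMap each build a transform and hand it to apply, so both styles can be mixed in one chain.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for PTransform: turns one collection into another.
interface Transform<I, O> {
    FluentColl<O> expand(FluentColl<I> input);
}

// Hypothetical stand-in for PCollection with two convenience methods.
final class FluentColl<T> {
    final List<T> elements;

    FluentColl(List<T> elements) {
        this.elements = elements;
    }

    // The one general entry point; everything else delegates to it.
    <O> FluentColl<O> apply(Transform<T, O> transform) {
        return transform.expand(this);
    }

    // Sugar: flatMap is apply with a transform built from the lambda.
    <O> FluentColl<O> flatMap(Function<? super T, ? extends Iterable<? extends O>> fn) {
        Transform<T, O> transform = input -> {
            List<O> out = new ArrayList<>();
            for (T element : input.elements) {
                for (O o : fn.apply(element)) {
                    out.add(o);
                }
            }
            return new FluentColl<>(out);
        };
        return apply(transform);
    }

    // Sugar: map is a flatMap that emits exactly one output per input.
    <O> FluentColl<O> map(Function<? super T, ? extends O> fn) {
        Function<T, List<O>> one =
            element -> Collections.<O>singletonList(fn.apply(element));
        return flatMap(one);
    }
}
```

Because every convenience method bottoms out in apply, keeping apply alongside the sugar costs nothing, and anything too advanced for the short list of methods can still be passed in as a Transform.
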
>>>> >> Agree, but the issue with that is you lose the natural approach and it
>>>> >> is harder to rework it, whereas having an API on top of "apply" lets you
>>>> >> keep both concerns split.
>>>> Having multiple APIs is undesirable; it is best to have one unless there are
>>>> hard constraints that prevent it (e.g. if the two would be jarringly
>>>> inconsistent, or one is forced by an interface, etc.).
>>>> >> Also, the PCollection API is what is complex (coders, sides, ...) and what
>>>> >> I hope we can hide behind another API.
>>>> I'd like to simplify things as well.
>>>> >> I think we would also have to still use apply for parameterless
>>>> >> operations like gbk that place constraints on the element types. I don't
>>>> >> see how to do combinePerKey either (though, asymmetrically, globalCombine
>>>> >> is fine).
>>>> >> The largest fear I have is feature creep. There would have to be a very
>>>> >> clear line of what's in and what's not, likely with what's in being a very
>>>> >> short list (which is probably OK and would give the biggest gain, but not
>>>> >> much discoverability). The criteria can't be primitives (gbk is
>>>> >> problematic, and the most natural map isn't really the full ParDo
>>>> >> primitive--in fact the full ParDo might be "advanced" enough to merit
>>>> >> requiring apply).
>>>> > Is the previous proposal an issue (Jet API)?
>>>> At first glance, StreamStage doesn't sound to me like a PCollection (mixes
>>>> the notion of operations and values), and methods like flatMapUsingContext
>>>> and hashJoin2 seem far down the slippery slope. But I haven't spent that
>>>> much time looking at it.
>>>> >> Who knows, though I still think we made the right decision to attempt
>>>> >> apply-only at the time, maybe I'll have to flesh this out into a new blog
>>>> >> post that is a rebuttal to my original one :).
>>>> > Maybe for part of the users, clearly not for the ones I met in the last 3
>>>> > months (what they said opening their IDE is censored ;)).
