GBK can be fluent if you pass a key extractor lambda ;)
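A minimal sketch of that idea, assuming a hypothetical Fluent wrapper over PCollection (Fluent and groupBy are illustrative names, not existing Beam API; WithKeys and GroupByKey are the real transforms doing the work underneath):

    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.SerializableFunction;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptor;

    // Hypothetical fluent wrapper around PCollection; not Beam API.
    class Fluent<T> {
      private final PCollection<T> pcoll;

      Fluent(PCollection<T> pcoll) {
        this.pcoll = pcoll;
      }

      // The key extractor hides the KV pairing that keeps raw GBK from
      // being fluent. The key TypeDescriptor is passed explicitly since
      // the lambda's type parameters are erased by the compiler.
      <K> Fluent<KV<K, Iterable<T>>> groupBy(
          SerializableFunction<T, K> keyFn, TypeDescriptor<K> keyType) {
        return new Fluent<>(
            pcoll
                .apply(WithKeys.<K, T>of(keyFn).withKeyType(keyType))
                .apply(GroupByKey.<K, T>create()));
      }
    }

Note the caller would still spell out the key TypeDescriptor next to the lambda, e.g. groupBy(w -> w.substring(0, 1), TypeDescriptors.strings()) on a PCollection<String>, which is the same erasure constraint Robert raises below.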
On 17 Mar 2018 at 00:00, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:

> Big +1
>
> Regards
> JB
>
> On 16 Mar 2018, at 15:59, Reuven Lax <re...@google.com> wrote:
>
>> BTW, while it's true that raw GBK can't be fluent (due to the
>> constraint on element type), once we have schema support we can
>> introduce groupByField, and that can be fluent.
>>
>> On Wed, Mar 14, 2018 at 11:50 PM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> On Wed, Mar 14, 2018 at 11:04 PM Romain Manni-Bucau
>>> <rmannibu...@gmail.com> wrote:
>>>
>>> > On 15 Mar 2018 at 06:52, "Robert Bradshaw" <rober...@google.com>
>>> > wrote:
>>>
>>> >> The stream API was looked at way back when we were designing the
>>> >> API; one of the primary reasons it was not further pursued at the
>>> >> time was the demand for Java 7 compatibility. It is also much more
>>> >> natural with lambdas, but unfortunately the Java compiler discards
>>> >> types in this case, making coder inference impossible. It is still
>>> >> interesting to explore, and I've been toying with using this
>>> >> wrapping method for other applications (specifically, giving a
>>> >> Pandas DataFrame API to PCollections in Python).
>>>
>>> >> There's a higher-level question lingering here about making things
>>> >> more fluent by putting methods on PCollections in our primary API.
>>> >> It was somewhat of an experiment to go the very pure approach of
>>> >> *everything* being expressed as a PTransform, and this is not
>>> >> without its disadvantages, and (gasp) may be worth revisiting. In
>>> >> particular, some things that have changed in the meantime are:
>>>
>>> >> * The Java SDK is no longer *the* definition of the model. The
>>> >> model has been (mostly) formalized in the portability work, and
>>> >> the general Beam concepts and notion of PTransform are much more
>>> >> widely fleshed out and understood.
>>>
>>> > This is wrong for all Java users, who are still the mainstream. It
>>> > is important to keep that in mind, and even if I know the portable
>>> > API is something important for you,
>>>
>>> I think you misunderstood me. My point is that it is now much easier
>>> to disentangle the essence of the Beam model (reified in part in the
>>> portable API) from the Java API itself (which may evolve more
>>> independently, whereas formerly syntactic sugar here would be
>>> conflated with core concepts).
>>>
>>> > it is something which should stay on top of runners and their API,
>>> > which means Java for all but one.
>>>
>>> > All that to say that the most common default is Java.
>>>
>>> I don't think it'll be that way for long; Scala alone might give Java
>>> a run for its money.
>>>
>>> > However, I agree each language should have its natural API and
>>> > should absolutely not just port over the same API, the goal being
>>> > to respect each language's own philosophy.
>>>
>>> > Conclusion: Java needs a more expressive, stream-like API.
>>>
>>> > There is another way to see it: catching up on API debt relative to
>>> > competing APIs.
>>>
>>> >> * Java 8's lambdas, etc. allow for a much more succinct
>>> >> representation of operations, which makes the relative ratio of
>>> >> boilerplate from using apply that much higher. This is one of the
>>> >> struggles we had with the Python API: pcoll.apply(Map(lambda ...))
>>> >> made the "apply" feel *very* redundant. pcoll | Map(...) is at
>>> >> least closer to pcoll.map(...).
>>>
>>> >> * With over two years of experience with the 100% pure approach,
>>> >> we still haven't "gotten used to it" enough that adding such
>>> >> methods isn't appealing. (Note that by design adding such methods
>>> >> later is always easier than taking them away, which was one
>>> >> justification for starting at the extreme point.)
>>>
>>> >> Even if we go this route, there's no need to remove apply, and
>>>
>>> >>     pcoll
>>> >>         .map(...)
>>> >>         .apply(...)
>>> >>         .flatMap(...)
>>>
>>> >> flows fairly well (with map/flatMap being syntactic sugar for
>>> >> apply).
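For illustration, a rough sketch of that map/flatMap-as-sugar idea, in the same hypothetical Fluent style as the earlier sketch (Fluent, map, and flatMap are illustrative names, not Beam API; MapElements, FlatMapElements, and PTransform are the real pieces, with TypeDescriptors passed explicitly because lambdas erase types):

    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.transforms.SerializableFunction;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptor;

    // Hypothetical sugar over apply; not part of the Beam API.
    class Fluent<T> {
      private final PCollection<T> pcoll;

      Fluent(PCollection<T> pcoll) {
        this.pcoll = pcoll;
      }

      // map is just MapElements behind the scenes; the output
      // TypeDescriptor is explicit because the lambda's type is erased.
      <O> Fluent<O> map(
          TypeDescriptor<O> type, SerializableFunction<T, O> fn) {
        return new Fluent<>(pcoll.apply(MapElements.into(type).via(fn)));
      }

      // flatMap wraps FlatMapElements the same way.
      <O> Fluent<O> flatMap(
          TypeDescriptor<O> type, SerializableFunction<T, Iterable<O>> fn) {
        return new Fluent<>(
            pcoll.apply(FlatMapElements.into(type).via(fn)));
      }

      // Everything not covered by the sugar still goes through apply.
      <O> Fluent<O> apply(
          PTransform<PCollection<T>, PCollection<O>> transform) {
        return new Fluent<>(pcoll.apply(transform));
      }
    }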
>>> > Agree, but the issue with that is that you lose the natural
>>> > approach and it is harder to rework it, whereas having an API on
>>> > top of "apply" lets you keep both concerns split.
>>>
>>> Having multiple APIs is undesirable; best to have one unless there
>>> are hard constraints that prevent it (e.g. if the two would be
>>> jarringly inconsistent, or one is forced by an interface, etc.).
>>>
>>> > Also, the PCollection API is what is complex (coders, side inputs,
>>> > ...) and what I hope we can hide behind another API.
>>>
>>> I'd like to simplify things as well.
>>>
>>> >> I think we would also have to still use apply for parameterless
>>> >> operations like gbk that place constraints on the element types.
>>> >> I don't see how to do combinePerKey either (though, asymmetrically,
>>> >> globalCombine is fine).
>>>
>>> >> The largest fear I have is feature creep. There would have to be a
>>> >> very clear line of what's in and what's not, likely with what's in
>>> >> being a very short list (which is probably OK and would give the
>>> >> biggest gain, but not much discoverability). The criteria can't be
>>> >> the primitives (gbk is problematic, and the most natural map isn't
>>> >> really the full ParDo primitive--in fact the full ParDo might be
>>> >> "advanced" enough to merit requiring apply).
>>>
>>> > Is the previous proposal (the Jet API) an issue?
>>>
>>> At first glance, StreamStage doesn't sound to me like a PCollection
>>> (it mixes the notions of operations and values), and methods like
>>> flatMapUsingContext and hashJoin2 seem far down the slippery slope.
>>> But I haven't spent that much time looking at it.
>>>
>>> >> Who knows, though I still think we made the right decision to
>>> >> attempt apply-only at the time; maybe I'll have to flesh this out
>>> >> into a new blog post that is a rebuttal to my original one :)
>>>
>>> > Maybe for part of the users, clearly not for the ones I met over
>>> > the last 3 months (what they said on opening their IDE is censored
>>> > ;)).
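As a footnote to the coder-inference point raised earlier in the thread, a minimal, self-contained illustration of why the types matter: with a lambda the compiler erases the type parameters, so today's MapElements needs an explicit TypeDescriptor via into(...). (The class name and pipeline scaffold are illustrative; into/via and the transforms used are the actual Beam API.)

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class ErasureDemo {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
        PCollection<String> words = p.apply(Create.of("fluent", "apply"));

        // The lambda compiles to a synthetic class with erased type
        // parameters, so without into(...) Beam cannot infer an output
        // Coder; the TypeDescriptor restores what the compiler dropped.
        PCollection<Integer> lengths =
            words.apply(
                MapElements.into(TypeDescriptors.integers())
                    .via((String s) -> s.length()));

        p.run().waitUntilFinish();
      }
    }

Any fluent layer, whether a groupBy key extractor or map/flatMap sugar, inherits this constraint: the TypeDescriptor (or a registered coder) has to come from somewhere.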