Big +1

Regards
JB
On 16 March 2018 at 15:59, Reuven Lax <re...@google.com> wrote:
> BTW, while it's true that raw GBK can't be fluent (due to the constraint on
> element type), once we have schema support we can introduce groupByField,
> and that can be fluent.
>
> On Wed, Mar 14, 2018 at 11:50 PM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> On Wed, Mar 14, 2018 at 11:04 PM Romain Manni-Bucau <rmannibu...@gmail.com>
>> wrote:
>>
>> > On 15 March 2018 at 06:52, "Robert Bradshaw" <rober...@google.com> wrote:
>>
>> >> The stream API was looked at way back when we were designing the API;
>> >> one of the primary reasons it was not further pursued at the time was
>> >> the demand for Java 7 compatibility. It is also much more natural with
>> >> lambdas, but unfortunately the Java compiler discards types in this
>> >> case, making coder inference impossible. Still, it is interesting to
>> >> explore, and I've been toying with using this wrapping method for other
>> >> applications (specifically, giving a Pandas DataFrame API to
>> >> PCollections in Python).
>>
>> >> There's a higher-level question lingering here about making things more
>> >> fluent by putting methods on PCollections in our primary API. It was
>> >> somewhat of an experiment to go the very pure approach of *everything*
>> >> being expressed as a PTransform, and this is not without its
>> >> disadvantages, and (gasp) may be worth revisiting. In particular, some
>> >> things that have changed in the meantime are:
>>
>> >> * The Java SDK is no longer *the* definition of the model. The model
>> >>   has been (mostly) formalized in the portability work, and the general
>> >>   Beam concepts and notion of PTransform are much more widely fleshed
>> >>   out and understood.
>>
>> > This is wrong for all Java users, who are still the mainstream. It is
>> > important to keep that in mind, and even if I know the portable API is
>> > something important for you,
>>
>> I think you misunderstood me.
>> My point is that it is now much easier to disentangle the essence of the
>> Beam model (reified in part in the portable API) from the Java API itself
>> (which may evolve more independently, whereas formerly syntactic sugar
>> here would be conflated with core concepts).
>>
>> > it is something which should stay on top of runners and their API,
>> > which means Java for all but one.
>>
>> > All that to say that the most common default is Java.
>>
>> I don't think it'll be that way for long; Scala alone might give Java a
>> run for its money.
>>
>> > However, I agree each language should have its natural API and should
>> > absolutely not just port over the same API, the goal being indeed to
>> > respect its own philosophy.
>>
>> > Conclusion: Java needs a more expressive, stream-like API.
>>
>> > There is another way to see it: catching up on API debt compared to
>> > competing APIs.
>>
>> >> * Java 8's lambdas, etc. allow for a much more succinct representation
>> >>   of operations, which makes the relative ratio of boilerplate of using
>> >>   apply that much higher. This is one of the struggles we had with the
>> >>   Python API: pcoll.apply(Map(lambda ...)) made the "apply" feel *very*
>> >>   redundant. pcoll | Map(...) is at least closer to pcoll.map(...).
>>
>> >> * With over two years of experience with the 100% pure approach, we
>> >>   still haven't "gotten used to it" enough that adding such methods
>> >>   isn't appealing. (Note that by design adding such methods later is
>> >>   always easier than taking them away, which was one justification for
>> >>   starting at the extreme point.)
>>
>> >> Even if we go this route, there's no need to remove apply, and
>> >>
>> >> pcoll
>> >>     .map(...)
>> >>     .apply(...)
>> >>     .flatMap(...)
>> >>
>> >> flows fairly well (with map/flatMap being syntactic sugar to apply).
>>
>> >> Agree, but the issue with that is you lose the natural approach and it
>> >> is harder to rework it, whereas having an API on top of "apply" lets
>> >> you keep both concerns split.
>>
>> Having multiple APIs is undesirable; best to have one unless there are
>> hard constraints that prevent it (e.g. if the two would be jarringly
>> inconsistent, or one is forced by an interface, etc.)
>>
>> >> Also, the PCollection API is what is complex (coders, sides, ...) and
>> >> what I hope we can hide behind another API.
>>
>> I'd like to simplify things as well.
>>
>> >> I think we would also still have to use apply for parameterless
>> >> operations like gbk that place constraints on the element types. I
>> >> don't see how to do combinePerKey either (though, asymmetrically,
>> >> globalCombine is fine).
>>
>> >> The largest fear I have is feature creep. There would have to be a very
>> >> clear line of what's in and what's not, likely with what's in being a
>> >> very short list (which is probably OK and would give the biggest gain,
>> >> but not much discoverability). The criteria can't be primitives (gbk is
>> >> problematic, and the most natural map isn't really the full ParDo
>> >> primitive--in fact the full ParDo might be "advanced" enough to merit
>> >> requiring apply).
>>
>> > Is the previous proposal an issue (Jet API)?
>>
>> At first glance, StreamStage doesn't sound to me like a PCollection (it
>> mixes the notion of operations and values), and methods like
>> flatMapUsingContext and hashJoin2 seem far down the slippery slope. But I
>> haven't spent that much time looking at it.
>>
>> >> Who knows, though I still think we made the right decision to attempt
>> >> apply-only at the time, maybe I'll have to flesh this out into a new
>> >> blog post that is a rebuttal to my original one :).
>>
>> > Maybe for part of the users, clearly not for the ones I met in the last
>> > 3 months (what they said on opening their IDE is censored ;)).
>>
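
For readers following the thread, the idea of map/flatMap being "syntactic sugar to apply" can be illustrated with a toy model. The Python sketch below is not Beam code: Collection, Transform, Map and FlatMap are hypothetical stand-ins defined here over in-memory lists. It only shows, under those assumptions, how fluent methods could delegate to a single apply entry point, with | as an alias in the spirit of pcoll | Map(...).

```python
from typing import Any, Callable, Iterable, List


class Transform:
    """Hypothetical stand-in for a PTransform: a function over a whole collection."""

    def __init__(self, fn: Callable[[List[Any]], List[Any]]) -> None:
        self._fn = fn

    def expand(self, elements: List[Any]) -> List[Any]:
        return self._fn(elements)


def Map(fn: Callable[[Any], Any]) -> Transform:
    """Element-wise transform: one output per input."""
    return Transform(lambda xs: [fn(x) for x in xs])


def FlatMap(fn: Callable[[Any], Iterable[Any]]) -> Transform:
    """Element-wise transform: zero or more outputs per input."""
    return Transform(lambda xs: [y for x in xs for y in fn(x)])


class Collection:
    """Hypothetical stand-in for a PCollection, backed by a plain list."""

    def __init__(self, elements: Iterable[Any]) -> None:
        self.elements = list(elements)

    def apply(self, transform: Transform) -> "Collection":
        # The single "pure" entry point: every operation goes through here.
        return Collection(transform.expand(self.elements))

    def __or__(self, transform: Transform) -> "Collection":
        # coll | Map(...) is just another spelling of coll.apply(Map(...)).
        return self.apply(transform)

    def map(self, fn: Callable[[Any], Any]) -> "Collection":
        # Fluent sugar: delegates to apply, adds no new capability.
        return self.apply(Map(fn))

    def flat_map(self, fn: Callable[[Any], Iterable[Any]]) -> "Collection":
        # Fluent sugar: delegates to apply, adds no new capability.
        return self.apply(FlatMap(fn))


lines = Collection(["to be", "or not"])

# Fluent style, mixing sugar with an explicit apply.
fluent = lines.flat_map(str.split).apply(Map(str.upper))

# Pure apply-only style, spelled with the | alias.
piped = lines | FlatMap(str.split) | Map(str.upper)
```

Both chains build the same result because the sugar methods are one-line delegations to apply. An operation that constrains the element type, such as gbk in the discussion above, has no counterpart in this toy model and would still go through apply.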