BTW, while it's true that raw GBK can't be fluent (due to the constraint on element type), once we have schema support we can introduce groupByField, and that can be fluent.
On Wed, Mar 14, 2018 at 11:50 PM Robert Bradshaw <rober...@google.com> wrote:

> On Wed, Mar 14, 2018 at 11:04 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
> > On 15 March 2018 at 06:52, "Robert Bradshaw" <rober...@google.com> wrote:
>
> >> The stream API was looked at way back when we were designing the API; one of the primary reasons it was not further pursued at the time was the demand for Java 7 compatibility. It is also much more natural with lambdas, but unfortunately the Java compiler discards types in this case, making coder inference impossible. Still, it is interesting to explore, and I've been toying with using this wrapping method for other applications (specifically, giving a Pandas Dataframe API to PCollections in Python).
>
> >> There's a higher-level question lingering here about making things more fluent by putting methods on PCollections in our primary API. It was somewhat of an experiment to go the very pure approach of *everything* being expressed as a PTransform, and this is not without its disadvantages, and (gasp) may be worth revisiting. In particular, some things that have changed in the meantime are
>
> >> * The Java SDK is no longer *the* definition of the model. The model has been (mostly) formalized in the portability work, and the general Beam concepts and notion of PTransform are much more widely fleshed out and understood.
>
> > This is wrong for all Java users, who are still the mainstream. It is important to keep that in mind and even if I know portable API is something important for you,
>
> I think you misunderstood me. My point is that it is now much easier to disentangle the essence of the Beam model (reified in part in the portable API) from the Java API itself (which may evolve more independently, whereas formerly syntactic sugar here would be conflated with core concepts).
>
> > it is something which should stay on top of runners and their API, which means Java for all but one.
>
> > All that to say that the most common default is Java.
>
> I don't think it'll be that way for long; Scala alone might give Java a run for its money.
>
> > However, I agree each language should have its natural API and should absolutely not just port over the same API. The goal is indeed to respect its own philosophy.
>
> > Conclusion: Java needs a more expressive stream-like API.
>
> > There is another way to see it: catching up on API debt compared to competing APIs.
>
> >> * Java 8's lambdas, etc. allow for much more succinct representation of operations, which makes the relative ratio of boilerplate of using apply that much higher. This is one of the struggles we had with the Python API: pcoll.apply(Map(lambda ...)) made the "apply" feel *very* redundant. pcoll | Map(...) is at least closer to pcoll.map(...).
> >> * With over two years of experience with the 100% pure approach, we still haven't "gotten used to it" enough that adding such methods isn't appealing. (Note that by design adding such methods later is always easier than taking them away, which was one justification for starting at the extreme point).
>
> >> Even if we go this route, there's no need to remove apply, and
>
> >> pcoll
> >>     .map(...)
> >>     .apply(...)
> >>     .flatMap(...)
>
> >> flows fairly well (with map/flatMap being syntactic sugar to apply).
>
> >> Agree, but the issue with that is you lose the natural approach and it is harder to rework it, whereas having an API on top of "apply" lets you keep both concerns split.
>
> Having multiple APIs is undesirable; best to have one unless there are hard constraints that prevent it (e.g. if the two would be jarringly inconsistent, or one is forced by an interface, etc.)
>
> >> Also, the PCollection API is what is complex (coders, sides, ...) and what I hope we can hide behind another API.
>
> I'd like to simplify things as well.
>
> >> I think we would also have to still use apply for parameterless operations like gbk that place constraints on the element types. I don't see how to do combinePerKey either (though, asymmetrically, globalCombine is fine).
>
> >> The largest fear I have is feature creep. There would have to be a very clear line of what's in and what's not, likely with what's in being a very short list (which is probably OK and would give the biggest gain, but not much discoverability). The criteria can't be primitives (gbk is problematic, and the most natural map isn't really the full ParDo primitive; in fact the full ParDo might be "advanced" enough to merit requiring apply).
>
> > Is the previous proposal an issue (Jet API)?
>
> At first glance, StreamStage doesn't sound to me like a PCollection (it mixes the notion of operations and values), and methods like flatMapUsingContext and hashJoin2 seem far down the slippery slope. But I haven't spent that much time looking at it.
>
> >> Who knows, though I still think we made the right decision to attempt apply-only at the time, maybe I'll have to flesh this out into a new blog post that is a rebuttal to my original one :).
>
> > Maybe for some users, but clearly not for the ones I have met over the last 3 months (what they said opening their IDE is censored ;)).
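
For readers following the thread, the shape being discussed can be sketched roughly as below. This is a minimal in-memory model under stated assumptions, not the Beam API: `FluentSketch`, `Coll`, `Transform`, `mapElements`, `flatMapElements`, `groupByKey` and `kv` are all hypothetical names invented for illustration. It shows map/flatMap as sugar over apply, and why a parameterless groupByKey stays behind apply: an instance method on `Coll<T>` cannot require that `T` be a key/value pair.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical in-memory model; none of these types are Beam classes.
final class FluentSketch {

    /** Stand-in for a PTransform: turns one collection into another. */
    interface Transform<InT, OutT> {
        Coll<OutT> expand(Coll<InT> input);
    }

    /** Stand-in for a PCollection, backed by a plain list. */
    static final class Coll<T> {
        final List<T> elements;

        Coll(List<T> elements) {
            this.elements = elements;
        }

        /** The apply-only entry point: every operation is a Transform. */
        <OutT> Coll<OutT> apply(Transform<T, OutT> transform) {
            return transform.expand(this);
        }

        /** Sugar: map(fn) delegates to apply(mapElements(fn)). */
        <OutT> Coll<OutT> map(Function<T, OutT> fn) {
            return apply(FluentSketch.<T, OutT>mapElements(fn));
        }

        /** Sugar: flatMap(fn) delegates to apply(flatMapElements(fn)). */
        <OutT> Coll<OutT> flatMap(Function<T, List<OutT>> fn) {
            return apply(FluentSketch.<T, OutT>flatMapElements(fn));
        }

        // No groupByKey() method here: it would only make sense when T is a
        // key/value pair, and an instance method cannot constrain T that way.
    }

    static <InT, OutT> Transform<InT, OutT> mapElements(Function<InT, OutT> fn) {
        return input -> {
            List<OutT> out = new ArrayList<>();
            for (InT element : input.elements) {
                out.add(fn.apply(element));
            }
            return new Coll<>(out);
        };
    }

    static <InT, OutT> Transform<InT, OutT> flatMapElements(Function<InT, List<OutT>> fn) {
        return input -> {
            List<OutT> out = new ArrayList<>();
            for (InT element : input.elements) {
                out.addAll(fn.apply(element));
            }
            return new Coll<>(out);
        };
    }

    /** The element-type constraint lives in the transform's input type. */
    static <K, V> Transform<Map.Entry<K, V>, Map.Entry<K, List<V>>> groupByKey() {
        return input -> {
            Map<K, List<V>> groups = new LinkedHashMap<>();
            for (Map.Entry<K, V> kv : input.elements) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            List<Map.Entry<K, List<V>>> out = new ArrayList<>(groups.entrySet());
            return new Coll<>(out);
        };
    }

    /** Helper that builds a key/value pair typed as Map.Entry. */
    static <K, V> Map.Entry<K, V> kv(K key, V value) {
        return new AbstractMap.SimpleImmutableEntry<>(key, value);
    }

    /** Mixed chain: fluent sugar where possible, apply where types constrain. */
    static Coll<Map.Entry<String, List<Integer>>> run() {
        Coll<String> lines = new Coll<>(Arrays.asList("a b", "b c"));
        return lines
            .flatMap(line -> Arrays.asList(line.split(" ")))
            .map(word -> kv(word, 1))
            .apply(groupByKey());
    }

    public static void main(String[] args) {
        for (Map.Entry<String, List<Integer>> entry : run().elements) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

In this model the compiler rejects `.apply(groupByKey())` on a collection whose elements are not `Map.Entry` pairs, which is the kind of check a plain `groupByKey()` method on the collection type could not express.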