On 15 March 2018 at 07:50, "Robert Bradshaw" <rober...@google.com> wrote:
On Wed, Mar 14, 2018 at 11:04 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:

> On 15 March 2018 at 06:52, "Robert Bradshaw" <rober...@google.com> wrote:

>> The stream API was looked at way back when we were designing the API; one of the primary reasons it was not further pursued at the time was the demand for Java 7 compatibility. It is also much more natural with lambdas, but unfortunately the Java compiler discards types in this case, making coder inference impossible. It is still interesting to explore, and I've been toying with using this wrapping method for other applications (specifically, giving a Pandas DataFrame API to PCollections in Python).

>> There's a higher level question lingering here about making things more fluent by putting methods on PCollections in our primary API. It was somewhat of an experiment to go the very pure approach of *everything* being expressed as a PTransform, and this is not without its disadvantages, and (gasp) may be worth revisiting. In particular, some things that have changed in the meantime are

>> * The Java SDK is no longer *the* definition of the model. The model has been (mostly) formalized in the portability work, and the general Beam concepts and notion of PTransform are much more widely fleshed out and understood.

> This is wrong for all Java users, who are still the mainstream. It is important to keep that in mind and even if I know portable API is something important for you,

I think you misunderstood me. My point is that it is now much easier to disentangle the essence of the Beam model (reified in part in the portable API) from the Java API itself (which may evolve more independently, whereas formerly syntactic sugar here would be conflated with core concepts).

Oh ok. Agree.

> it is something which should stay on top of runners and their API, which means Java for all but one.

> All that to say that the most common default is Java.

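As an aside on the remark above that the Java compiler discards types for lambdas, here is a minimal sketch of the effect using plain java.util.function.Function rather than any Beam class. The class LambdaErasureSketch and its helper outputTypeOf are hypothetical names introduced only for this illustration; the lambda result shown is what common OpenJDK-based JVMs report and is not guaranteed by the language specification.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.function.Function;

// Hypothetical illustration only; not part of Beam.
final class LambdaErasureSketch {

    // Returns the output type argument recorded on the runtime class of fn,
    // or null when the class carries no generic signature for Function.
    static Type outputTypeOf(Function<?, ?> fn) {
        for (Type iface : fn.getClass().getGenericInterfaces()) {
            if (iface instanceof ParameterizedType) {
                ParameterizedType p = (ParameterizedType) iface;
                if (p.getRawType() == Function.class) {
                    return p.getActualTypeArguments()[1];
                }
            }
        }
        return null;
    }

    // An anonymous class keeps Function<String, Integer> in its class file.
    static Function<String, Integer> anonymous() {
        return new Function<String, Integer>() {
            @Override
            public Integer apply(String s) {
                return s.length();
            }
        };
    }

    // The same logic as a lambda: the synthetic class implements raw Function.
    static Function<String, Integer> lambda() {
        return s -> s.length();
    }

    public static void main(String[] args) {
        System.out.println(outputTypeOf(anonymous())); // class java.lang.Integer
        System.out.println(outputTypeOf(lambda()));    // null on OpenJDK-based JVMs
    }
}
```

With nothing to reflect on, an API that wants to pick a coder from the output type would need the type supplied some other way when a lambda is used.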
I don't think it'll be that way for long; Scala alone might give Java a run for its money.

Scala will probably need its own API, but it also generally goes with the best-of-breed approach, which is the opposite of Beam by design (vendor portability gives much more important guarantees, but it is not always the best), so let's see how it goes :).

> However I agree each language should have its natural API and should absolutely not just port over the same API. Goal being indeed to respect its own philosophy.

> Conclusion: Java needs a more expressive stream-like API.

> There is another way to see it: catching up on API debt compared to competing APIs.

>> * Java 8's lambdas, etc. allow for much more succinct representation of operations, which makes the relative ratio of boilerplate of using apply that much higher. This is one of the struggles we had with the Python API, pcoll.apply(Map(lambda ...)) made the "apply" feel *very* redundant. pcoll | Map(...) is at least closer to pcoll.map(...).

>> * With over two years of experience with the 100% pure approach, we still haven't "gotten used to it" enough that adding such methods isn't appealing. (Note that by design adding such methods later is always easier than taking them away, which was one justification for starting at the extreme point).

>> Even if we go this route, there's no need to remove apply, and
>>
>> pcoll
>>     .map(...)
>>     .apply(...)
>>     .flatMap(...)
>>
>> flows fairly well (with map/flatMap being syntactic sugar to apply).

>> Agree, but the issue with that is you lose the natural approach and it is harder to rework it, whereas having an API on top of "apply" lets you keep both concerns split.

Having multiple APIs is undesirable; best to have one unless there are hard constraints that prevent it (e.g. if the two would be jarringly inconsistent, or one is forced by an interface, etc.)

>> Also the PCollection API is what is complex (coders, sides, ...) and what I hope we can hide behind another API.

I'd like to simplify things as well.

>> I think we would also have to still use apply for parameterless operations like gbk that place constraints on the element types. I don't see how to do combinePerKey either (though, asymmetrically, globalCombine is fine).

>> The largest fear I have is feature creep. There would have to be a very clear line of what's in and what's not, likely with what's in being a very short list (which is probably OK and would give the biggest gain, but not much discoverability). The criteria can't be primitives (gbk is problematic, and the most natural map isn't really the full ParDo primitive--in fact the full ParDo might be "advanced" enough to merit requiring apply).

> Is the previous proposal an issue (jet api)?

At first glance, StreamStage doesn't sound to me like a PCollection (mixes the notion of operations and values), and methods like flatMapUsingContext and hashJoin2 seem far down the slippery slope. But I haven't spent that much time looking at it.

Hazelcast has some concepts that make it way faster when used, like Spark etc., since it hosts both the data and the execution and you can do data affinity. This part doesn't apply to us, but overall their API is nice and smooth to discover.

>> Who knows, though I still think we made the right decision to attempt apply-only at the time, maybe I'll have to flesh this out into a new blog post that is a rebuttal to my original one :).

> Maybe for part of the users, clearly not for the ones I met in the last 3 months (what they said when opening their IDE is censored ;)).
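To make the pcoll.map(...).apply(...).flatMap(...) idea from earlier in the thread concrete, here is a minimal sketch on a toy in-memory collection. It is not the Beam API: Coll, kv, groupByKey and demo are hypothetical names, and Map.Entry stands in for a key/value element. It shows map and flatMap written purely as sugar over apply, and why an operation like gbk, which constrains the element type, still has to go through apply.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy stand-in for a PCollection; all names here are hypothetical.
final class FluentSketch {

    static final class Coll<T> {
        private final List<T> elements;

        private Coll(List<T> elements) {
            this.elements = elements;
        }

        @SafeVarargs
        static <T> Coll<T> of(T... items) {
            return new Coll<>(new ArrayList<>(Arrays.asList(items)));
        }

        List<T> elements() {
            return elements;
        }

        // The "pure" entry point: every operation is a transform.
        <R> Coll<R> apply(Function<Coll<T>, Coll<R>> transform) {
            return transform.apply(this);
        }

        // Sugar: map is defined only in terms of apply.
        <R> Coll<R> map(Function<? super T, ? extends R> fn) {
            return this.<R>apply(in -> {
                List<R> out = new ArrayList<>();
                for (T t : in.elements) {
                    out.add(fn.apply(t));
                }
                return new Coll<>(out);
            });
        }

        // Sugar: flatMap is defined only in terms of apply.
        <R> Coll<R> flatMap(Function<? super T, ? extends Iterable<? extends R>> fn) {
            return this.<R>apply(in -> {
                List<R> out = new ArrayList<>();
                for (T t : in.elements) {
                    for (R r : fn.apply(t)) {
                        out.add(r);
                    }
                }
                return new Coll<>(out);
            });
        }
    }

    static <K, V> Map.Entry<K, V> kv(K key, V value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Requires elements of type Map.Entry<K, V>. That constraint cannot be
    // stated on an instance method of Coll<T>, so this stays a transform
    // that is passed to apply.
    static <K, V> Function<Coll<Map.Entry<K, V>>, Coll<Map.Entry<K, List<V>>>> groupByKey() {
        return in -> {
            Map<K, List<V>> groups = new LinkedHashMap<>();
            for (Map.Entry<K, V> e : in.elements()) {
                groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
            List<Map.Entry<K, List<V>>> out = new ArrayList<>(groups.entrySet());
            return new Coll<>(out);
        };
    }

    // Sugar and apply mixed in one chain: words grouped by their length.
    static List<String> demo() {
        return Coll.of("a bb", "cc d")
            .flatMap(line -> Arrays.asList(line.split(" ")))
            .map(word -> kv(word.length(), word))
            .apply(groupByKey())
            .map(e -> e.getKey() + "=" + e.getValue())
            .elements();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [1=[a, d], 2=[bb, cc]]
    }
}
```

Because map and flatMap only delegate to apply, removing them later would not change what the collection can express; the sketch says nothing about coders or side inputs, which are the parts described above as complex.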