On 15 March 2018 at 07:50, "Robert Bradshaw" <rober...@google.com> wrote:

On Wed, Mar 14, 2018 at 11:04 PM Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

> On 15 March 2018 at 06:52, "Robert Bradshaw" <rober...@google.com> wrote:

>> The stream API was looked at way back when we were designing the API;
one of the primary reasons it was not further pursued at the time was the
demand for Java 7 compatibility. It is also much more natural with lambdas,
but unfortunately the Java compiler discards types in this case, making
coder inference impossible. It is still interesting to explore, and I've been
toying with using this wrapping method for other applications
(specifically, giving a Pandas Dataframe API to PCollections in Python).
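To illustrate the remark above about the compiler discarding types for lambdas, here is a minimal Java sketch. It is not Beam code: `ErasureDemo` and its methods are hypothetical names made up for this example. It checks via reflection whether a `Function` instance still records its type arguments, which is the kind of information coder inference would need. On OpenJDK-style JVMs an anonymous class keeps them and a lambda does not.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.function.Function;

/** Hypothetical demo class, not part of any SDK. */
final class ErasureDemo {

    /** Same logic written as an anonymous class: the type arguments end up in the class file. */
    static Function<String, Integer> anonymous() {
        return new Function<String, Integer>() {
            @Override
            public Integer apply(String s) {
                return s.length();
            }
        };
    }

    /** Same logic written as a lambda: the generated class only implements the raw interface. */
    static Function<String, Integer> lambda() {
        return s -> s.length();
    }

    /** True if reflection can still see type arguments on an implemented interface. */
    static boolean recordsTypeArguments(Function<?, ?> fn) {
        for (Type t : fn.getClass().getGenericInterfaces()) {
            if (t instanceof ParameterizedType) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("anonymous class: " + recordsTypeArguments(anonymous()));
        System.out.println("lambda:          " + recordsTypeArguments(lambda()));
    }
}
```

With the lambda there is nothing left at runtime to say that the output elements are `Integer`, so a framework would have to be told the output type explicitly.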

>> There's a higher level question lingering here about making things more
fluent by putting methods on PCollections in our primary API. It was
somewhat of an experiment to go the very pure approach of *everything*
being expressed as a PTransform, and this is not without its disadvantages,
and (gasp) may be worth revisiting. In particular, some things that have
changed in the meantime are

>> * The Java SDK is no longer *the* definition of the model. The model has
been (mostly) formalized in the portability work, and the general Beam
concepts and notion of PTransform are much more widely fleshed out and
understood.

> This is wrong for all Java users, who are still the mainstream. It is
important to keep that in mind, and even if I know the portable API is something
important for you,

I think you misunderstood me. My point is that it is now much easier to
disentangle the essence of the Beam model (reified in part in the portable
API) from the Java API itself (which may evolve more independently, whereas
formerly syntactic sugar here would be conflated with core concepts).


Oh ok. Agree.


> it is something which should stay on top of runners and their API, which
means Java for all but one.

> All that to say that the most common default is Java.

I don't think it'll be that way for long; Scala alone might give Java a run
for its money.


Scala will probably need its own API, but it also generally goes with the
best-of-breed approach, which is the opposite of Beam by design (vendor
portability gives much more important guarantees, but is not always the
best), so let's see how it goes :).


> However, I agree each language should have its natural API and should
absolutely not just port over the same API. The goal is indeed to respect
each language's own philosophy.

> Conclusion: Java needs a more expressive, stream-like API.

> There is another way to see it: catching up on API debt compared to
competing APIs.


>> * Java 8's lambdas, etc. allow for much more succinct representation of
operations, which makes the relative ratio of boilerplate of using apply
that much higher. This is one of the struggles we had with the Python API:
pcoll.apply(Map(lambda ...)) made the "apply" feel *very* redundant. pcoll
| Map(...) is at least closer to pcoll.map(...).
>> * With over two years of experience with the 100% pure approach, we
still haven't "gotten used to it" enough that adding such methods isn't
appealing. (Note that by design adding such methods later is always easier
than taking them away, which was one justification for starting at the
extreme point).

>> Even if we go this route, there's no need to remove apply, and

>> pcoll
>>      .map(...)
>>      .apply(...)
>>      .flatMap(...)

>> flows fairly well (with map/flatMap being syntactic sugar to apply).
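A rough sketch of what "map/flatMap being syntactic sugar to apply" could look like in Java. Everything here is hypothetical and in-memory: `Coll` stands in for a PCollection, `Transform` for a PTransform, and `FluentDemo.distinct()` for some reusable transform. None of these are Beam classes, and coders, windowing and runners are ignored entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.function.Function;

/** Hypothetical stand-in for a PTransform: turns one collection into another. */
interface Transform<A, B> {
    Coll<B> expand(Coll<A> input);
}

/** Hypothetical stand-in for a PCollection, backed by an in-memory list. */
final class Coll<T> {
    final List<T> elements;

    Coll(List<T> elements) {
        this.elements = elements;
    }

    @SafeVarargs
    static <T> Coll<T> of(T... items) {
        return new Coll<T>(Arrays.asList(items));
    }

    /** The single general entry point, as in the apply-only style. */
    <R> Coll<R> apply(Transform<T, R> transform) {
        return transform.expand(this);
    }

    /** Sugar: map delegates to apply with an element-wise transform. */
    <R> Coll<R> map(Function<? super T, ? extends R> fn) {
        return this.<R>apply(input -> {
            List<R> out = new ArrayList<R>();
            for (T t : input.elements) {
                out.add(fn.apply(t));
            }
            return new Coll<R>(out);
        });
    }

    /** Sugar: flatMap delegates to apply with a one-to-many transform. */
    <R> Coll<R> flatMap(Function<? super T, ? extends Iterable<? extends R>> fn) {
        return this.<R>apply(input -> {
            List<R> out = new ArrayList<R>();
            for (T t : input.elements) {
                for (R r : fn.apply(t)) {
                    out.add(r);
                }
            }
            return new Coll<R>(out);
        });
    }
}

final class FluentDemo {
    /** A reusable transform that is still used through apply. */
    static <T> Transform<T, T> distinct() {
        return input -> new Coll<T>(new ArrayList<T>(new LinkedHashSet<T>(input.elements)));
    }

    /** Mixes the sugar methods and apply in one chain. */
    static List<String> run() {
        return Coll.of("a b", "b c")
            .flatMap(line -> Arrays.asList(line.split(" ")))
            .apply(distinct())
            .map(String::toUpperCase)
            .elements;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because `map` and `flatMap` only delegate to `apply`, dropping or adding such methods would not change what a pipeline can express, only how it reads.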

>> Agree, but the issue with that is you lose the natural approach and it
is harder to rework it, whereas having an API on top of "apply" lets you keep
both concerns split.

Having multiple APIs is undesirable; best to have one unless there are hard
constraints that prevent it (e.g. if the two would be jarringly
inconsistent, or one is forced by an interface, etc.)

>> Also, the PCollection API is what is complex (coders, sides, ...) and what I
hope we can hide behind another API.

I'd like to simplify things as well.

>> I think we would also have to still use apply for parameterless
operations like gbk that place constraints on the element types. I don't
see how to do combinePerKey either (though, asymmetrically, globalCombine
is fine).
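One way to read the point about gbk, sketched in Java with hypothetical in-memory types (`Bag`, `Op`, `Pair` and `Ops.groupByKey()` are invented for this example and are not Beam classes): a method declared on `Bag<T>` cannot require that `T` be a key/value pair, whereas a transform passed through `apply` can carry that constraint in its own type parameters. A global combine needs nothing from `T`, so it fits as a method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

/** Hypothetical key/value element type. */
final class Pair<K, V> {
    final K key;
    final V value;

    Pair(K key, V value) {
        this.key = key;
        this.value = value;
    }
}

/** Hypothetical stand-in for a PTransform. */
interface Op<A, B> {
    Bag<B> expand(Bag<A> in);
}

/** Hypothetical stand-in for a PCollection. */
final class Bag<T> {
    final List<T> items;

    Bag(List<T> items) {
        this.items = items;
    }

    <R> Bag<R> apply(Op<T, R> op) {
        return op.expand(this);
    }

    /** Fine as a method: it asks nothing of T beyond the combiner argument. */
    Bag<T> combineGlobally(BinaryOperator<T> combiner) {
        List<T> out = new ArrayList<T>();
        items.stream().reduce(combiner).ifPresent(out::add);
        return new Bag<T>(out);
    }

    // A parameterless groupByKey() cannot be declared here: Java has no way to
    // say "this method exists only when T is Pair<K, V>" on Bag<T> itself.
}

final class Ops {
    /** The element-type constraint lives in the transform's signature instead. */
    static <K, V> Op<Pair<K, V>, Pair<K, List<V>>> groupByKey() {
        return in -> {
            Map<K, List<V>> groups = new LinkedHashMap<K, List<V>>();
            for (Pair<K, V> p : in.items) {
                groups.computeIfAbsent(p.key, k -> new ArrayList<V>()).add(p.value);
            }
            List<Pair<K, List<V>>> out = new ArrayList<Pair<K, List<V>>>();
            for (Map.Entry<K, List<V>> e : groups.entrySet()) {
                out.add(new Pair<K, List<V>>(e.getKey(), e.getValue()));
            }
            return new Bag<Pair<K, List<V>>>(out);
        };
    }
}

final class GroupDemo {
    /** Groups three pairs by key; only compiles because the input holds Pair elements. */
    static Map<String, List<Integer>> grouped() {
        List<Pair<String, Integer>> data = new ArrayList<Pair<String, Integer>>();
        data.add(new Pair<String, Integer>("a", 1));
        data.add(new Pair<String, Integer>("b", 2));
        data.add(new Pair<String, Integer>("a", 3));
        Bag<Pair<String, Integer>> input = new Bag<Pair<String, Integer>>(data);

        Bag<Pair<String, List<Integer>>> result = input.apply(Ops.<String, Integer>groupByKey());
        Map<String, List<Integer>> asMap = new LinkedHashMap<String, List<Integer>>();
        for (Pair<String, List<Integer>> p : result.items) {
            asMap.put(p.key, p.value);
        }
        return asMap;
    }

    /** Global combine works as a plain method on any element type. */
    static int total() {
        Bag<Integer> numbers = new Bag<Integer>(Arrays.asList(1, 2, 3));
        return numbers.combineGlobally(Integer::sum).items.get(0);
    }

    public static void main(String[] args) {
        System.out.println(grouped());
        System.out.println(total());
    }
}
```

A per-key combine would hit the same wall as `groupByKey()`, since it also needs `T` to be a pair; calling `apply(Ops.groupByKey())` on a `Bag<Integer>` is rejected at compile time, which is the check a plain method could not provide.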

>> The largest fear I have is feature creep. There would have to be a very
clear line of what's in and what's not, likely with what's in being a very
short list (which is probably OK and would give the biggest gain, but not
much discoverability). The criteria can't be primitives (gbk is
problematic, and the most natural map isn't really the full ParDo
primitive--in fact the full ParDo might be "advanced" enough to merit
requiring apply).

> Is the previous proposal an issue (Jet API)?

At first glance, StreamStage doesn't sound to me like a PCollection (mixes
the notion of operations and values), and methods like flatMapUsingContext
and hashJoin2 seem far down the slippery slope. But I haven't spent that
much time looking at it.


Hazelcast has some concepts that make it much faster when used, like Spark
etc., since it hosts both the data and the execution and you can do data
affinity. This part doesn't apply to us, but overall their API is nice and
smooth to discover.


>> Who knows. Though I still think we made the right decision to attempt
apply-only at the time, maybe I'll have to flesh this out into a new blog
post that is a rebuttal to my original one :).

> Maybe for some of the users, but clearly not for the ones I met over the
last 3 months (what they said upon opening their IDE is censored ;)).
