Re: [API PROPOSAL] PTransform.getURN, toProto, etc, for Java

Kenneth Knowles Fri, 16 Feb 2024 10:10:41 -0800

My opinion regarding the execution side and symmetry is this: it was always
wrong to use the term "PTransform" to describe the thing that is executed
by workers or SDK harnesses. They aren't the same and shouldn't be thought
of or implemented as the same.


The original Dataflow runner had it right - a runner converts Beam into a
physical plan that is composed of physical operations. *Steps* and *stages* in
Dataflow's case. These do involve invoking user DoFns and other UDFs which
are shared with the pipeline. You can take a look at the Dataflow v1 worker
and you'll see the set of useful steps is neither a subset nor a superset
of what you think of as Beam's core transforms. In fact they are entirely
disjoint despite the temptation to suggest that ParDoStep is "just" ParDo,
which is wrong.

The Beam model on the fn api side isn't as good as the original Dataflow
approach when it comes to this clarity. There was a desire to share bits of
the encoding of a DAG between the Pipeline proto and the
ProcessBundleDescriptor, which is understandable. But honestly it might be
extraneous complexity as the Dataflow v1 worker only executes trees. I
don't know quite where we landed on that. But I think re-using PTransform
with execution-oriented URNs to describe instructions to the SDK harness is
primarily misleading/confusing and saves maybe dozens of lines of code.

Which is all to say that how this may or may not impact the execution side
doesn't matter to me. I would view it as an improvement if they diverged
further. But this change - at first - will just be a refactor of where the
code lives that produces the same particular protos.
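
As a rough illustration of the shape this could take (a sketch only, not the
proposed API): the transform itself reports its URN and payload, and decoding
is a separate lookup keyed by URN. Every name below (PortableTransform,
AddPrefix, toSpec, fromSpec, the URN string) is made up for the example, and
FunctionSpec here is a plain Java stand-in for the proto message so the
snippet runs without Beam on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class PortableTransformSketch {

  /** Plain-Java stand-in for the FunctionSpec proto: a URN plus an opaque payload. */
  static final class FunctionSpec {
    final String urn;
    final byte[] payload;

    FunctionSpec(String urn, byte[] payload) {
      this.urn = urn;
      this.payload = payload;
    }
  }

  /** Hypothetical interface: the transform knows its own URN and payload. */
  interface PortableTransform {
    String getUrn();

    byte[] toPayload();

    // Inputs, outputs, and environment would be filled in by common code elsewhere.
    default FunctionSpec toSpec() {
      return new FunctionSpec(getUrn(), toPayload());
    }
  }

  /** Example transform whose only configuration is a string prefix. */
  static final class AddPrefix implements PortableTransform {
    static final String URN = "example:transform:add_prefix:v1"; // made-up URN

    final String prefix;

    AddPrefix(String prefix) {
      this.prefix = prefix;
    }

    @Override
    public String getUrn() {
      return URN;
    }

    @Override
    public byte[] toPayload() {
      return prefix.getBytes(StandardCharsets.UTF_8);
    }

    static AddPrefix fromPayload(byte[] payload) {
      return new AddPrefix(new String(payload, StandardCharsets.UTF_8));
    }
  }

  /** Decoding is separable from encoding: a lookup from URN to a payload decoder. */
  static final Map<String, Function<byte[], PortableTransform>> DECODERS = new HashMap<>();

  static {
    DECODERS.put(AddPrefix.URN, AddPrefix::fromPayload);
  }

  static PortableTransform fromSpec(FunctionSpec spec) {
    Function<byte[], PortableTransform> decoder = DECODERS.get(spec.urn);
    if (decoder == null) {
      throw new IllegalArgumentException("Unknown URN: " + spec.urn);
    }
    return decoder.apply(spec.payload);
  }

  public static void main(String[] args) {
    FunctionSpec spec = new AddPrefix("beam-").toSpec();
    AddPrefix roundTripped = (AddPrefix) fromSpec(spec);
    System.out.println(spec.urn + " -> " + roundTripped.prefix);
  }
}
```

Everything for AddPrefix sits in one class, and nothing forces the decoding
side to exist for every transform.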

Kenn

On Thu, Feb 15, 2024 at 2:48 PM Robert Burke <rob...@frantil.com> wrote:

> +1
>
> While the current Go SDK has always been portability first, it was designed
> with a goal of enabling it to back out of that at the time, so translating
> to protos and back again cuts across a broad vertical slice of things,
> leading to difficulties when adding a new core transform.
>
> I have an experimental hobby implementation of a Go SDK for prototyping
> things (mostly seeing if Go Generics can make a pipeline compile-time
> typesafe, and the answer is yes... but that's a different email) and went
> with emitting a FunctionSpec (urn and payload), the env ID, and
> UniqueName, while inputs and outputs were handled with common code.
>
> I still kept Execution side translation to be graph based at the time,
> because of the lost type information, which required additional graph
> context to build the execution side with the right types (eg for SDK side
> source, sink, and flatten handling).
>
> So I question if full symmetry is required. E.g., there's no reason for
> ExternalTransforms to be converted back on execution side, or for GBKs
> (usually that is, I'm looking at you TypeScript SDK!). And conversely,
> there are "Execution Side Only" transforms that are never directly written
> by a pipeline or transform author, but are necessary to execute SDK side
> (combine or SDF components for example), even though those have single user
> side constructs.
>
> That just implies that the toProto and fromProto parts are separable
> though.
>
> But that's just that specific experimental design for that specific
> language's affordances.
>
> It's definitely a big plus to be able to see all the bits for a single
> transform in one file, instead of trying to find the 5-8 different places
> one must add a registration for it. More so in Java where such handler
> registrations can be done via class annotations!
>
> Robert Burke
> Beam Go Busybody
>
> On Thu, Feb 15, 2024, 10:37 AM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> On Wed, Feb 14, 2024 at 10:28 AM Kenneth Knowles <k...@apache.org> wrote:
>> >
>> > Hi all,
>> >
>> > TL;DR I want to add some API like PTransform.getURN, toProto and
>> fromProto, etc. to the Java SDK. I want to do this so that making a
>> PTransform support portability is a natural part of writing the transform
>> and not a totally separate thing with tons of boilerplate.
>> >
>> > What do you think?
>>
>> Huge +1 to this direction.
>>
>> IMHO one of the most fundamental things about Beam is its model.
>> Originally this was only expressed in a specific SDK (Java) and then
>> got ported to others, but now that we have portability it's expressed
>> in a language-independent way.
>>
>> The fact that we keep these separate in Java is not buying us
>> anything, and causes a huge amount of boilerplate that'd be great to
>> remove, as well as making the essential model more front-and-center.
>>
>> > I think a particular API can be sorted out most easily in code (which I
>> will prepare after gathering some feedback).
>> >
>> > We already have all the translation logic written, and porting a couple
>> transforms to it will ensure the API has everything we need. We can refer
>> to Python and Go for API ideas as well.
>> >
>> > Lots of context below, but you can skip it...
>> >
>> > -----
>> >
>> > When we first created the portability framework, we wanted the SDKs to
>> be "standalone" and not depend on portability. We wanted portability to be
>> an optional plugin that users could opt in to. That is totally the opposite
>> now. We want portability to be the main place where Beam is defined, and
>> then SDKs make that available in language idiomatic ways.
>> >
>> > Also when we first created the framework, we were experimenting with
>> different serialization approaches and we wanted to be independent of
>> protobuf and gRPC if we could. But now we are pretty committed and it would
>> be a huge lift to use anything else.
>> >
>> > Finally, at the time we created the portability framework, we designed
>> it to allow composites to have URNs and well-defined specs, rather than
>> just be language-specific subgraphs, but we didn't really plan to make this
>> easy.
>> >
>> > For all of the above, most users depend on portability and on proto. So
>> separating them is not useful and just creates LOTS of boilerplate and
>> friction for making new well-defined transforms.
>> >
>> > Kenn
>>
>
