Re: Some extensions to the DoFn API

Reuven Lax Mon, 04 Jun 2018 07:40:05 -0700

In the schema branch I have already added some annotations for Schema.
However in the future I think we could go even further and allow users to
pick individual fields out of the row schema. For example, the user might
have a Schema with 100 fields, but only want to process userId and geo
location. I could imagine something like this:


@ProcessElement
void process(@Field("userId") String userId,
             @Field("latitude") double lat,
             @Field("longitude") double lon) {
}

And Beam could automatically extract the right fields for the user. In fact
we could do the same thing with KVs today - supplying annotations to
automatically unpack the KV.
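
To make the idea concrete, here is a minimal, self-contained sketch of how
such extraction could work using plain Java reflection. The @Field annotation,
the Map-based row, the GeoFn class, and the invoke helper are all hypothetical
stand-ins for illustration; this is not Beam's implementation or API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.lang.reflect.Parameter;
import java.util.HashMap;
import java.util.Map;

public class FieldInjectionSketch {
  /** Hypothetical stand-in for the proposed @Field annotation; not a Beam API. */
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.PARAMETER)
  @interface Field {
    String value();
  }

  /** A user function that declares only the fields it needs. */
  static class GeoFn {
    String seen;

    public void process(
        @Field("userId") String userId,
        @Field("latitude") double lat,
        @Field("longitude") double lon) {
      seen = userId + "@" + lat + "," + lon;
    }
  }

  /** Binds each annotated parameter to the row value of the same name, then invokes. */
  static void invoke(Object fn, String methodName, Map<String, Object> row) throws Exception {
    for (Method m : fn.getClass().getDeclaredMethods()) {
      if (!m.getName().equals(methodName)) {
        continue;
      }
      Parameter[] params = m.getParameters();
      Object[] args = new Object[params.length];
      for (int i = 0; i < params.length; i++) {
        Field f = params[i].getAnnotation(Field.class);
        if (f == null || !row.containsKey(f.value())) {
          throw new IllegalArgumentException("Cannot bind parameter " + i);
        }
        args[i] = row.get(f.value()); // boxed values are unboxed by reflection
      }
      m.invoke(fn, args);
      return;
    }
    throw new IllegalArgumentException("No method named " + methodName);
  }

  public static void main(String[] args) throws Exception {
    // The row carries more fields than the function asks for.
    Map<String, Object> row = new HashMap<>();
    row.put("userId", "u1");
    row.put("latitude", 1.5);
    row.put("longitude", 2.5);
    row.put("unused", "ignored");

    GeoFn fn = new GeoFn();
    invoke(fn, "process", row);
    System.out.println(fn.seen);
  }
}
```

A real implementation would presumably resolve the fields once against the
schema at pipeline construction time rather than reflecting per element, and
would type-check each parameter against the schema's field type.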

I do think there are a few nice ways to do side inputs as well, but it's
more work to design and implement, which is why I left it off (and given that
there is some design work, side input annotations should be discussed on
the dev list before implementation IMO).

Reuven

On Mon, Jun 4, 2018 at 5:29 PM Jean-Baptiste Onofré <[email protected]> wrote:

> Hi Reuven,
>
> That's a great improvement for users.
>
> I don't see an easy way to have annotations for side inputs/outputs.
> I think we can also plan some extension annotations for schemas, like
> @Element(schema = foo) in addition to the type. Thoughts?
>
> Regards
> JB
>
> On 04/06/2018 16:06, Reuven Lax wrote:
> > Beam was created with an annotation-based processing API that allows
> > the framework to automatically inject parameters to a DoFn's process
> > method (and also allows the user to mark any method as the process
> > method using @ProcessElement). However, these annotations were never
> > completed. A specific set of parameters could be injected (e.g. the
> > window or PipelineOptions), but for anything else you had to access it
> > through the ProcessContext. This limited the readability advantage of
> > this API.
> >
> > A couple of months ago I spent a bit of time extending the set of
> > annotations allowed. In particular, the most common uses of
> > ProcessContext were accessing the input element and outputting elements,
> > and both of those can now be done without ProcessContext. Example usage:
> >
> > new DoFn<InputT, OutputT>() {
> >   @ProcessElement public void process(@Element InputT element,
> >       OutputReceiver<OutputT> out) {
> >     out.output(convertInputToOutput(element));
> >   }
> > }
> >
> > No need for ProcessContext anywhere in this DoFn! The Beam framework
> > also does type checking - if the @Element type was not InputT, you would
> > have seen an error. Multi-output DoFns also work, using a
> > MultiOutputReceiver interface.
> >
> > I'll update the Beam docs later with this information, but most
> > information accessible from ProcessContext, OnTimerContext,
> > StartBundleContext, or FinishBundleContext can now be accessed via this
> > sort of injection. The main exceptions are side inputs and output from
> > @FinishBundle, both of which still require the context objects; however, I
> > hope to find time to provide direct access to those as well.
> >
> > pr/5331 (in progress) converts most of Beam's built-in transforms to use
> > this clearer style.
> >
> > Reuven
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
