Re: Some extensions to the DoFn API

Reuven Lax Mon, 04 Jun 2018 07:56:04 -0700

I'll file a JIRA to track the idea.

On Mon, Jun 4, 2018 at 5:52 PM Jean-Baptiste Onofré <[email protected]> wrote:


> Exactly, that's why something like @xpath or @json-path could be
> interesting.
>
> Regards
> JB
>
> On 04/06/2018 16:48, Reuven Lax wrote:
> > Interesting. And given that Beam Schemas are recursive (a row can
> > contain nested rows), we might actually need something like xpath if we
> > want to make this fully general.
> >
> > Reuven
> >
> > On Mon, Jun 4, 2018 at 5:45 PM Jean-Baptiste Onofré <[email protected]> wrote:
> >
> >     Yup, it makes sense, it's what I had in mind.
> >
> >     In Apache Camel, in a Processor (similar to a DoFn), we can also bind
> >     language expressions directly to the arguments.
> >
> >     We can imagine something like:
> >
> >     @ProcessElement void process(@json-path("foo") String foo)
> >
> >     @ProcessElement void process(@xpath("//foo") String foo)
> >
> >     or even an expression language (simple/groovy/whatever).
> >
> >     Regards
> >     JB
> >
> >     On 04/06/2018 16:39, Reuven Lax wrote:
> >     > In the schema branch I have already added some annotations for
> >     > Schema. However in the future I think we could go even further and
> >     > allow users to pick individual fields out of the row schema. e.g.
> >     > the user might have a Schema with 100 fields, but only want to
> >     > process userId and geo location. I could imagine something like
> >     > this:
> >     >
> >     > @ProcessElement void process(@Field("userId") String userId,
> >     >     @Field("latitude") double lat, @Field("longitude") double lon) {
> >     > }
> >     >
> >     > And Beam could automatically extract the right fields for the user.
> >     > In fact we could do the same thing with KVs today - supplying
> >     > annotations to automatically unpack the KV.
> >     >
> >     > I do think there are a few nice ways to do side inputs as well, but
> >     > it's more work to design and implement, which is why I left it off
> >     > (and given that there is some design work, side input annotations
> >     > should be discussed on the dev list before implementation IMO).
> >     >
> >     > Reuven
> >     >
> >     > On Mon, Jun 4, 2018 at 5:29 PM Jean-Baptiste Onofré <[email protected]> wrote:
> >     >
> >     >     Hi Reuven,
> >     >
> >     >     That's a great improvement for users.
> >     >
> >     >     I don't see an easy way to have annotations for side input/output.
> >     >     I think we can also plan some extension annotations for schemas,
> >     >     like @Element(schema = foo) in addition to the type. Thoughts?
> >     >
> >     >     Regards
> >     >     JB
> >     >
> >     >     On 04/06/2018 16:06, Reuven Lax wrote:
> >     >     > Beam was created with an annotation-based processing API that
> >     >     > allows the framework to automatically inject parameters into a
> >     >     > DoFn's process method (and also allows the user to mark any
> >     >     > method as the process method using @ProcessElement). However,
> >     >     > these annotations were never completed. A specific set of
> >     >     > parameters could be injected (e.g. the window or
> >     >     > PipelineOptions), but for anything else you had to access it
> >     >     > through the ProcessContext. This limited the readability
> >     >     > advantage of this API.
> >     >     >
> >     >     > A couple of months ago I spent a bit of time extending the set
> >     >     > of annotations allowed. In particular, the most common uses of
> >     >     > ProcessContext were accessing the input element and outputting
> >     >     > elements, and both of those can now be done without
> >     >     > ProcessContext. Example usage:
> >     >     >
> >     >     > new DoFn<InputT, OutputT>() {
> >     >     >   @ProcessElement
> >     >     >   public void process(@Element InputT element,
> >     >     >       OutputReceiver<OutputT> out) {
> >     >     >     out.output(convertInputToOutput(element));
> >     >     >   }
> >     >     > }
> >     >     >
> >     >     > No need for ProcessContext anywhere in this DoFn! The Beam
> >     >     > framework also does type checking - if the @Element type was
> >     >     > not InputT, you would have seen an error. Multi-output DoFns
> >     >     > also work, using a MultiOutputReceiver interface.
> >     >     >
> >     >     > I'll update the Beam docs later with this information, but most
> >     >     > information accessible from ProcessContext, OnTimerContext,
> >     >     > StartBundleContext, or FinishBundleContext can now be accessed
> >     >     > via this sort of injection. The main exceptions are side inputs
> >     >     > and output from finishbundle, both of which still require the
> >     >     > context objects; however I hope to find time to provide direct
> >     >     > access to those as well.
> >     >     >
> >     >     > pr/5331 (in progress) converts most of Beam's built-in
> >     >     > transforms to use this clearer style.
> >     >     >
> >     >     > Reuven
> >     >
> >     >     --
> >     >     Jean-Baptiste Onofré
> >     >     [email protected]
> >     >     http://blog.nanthrax.net
> >     >     Talend - http://www.talend.com
> >     >
> >
> >     --
> >     Jean-Baptiste Onofré
> >     [email protected]
> >     http://blog.nanthrax.net
> >     Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
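
For illustration, the @Field idea quoted above could work roughly as follows. This is a minimal sketch in plain Java reflection, not Beam code: the Field annotation, the GeoFn class, the invoke helper, and the use of a Map as a stand-in for a schema row are all hypothetical names made up for the example. It only shows the general mechanism of reading a parameter annotation and pulling the matching value out of a wider record.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.lang.reflect.Parameter;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: none of these types are Beam classes.
class FieldInjectionSketch {

  // Hypothetical stand-in for the proposed @Field annotation.
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.PARAMETER)
  @interface Field {
    String value();
  }

  // Hypothetical user function that reads three fields out of a wider row.
  static class GeoFn {
    String seen;

    public void process(
        @Field("userId") String userId,
        @Field("latitude") double lat,
        @Field("longitude") double lon) {
      seen = userId + "@" + lat + "," + lon;
    }
  }

  // Finds the named method, resolves each @Field parameter from the row,
  // and invokes the method with the resolved values.
  static void invoke(Object fn, String methodName, Map<String, Object> row)
      throws ReflectiveOperationException {
    for (Method method : fn.getClass().getDeclaredMethods()) {
      if (!method.getName().equals(methodName)) {
        continue;
      }
      Parameter[] params = method.getParameters();
      Object[] args = new Object[params.length];
      for (int i = 0; i < params.length; i++) {
        Field field = params[i].getAnnotation(Field.class);
        if (field == null || !row.containsKey(field.value())) {
          throw new IllegalArgumentException("cannot resolve parameter " + i);
        }
        args[i] = row.get(field.value());
      }
      method.invoke(fn, args);
      return;
    }
    throw new NoSuchMethodException(methodName);
  }

  // Builds a row with one extra field and runs the hypothetical function on it.
  static String demo() {
    Map<String, Object> row = new HashMap<>();
    row.put("userId", "alice");
    row.put("latitude", 37.4);
    row.put("longitude", -122.1);
    row.put("unusedField", 42); // ignored: no parameter asks for it
    GeoFn fn = new GeoFn();
    try {
      invoke(fn, "process", row);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException(e);
    }
    return fn.seen;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

A real implementation would presumably validate the parameter types against the schema up front, in the same spirit as the @Element type check described in the thread, rather than failing at invocation time as this sketch does.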

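The remark quoted above about recursive schemas (a row can contain nested rows) suggests that a field selector would need some kind of path syntax. A minimal sketch of that idea, assuming a simple dotted path rather than xpath or json-path, and using nested Maps as a stand-in for nested rows; the class name, the resolve helper, and the path syntax are hypothetical and not part of Beam:

```java
import java.util.Map;

// Illustrative sketch only: nested maps stand in for nested rows.
class NestedFieldPathSketch {

  // Resolves a dotted path such as "location.latitude" one segment at a
  // time, descending into a nested map for each segment.
  static Object resolve(Map<String, Object> row, String path) {
    Object current = row;
    for (String name : path.split("\\.")) {
      if (!(current instanceof Map)) {
        throw new IllegalArgumentException(
            "'" + name + "' is not inside a nested row in path " + path);
      }
      Map<?, ?> nested = (Map<?, ?>) current;
      if (!nested.containsKey(name)) {
        throw new IllegalArgumentException(
            "no field '" + name + "' in path " + path);
      }
      current = nested.get(name);
    }
    return current;
  }

  // Looks up a field that lives one level down in a nested row.
  static Object demo() {
    Map<String, Object> location = Map.of("latitude", 37.4, "longitude", -122.1);
    Map<String, Object> row = Map.of("userId", "alice", "location", location);
    return resolve(row, "location.latitude");
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

A dotted path covers only plain nesting; selecting inside arrays or applying predicates is where a richer language such as the xpath or json-path styles mentioned in the thread would come in.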