Re: Some extensions to the DoFn API

Jean-Baptiste Onofré Mon, 04 Jun 2018 08:04:39 -0700

Thanks ! I will work on this one then ;)

Regards
JB


On 04/06/2018 16:55, Reuven Lax wrote:
> I'll file a JIRA to track the idea.
> 
> On Mon, Jun 4, 2018 at 5:52 PM Jean-Baptiste Onofré <j...@nanthrax.net
> <mailto:j...@nanthrax.net>> wrote:
> 
>     Exactly, that's why something like @xpath or @json-path could be
>     interesting.
> 
>     Regards
>     JB
> 
>     On 04/06/2018 16:48, Reuven Lax wrote:
>     > Interesting. And given that Beam Schemas are recursive (a row can
>     > contain nested rows), we might actually need something like xpath
>     if we
>     > want to make this fully general.
>     >
>     > Reuven
>     >
>     > On Mon, Jun 4, 2018 at 5:45 PM Jean-Baptiste Onofré
>     <j...@nanthrax.net <mailto:j...@nanthrax.net>
>     > <mailto:j...@nanthrax.net <mailto:j...@nanthrax.net>>> wrote:
>     >
>     >     Yup, it makes sense, it's what I had in mind.
>     >
>     >     In Apache Camel, in a Processor (similar to a DoFn), we can
>     also pass
>     >     directly languages to the arguments.
>     >
>     >     We can imagine something like:
>     >
>     >     @ProcessElement void process(@json-path("foo") String foo)
>     >
>     >     @ProcessElement void process(@xpath("//foo") String foo)
>     >
>     >     or even a expression language (simple/groovy/whatever).
>     >
>     >     Regards
>     >     JB
>     >
>     >     On 04/06/2018 16:39, Reuven Lax wrote:
>     >     > In the schema branch I have already added some annotations
>     for Schema.
>     >     > However in the future I think we could go even further and
>     allow users
>     >     > to pick individual fields out of the row schema. e.g. the
>     user might
>     >     > have a Schema with 100 fields, but only want to process
>     userId and geo
>     >     > location. I could imagine something like this
>     >     >
>     >     > @ProcessElement void process(@Field("userId") String
>     >     > userId, @Field("latitude") double lat, @Field("longitude")
>     double
>     >     long) {
>     >     > }
>     >     >
>     >     > And Beam could automatically extract the right fields for
>     the user. In
>     >     > fact we could do the same thing with KVs today - supplying
>     annotations
>     >     > to automatically unpack the KV.
>     >     >
>     >     > I do think there are a few nice ways to do side inputs as well,
>     >     but it's
>     >     > more work to design implement which is why I left it off (and
>     >     given that
>     >     > there is some design work, side input annotations should be
>     >     discussed on
>     >     > the dev list before implementation IMO).
>     >     >
>     >     > Reuven
>     >     >
>     >     > On Mon, Jun 4, 2018 at 5:29 PM Jean-Baptiste Onofré
>     >     <j...@nanthrax.net <mailto:j...@nanthrax.net>
>     <mailto:j...@nanthrax.net <mailto:j...@nanthrax.net>>
>     >     > <mailto:j...@nanthrax.net <mailto:j...@nanthrax.net>
>     <mailto:j...@nanthrax.net <mailto:j...@nanthrax.net>>>> wrote:
>     >     >
>     >     >     Hi Reuven,
>     >     >
>     >     >     That's a great improvement for user.
>     >     >
>     >     >     I don't see an easy way to have annotation about side
>     >     input/output.
>     >     >     I think we can also plan some extension annotation about
>     >     schema. Like
>     >     >     @Element(schema = foo) in addition of the type. Thoughts ?
>     >     >
>     >     >     Regards
>     >     >     JB
>     >     >
>     >     >     On 04/06/2018 16:06, Reuven Lax wrote:
>     >     >     > Beam was created with an annotation-based processing API,
>     >     that allows
>     >     >     > the framework to automatically inject parameters to a
>     DoFn's
>     >     process
>     >     >     > method (and also allows the user to mark any method as the
>     >     process
>     >     >     > method using @ProcessElement). However, these annotations
>     >     were never
>     >     >     > completed. A specific set of parameters could be injected
>     >     (e.g. the
>     >     >     > window or PipelineOptions), but for anything else you
>     had to
>     >     access it
>     >     >     > through the ProcessContext. This limited the readability
>     >     advantage of
>     >     >     > this API.
>     >     >     >
>     >     >     > A couple of months ago I spent a bit of time extending the
>     >     set of
>     >     >     > annotations allowed. In particular, the most common
>     uses of
>     >     >     > ProcessContext were accessing the input element and
>     outputting
>     >     >     elements,
>     >     >     > and both of those can now be done without ProcessContext.
>     >     Example
>     >     >     usage:
>     >     >     >
>     >     >     > new DoFn<InputT, OutputT>() {
>     >     >     >   @ProcessElement process(@Element InputT element,
>     >     >     > OutputReceiver<OutputT> out) {
>     >     >     >     out.output(convertInputToOutput(element));
>     >     >     >   }
>     >     >     > }
>     >     >     >
>     >     >     > No need for ProcessContext anywhere in this DoFn! The Beam
>     >     framework
>     >     >     > also does type checking - if the @Element type was not
>     >     InputT, you
>     >     >     would
>     >     >     > have seen an error. Multi-output DoFns also work, using a
>     >     >     > MultiOutputReceiver interface.
>     >     >     >
>     >     >     > I'll update the Beam docs later with this information,
>     but most
>     >     >     > information accessible from ProcessContext,
>     OnTimerContext,
>     >     >     > StartBundleContext, or FinishBundleContext can now be
>     >     accessed via
>     >     >     this
>     >     >     > sort of injection. The main exceptions are side inputs and
>     >     output from
>     >     >     > finishbundle, both of which still require the context
>     objects;
>     >     >     however I
>     >     >     > hope to find time to provide direct access to those as
>     well.
>     >     >     >
>     >     >     > pr/5331 (in progress) converts most of Beam's built-in
>     >     transforms
>     >     >     to use
>     >     >     > this clearer style.
>     >     >     >
>     >     >     > Reuven
>     >     >
>     >     >     --
>     >     >     Jean-Baptiste Onofré
>     >     >     jbono...@apache.org <mailto:jbono...@apache.org>
>     <mailto:jbono...@apache.org <mailto:jbono...@apache.org>>
>     >     <mailto:jbono...@apache.org <mailto:jbono...@apache.org>
>     <mailto:jbono...@apache.org <mailto:jbono...@apache.org>>>
>     >     >     http://blog.nanthrax.net
>     >     >     Talend - http://www.talend.com
>     >     >
>     >
>     >     --
>     >     Jean-Baptiste Onofré
>     >     jbono...@apache.org <mailto:jbono...@apache.org>
>     <mailto:jbono...@apache.org <mailto:jbono...@apache.org>>
>     >     http://blog.nanthrax.net
>     >     Talend - http://www.talend.com
>     >
> 
>     -- 
>     Jean-Baptiste Onofré
>     jbono...@apache.org <mailto:jbono...@apache.org>
>     http://blog.nanthrax.net
>     Talend - http://www.talend.com
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Some extensions to the DoFn API

Reply via email to