Re: [PROPOSAL] A brand new DoFn

Aparup Banerjee (apbanerj) Thu, 28 Jul 2016 02:23:54 -0700

+1, Nice presentation




On 7/28/16, 2:17 AM, "Amit Sela" <[email protected]> wrote:

>+1
>
>A great way to avoid "noise" in the code, and as usual, an
>extremely thorough presentation & documentation.
>I'll take a look at the (initial) PR and also add the Spark section in the
>document soon.
>
>Thanks!
>
>On Thu, Jul 28, 2016 at 9:53 AM Aljoscha Krettek <[email protected]>
>wrote:
>
>> +1
>>
>> At first I liked the API but was skeptical because I thought that this would
>> require reflective invocation. Then I read on and saw that code generation
>> is used and was convinced. :-)
>>
>> I especially like how it both cleans up the API and allows more
>> optimizations in the future, especially with side inputs and the different
>> methods for emitting.
>>
>> On Wed, 27 Jul 2016 at 06:49 Jean-Baptiste Onofré <[email protected]> wrote:
>>
>> >
>> >
>> > +1
>> > I like the proposal and great description.
>> > Thanks
>> > Regards JB
>> >
>> > -------- Original message --------
>> > From: Kenneth Knowles <[email protected]>
>> > Date: 27/07/2016  05:29  (GMT+01:00)
>> > To: [email protected]
>> > Subject: [PROPOSAL] A brand new DoFn
>> >
>> > Hi all,
>> >
>> > I have a major new feature to propose: the next generation of DoFn.
>> >
>> > It sounds a bit grandiose, but I think it is the best way to understand the
>> > proposal.
>> >
>> > This is strongly motivated by the design for state and timers, aka "per-key
>> > workflows". Since the two features are separable and have separate design
>> > docs, I have started a separate thread for each.
>> >
>> > To get a quick overview of the proposal for a new DoFn, and how it improves
>> > upon the flexibility and validation of DoFn, browse this presentation:
>> >
>> >   https://s.apache.org/presenting-a-new-dofn
>> >
>> > Due to the extent of this proposal, Ben & I have also prepared an in-depth
>> > document at https://s.apache.org/a-new-dofn with additional details. Please
>> > comment on particulars there, or just reply to this email.
>> >
>> > The remainder of this email is yet another summary of the proposal, to
>> > entice you to read the documents above and respond with a "+1".
>> >
>> > This has been an experimental feature of the Java SDK for some time,
>> > under the name DoFnWithContext. For the purposes of this email and the
>> > linked documents, I will call it NewDoFn and I will call the status quo
>> > OldDoFn.
>> >
>> > The differences between NewDoFn and OldDoFn are most easily understood
>> > with a quick code snippet:
>> >
>> >     new OldDoFn<Foo, Baz>() {
>> >       @Override
>> >       public void processElement(ProcessContext c) { … }
>> >     }
>> >
>> >     new NewDoFn<Foo, Baz>() {
>> >       @ProcessElement   // <-- This is the only difference
>> >       public void processElement(ProcessContext c) { … }
>> >     }
>> >
>> > What changed? NewDoFn uses annotation-based dispatch instead of method
>> > overrides. The method annotated with @ProcessElement is used to process
>> > elements. It can have any name or signature, and validation is performed at
>> > pipeline construction time.
>> >
>> > Why do this? It allows the argument list for processElement to change. This
>> > approach gives NewDoFn many advantages, which are demonstrated at length in
>> > the linked documents. Here are some highlights:
>> >
>> >  - Simpler backwards-compatible approaches to new features
>> >  - Simpler composition of advanced features
>> >  - Greater pipeline construction-time validation
>> >  - Easier evolution of a simple anonymous DoFn into one that uses advanced
>> >    features
>> >
>> > Here are some abbreviated demonstrations of things that work today or could
>> > work easily with NewDoFn but require complex interrelated designs without
>> > it:
>> >
>> > Access the element's window:
>> >
>> >     new NewDoFn<Foo, Baz>() {
>> >       @ProcessElement
>> >       public void processElement(ProcessContext c, BoundedWindow w) { … }
>> >     }
>> >
>> > Use persistent state:
>> >
>> >     new NewDoFn<Foo, Baz>() {
>> >       @ProcessElement
>> >       public void processElement(
>> >           ProcessContext c,
>> >           @StateId("cell-id") ValueState<MyStateValue> state) {
>> >         …
>> >       }
>> >     }
>> >
>> > Set and receive timers:
>> >
>> >     new NewDoFn<Foo, Baz>() {
>> >       @ProcessElement
>> >       public void processElement(
>> >           ProcessContext c,
>> >           @TimerId("timer-id") Timer timer) {
>> >         …
>> >       }
>> >
>> >       @OnTimer("timer-id")
>> >       void onMyTimer(OnTimerContext c) { … }
>> >     }
>> >
>> > Receive a side input as a parameter:
>> >
>> >     new NewDoFn<Foo, Baz>() {
>> >       @ProcessElement
>> >       public void processElement(
>> >           ProcessContext c,
>> >           @SideInput Supplier<T> side) {
>> >         …
>> >       }
>> >     }
>> >
>> > So this is what I am proposing: We should move the Beam Java SDK to
>> > NewDoFn!
>> >
>> > My proposed migration plan is:
>> >
>> > 1. leave a git tag before anything, so users can pin to it
>> > 2. mv DoFn OldDoFn && mv DoFnWithContext DoFn
>> > 3. get everything working with all runners
>> > 4. rm OldDoFn # a few weeks later
>> >
>> > This will affect bleeding edge users, who will need to replace @Override
>> > with @ProcessElement in all their DoFns. They can also pin to a commit
>> > prior to the change or temporarily replace DoFn with OldDoFn everywhere.
>> >
>> > I've already done step 2 in a branch at
>> > https://github.com/kennknowles/incubator-beam/DoFnWithContext and ported a
>> > few examples in their own commits. If you view those commits, you can see
>> > how simple the migration path is.
>> >
>> > Please let me know what you think. It is a big change, but one that I think
>> > yields pretty nice rewards.
>> >
>> > Kenn
>> >
>>
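
For readers skimming the thread, here is a minimal sketch of what "annotation-based dispatch" with construction-time validation can look like. It is an illustration only and not the Beam implementation: the nested @ProcessElement annotation, the Window class, and the findProcessElement and invoke helpers are all hypothetical stand-ins defined inside the example, and elements are plain Strings. It also uses plain reflection for brevity, whereas the thread above notes that the actual proposal relies on code generation.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class AnnotationDispatchSketch {

  /** Hypothetical stand-in for the annotation discussed in the thread. */
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.METHOD)
  public @interface ProcessElement {}

  /** Hypothetical stand-in for an optional extra parameter such as a window. */
  public static final class Window {
    public final String name;
    public Window(String name) { this.name = name; }
  }

  /** Validates the class up front: exactly one annotated method, known parameter types only. */
  public static Method findProcessElement(Class<?> fnClass) {
    List<Method> found = new ArrayList<>();
    for (Method m : fnClass.getDeclaredMethods()) {
      if (m.isAnnotationPresent(ProcessElement.class)) {
        found.add(m);
      }
    }
    if (found.size() != 1) {
      throw new IllegalArgumentException(
          fnClass.getName() + " must have exactly one @ProcessElement method, found " + found.size());
    }
    Method method = found.get(0);
    for (Class<?> param : method.getParameterTypes()) {
      if (param != String.class && param != Window.class) {
        throw new IllegalArgumentException("Unsupported parameter type: " + param.getName());
      }
    }
    method.setAccessible(true);
    return method;
  }

  /** Builds the argument list from the method's declared parameter types, then invokes it. */
  public static Object invoke(Object fn, Method method, String element, Window window) {
    Class<?>[] types = method.getParameterTypes();
    Object[] args = new Object[types.length];
    for (int i = 0; i < types.length; i++) {
      args[i] = (types[i] == String.class) ? element : window;
    }
    try {
      return method.invoke(fn, args);
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException(e);
    }
  }

  /** Only asks for the element. */
  public static class Simple {
    @ProcessElement
    public String process(String element) { return element.toUpperCase(); }
  }

  /** Any method name works; this one also asks for the window. */
  public static class Windowed {
    @ProcessElement
    public String anyName(String element, Window w) { return element + "@" + w.name; }
  }

  /** No annotated method, so validation rejects it before any element is processed. */
  public static class Invalid {
    public String process(String element) { return element; }
  }

  public static void main(String[] args) {
    Window w = new Window("w1");
    System.out.println(invoke(new Simple(), findProcessElement(Simple.class), "foo", w));
    System.out.println(invoke(new Windowed(), findProcessElement(Windowed.class), "foo", w));
  }
}
```

The point of the sketch is that the dispatcher, not a fixed overridden signature, decides which arguments to supply, so a function can start asking for a window (or, by extension, state or timers) without changing a base class.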
