Re: [PROPOSAL] Additional design for the Beam Python State and Timers API

Robert Bradshaw Mon, 05 Nov 2018 00:39:39 -0800

On Fri, Oct 26, 2018 at 6:47 PM Kenneth Knowles <[email protected]> wrote:
>
> It all sounds very useful but I have basic concerns about item 1. The doc 
> doesn't really seem to go into the design concerns that I have in mind.
>
>  - map / flatMap are universal functions with definitions that we don't own 
> and shouldn't violate
>  - corollary: map / flatMap have per element parallelism with no dependencies 
> between them
>  - the doc says "map is just implemented as DoFn.process()" which is 
> implementation, not the spec


That's an interesting perspective. I had always thought of Map/FlatMap
as convenient ways of defining a ParDo from a lambda without the
baggage of having to provide a full DoFn class, similar to MapElements
in Java, but your interpretation is that the names themselves imply no
inter-element interaction (e.g. via state) is allowed.

> So suggestions:
>
>  - how about just give it a new name making it clear it is not map nor 
> flatMap and does not have per-element parallelism
>  - spec out the functionality without reference to DoFn
>  - be explicit about what determines the maximum parallelism / which elements 
> are required to be processed serially (generally key+window)

These probably don't merit a new (pair of) named operation(s). The
motivation to add them was "why can I use state in a DoFn, but not in
Map/FlatMap?" which could be justified by the above.

As for the side input question, definitely (A).

> On Fri, Oct 26, 2018 at 2:49 AM Robert Bradshaw <[email protected]> wrote:
>>
>> Thanks. They make sense to me (and would have been handy when I was
>> writing the state tests for the Fn API).
>> On Fri, Oct 26, 2018 at 10:48 AM Charles Chen <[email protected]> wrote:
>> >
>> > Hey there,
>> >
>> > A while back, I shared the Beam Python State and Timers API proposal 
>> > (https://s.apache.org/beam-python-user-state-and-timers) with this list 
>> > [1]; we reached consensus on the features proposed there and I implemented 
>> > the API surface described there, along with the reference DirectRunner 
>> > implementation and some shared code for executing such pipelines (see for 
>> > example https://github.com/apache/beam/pull/5691 and 
>> > https://github.com/apache/beam/pull/6304).
>> >
>> > I would like to propose some additional design considerations to improve 
>> > upon the previous State / Timer API proposal as described here (on pages 
>> > 14-18 that I have appended to the doc): 
>> > https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.10nb33sz7u16
>> >
>> > Briefly, these are the additional design considerations proposed to 
>> > improve upon the API:
>> >
>> > Allowing access to user state / timers in Map / FlatMap callables / lambdas
>> > Allowing access to the key during the timer callback
>> > Allowing access to auxiliary timer data
>> > Allowing access to side inputs during the timer callback
>> >
>> > I would really appreciate any feedback you may have.  Thanks!
>> >
>> > Best,
>> > Charles
>> >
>> >
>> > [1] Previous dev@ discussion thread: 
>> > https://lists.apache.org/thread.html/51ba1a00027ad8635bc1d2c0df805ce873995170c75d6a08dfe21997@%3Cdev.beam.apache.org%3E

Re: [PROPOSAL] Additional design for the Beam Python State and Timers API

Reply via email to