Hey all,

An update: https://github.com/apache/incubator-beam/pull/896 has been
merged, laying groundwork and adding support for splittable DoFn to the
in-memory runner.

What this PR does:
- It defines an API, in full accordance with the proposal discussed on this
thread.
- It adds a mostly runner-agnostic expansion of the ParDo transform for a
splittable DoFn, with one runner-specific primitive transform that needs to
be overridden by every runner.
- It overrides said transform in the in-memory runner, so this works
end-to-end in the in-memory runner.
- All this code is covered by tests (unit and integration
@RunnableOnService) and appears to work properly in combination with the
rest of the Beam model: e.g., inputs to a splittable DoFn can be windowed,
and their windows and timestamps are transparently propagated.
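
For context, the shape of the idea (not the merged Java API, which lives in
the PR and is still in flux) is that processElement works against a
restriction tracker, claiming positions before producing output. Everything
below - class and method names included - is an illustrative Python sketch,
not the actual code:

```python
class OffsetRangeTracker:
    """Illustrative stand-in for the tracker a splittable DoFn would receive."""

    def __init__(self, start, stop):
        self.start, self.stop = start, stop
        self.last_claimed = None

    def try_claim(self, position):
        # A claim succeeds only while the position is inside the restriction.
        if position >= self.stop:
            return False
        self.last_claimed = position
        return True

    def checkpoint(self):
        # Split off the unprocessed remainder as a residual restriction;
        # the primary restriction shrinks to what was actually claimed.
        residual = OffsetRangeTracker(self.last_claimed + 1, self.stop)
        self.stop = self.last_claimed + 1
        return residual


def process_element(tracker, read_record):
    """Emits records while the tracker lets it claim their positions."""
    out = []
    pos = tracker.start
    while tracker.try_claim(pos):
        out.append(read_record(pos))
        pos += 1
    return out
```

The point of the claim-before-output protocol is that the runner can shrink
the restriction concurrently (for checkpointing or splitting) and the DoFn
simply observes a failed claim and stops.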

Caveats:
- The API is marked @Experimental, and even that is an understatement: it
is still in flux and not intended for use yet. Overwhelmingly likely, it
*will* change in incompatible ways. DO NOT write pipelines with this
transform yet.
- It only works in the in-memory runner: the vast majority of code is
runner-agnostic, but a central runner-specific primitive transform is only
overridden by the in-memory runner.

My immediate next step is to make this work in the Cloud Dataflow streaming
runner (the runner I'm most familiar with), in order to learn what kinds of
runner hooks are needed and to get the API in shape for adding hooks for
other runners - and then to work, myself or with the community, on making
it work in other runners too. Once enough runners sufficiently support a
particular subset of features, we can start transitioning some connectors,
or writing new ones, using that subset (I expect streaming connectors will
come first).

Additionally, the Python SDK is considering using Splittable DoFn as the
*only* API for streaming sources (right now it doesn't have any API for
that, so there are no compatibility concerns). No implementation work has
happened yet, but it seems like a good idea.

On Tue, Aug 30, 2016 at 1:45 AM Aljoscha Krettek <aljos...@apache.org>
wrote:

> Thanks for the explanation Eugene and JB.
>
> By the way, I'm not trying to find holes in this, I really like the
> feature. I just sometimes wonder how a specific thing might be implemented
> with this.
>
> On Mon, 29 Aug 2016 at 18:00 Eugene Kirpichov <kirpic...@google.com.invalid
> >
> wrote:
>
> > Hi Aljoscha,
> >
> > The watermark reporting is done via
> > ProcessContinuation.futureOutputWatermark, at the granularity of
> > returning from individual processElement() calls - you return from the
> > call and give a watermark on your future output. We assume that updating
> > the watermark is sufficient at a per-bundle level (or, if not, that you
> > can make bundles small enough) because that's the same level at which
> > state changes, timers etc. are committed.
> > It can be implemented by setting a per-key watermark hold and updating it
> > when each call for this element returns. That's the way it is implemented
> > in my current prototype
> > https://github.com/apache/incubator-beam/pull/896
> > (see SplittableParDo.ProcessFn)
> >
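
The per-key watermark hold described above can be sketched as follows
(plain Python, illustrative only; Beam's actual hold machinery is
runner-internal and the names here are made up):

```python
class WatermarkHolds:
    """Tracks a per-key watermark hold, updated each time a
    processElement() call for that key's element returns."""

    def __init__(self):
        self.holds = {}

    def on_process_return(self, key, future_output_watermark):
        # The watermark the DoFn promised for its future output
        # becomes the new hold for this key.
        self.holds[key] = future_output_watermark

    def on_done(self, key):
        # "Do not resume": release the hold so it no longer gates output.
        self.holds.pop(key, None)

    def output_watermark(self, input_watermark):
        # The output watermark can't pass the input watermark or any hold.
        return min([input_watermark] + list(self.holds.values()))
```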
> > On Mon, Aug 29, 2016 at 2:55 AM Aljoscha Krettek <aljos...@apache.org>
> > wrote:
> >
> > > Hi,
> > > I have another question about this: currently, unbounded sources have
> > > special logic for determining the watermark and the system periodically
> > > asks the sources for the current watermark. As I understood it,
> > > watermarks are only "generated" at the sources. How will this work when
> > > sources are implemented as a combination of DoFns and SplittableDoFns?
> > > Will SplittableDoFns be asked for a watermark? Does this mean that
> > > watermarks can then be "generated" at any operation?
> > >
> > > Cheers,
> > > Aljoscha
> > >
> > > On Sun, 21 Aug 2016 at 22:34 Eugene Kirpichov
> > <kirpic...@google.com.invalid
> > > >
> > > wrote:
> > >
> > > > Hi JB,
> > > >
> > > > Yes, I'm assuming you're referring to the "magic" part on the
> > > > transform expansion diagram. This is indeed runner-specific, and
> > > > timers+state are likely the simplest way to do this for an SDF that
> > > > does an unbounded amount of work.
> > > >
> > > > On Sun, Aug 21, 2016 at 12:14 PM Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> > > > wrote:
> > > >
> > > > > Anyway, from a runner perspective, we will have a kind of API (part
> > > > > of the Runner API) to "orchestrate" the SDF as we discussed during
> > > > > the call, right ?
> > > > >
> > > > > Regards
> > > > > JB
> > > > >
> > > > > On 08/21/2016 07:24 PM, Eugene Kirpichov wrote:
> > > > > > Hi Aljoscha,
> > > > > > This is an excellent question! And the answer is, we don't need
> > > > > > any new concepts like an "SDF executor" and can rely on the
> > > > > > per-key state and timers machinery that already exists in all
> > > > > > runners because it's necessary to implement windowing/triggering
> > > > > > properly.
> > > > > >
> > > > > > Note that this is already somewhat addressed in the previously
> > > > > > posted State and Timers proposal https://s.apache.org/beam-state ,
> > > > > > under "per-key workflows".
> > > > > >
> > > > > > Think of it this way, using the Kafka example: we'll expand it
> > > > > > into a transform:
> > > > > >
> > > > > > (1) ParDo { topic -> (unique key, topic, partition, [0, inf)) for
> > > > > > partition in topic.listPartitions() }
> > > > > > (2) GroupByKey
> > > > > > (3) ParDo { key, topic, partition, R -> Kafka reader code in the
> > > > > > proposal/slides }
> > > > > >   - R is the OffsetRange restriction, which in this case will
> > > > > > always be of the form [startOffset, inf).
> > > > > >   - there'll be just 1 value per key, but we use GBK just to get
> > > > > > access to the per-key state/timers machinery. This may be
> > > > > > runner-specific; maybe some runners don't need a GBK to do that.
> > > > > >
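
The partition-to-keyed-restriction expansion in steps (1)-(2) can be
sketched in plain Python (illustrative only; the real expansion is Java
code in SplittableParDo, and the key and restriction representations here
are made up):

```python
import uuid

def expand_topic(partitions):
    """Step (1): pair each partition with a unique key and an
    initial restriction [start_offset, infinity)."""
    return [(uuid.uuid4().hex, {"partition": p, "range": (0, float("inf"))})
            for p in partitions]

def group_by_key(pairs):
    """Step (2): GBK - here just materializing the per-key view that
    gives each partition its own state/timers context."""
    return {k: v for k, v in pairs}
```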
> > > > > > Now suppose the topic has two partitions, P1 and P2, and they get
> > > > > > assigned unique keys K1, K2.
> > > > > > Then the input to (3) will be a collection of:
> > > > > > (K1, topic, P1, [0, inf)), (K2, topic, P2, [0, inf)).
> > > > > > Suppose we have just 1 worker with just 1 thread. Now, how will
> > > > > > this thread be able to produce elements from both P1 and P2?
> > > > > > Here's how.
> > > > > >
> > > > > > The thread will process (K1, topic, P1, [0, inf)), checkpoint
> > > > > > after a certain time or after a certain number of elements are
> > > > > > output (just like with the current UnboundedSource reading code)
> > > > > > producing a residual restriction R1' (basically a new start
> > > > > > timestamp), put R1' into the per-key state and set a timer T1 to
> > > > > > resume.
> > > > > > Then it will process (K2, topic, P2, [0, inf)), do the same
> > > > > > producing a residual restriction R2' and setting a timer T2 to
> > > > > > resume.
> > > > > > Then timer T1 will fire in the context of the key K1. The thread
> > > > > > will call processElement again, this time supplying R1' as the
> > > > > > restriction; the process repeats and after a while it checkpoints
> > > > > > and stores R1'' into the state of K1.
> > > > > > Then timer T2 will fire in the context of K2, run processElement
> > > > > > for a while, set a new timer and store R2'' into the state of K2.
> > > > > > Etc.
> > > > > > If partition 1 goes away, the processElement call will return "do
> > > > > > not resume", so a timer will not be set and instead the state
> > > > > > associated with K1 will be GC'd.
> > > > > >
> > > > > > So basically it's almost like cooperative thread scheduling:
> > > > > > things run for a while, until the runner tells them to checkpoint,
> > > > > > then they set a timer to resume themselves, and the runner fires
> > > > > > the timers, and the process repeats. And, again, this only
> > > > > > requires things that runners can already do - state and timers,
> > > > > > but no new concept of an SDF executor (and consequently no
> > > > > > necessity to choose/tune how many you need).
> > > > > >
> > > > > > Makes sense?
> > > > > >
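
The checkpoint-and-resume loop described above behaves like this toy
simulation, where a fixed per-activation budget stands in for the runner's
checkpoint signal (plain Python, illustrative only - no real timers or
state are involved):

```python
from collections import deque

def run_cooperatively(partitions, budget, rounds):
    """Simulates the timer loop: each key processes up to `budget`
    offsets per activation, stores its residual restriction, and
    re-enqueues itself ("sets a timer") if there is more to do."""
    state = dict(partitions)  # next unprocessed offset per key
    timers = deque(state)     # keys with a pending activation
    output = []
    for _ in range(rounds):
        if not timers:
            break
        key = timers.popleft()  # timer fires in the context of this key
        start = state[key]
        for offset in range(start, start + budget):
            output.append((key, offset))
        state[key] = start + budget  # residual restriction [start+budget, inf)
        timers.append(key)           # resume later
    return output
```

Even with one thread, output from both keys interleaves, which is the whole
point of the cooperative-scheduling analogy.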
> > > > > > On Sat, Aug 20, 2016 at 9:46 AM Aljoscha Krettek <
> > > aljos...@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > >> Hi,
> > > > > >> I have another question that I think wasn't addressed in the
> > > > > >> meeting. At least it wasn't mentioned in the notes.
> > > > > >>
> > > > > >> In the context of replacing sources by a combination of two SDFs,
> > > > > >> how do you determine how many "SDF executor" instances you need
> > > > > >> downstream? For the sake of argument assume that both SDFs are
> > > > > >> executed with parallelism 1 (or one per worker). Now, if you have
> > > > > >> a file source that reads from a static set of files, the first
> > > > > >> SDF would emit the filenames while the second SDF would receive
> > > > > >> the filenames and emit their contents. This works well and the
> > > > > >> downstream SDF can process one filename after the other. Now,
> > > > > >> think of something like a Kafka source. The first SDF would emit
> > > > > >> the partitions (say 4 partitions, in this example) and the second
> > > > > >> SDF would be responsible for reading from a topic and emitting
> > > > > >> elements. Reading from one topic never finishes, so you can't
> > > > > >> process the topics in series. I think you would need to have 4
> > > > > >> downstream "SDF executor" instances. The question now is: how do
> > > > > >> you determine whether you are in the first or the second
> > > > > >> situation?
> > > > > >>
> > > > > >> Probably I'm just overlooking something and this is already dealt
> > > > > >> with somewhere... :-)
> > > > > >>
> > > > > >> Cheers,
> > > > > >> Aljoscha
> > > > > >>
> > > > > >> On Fri, 19 Aug 2016 at 21:02 Ismaël Mejía <ieme...@gmail.com>
> > > wrote:
> > > > > >>
> > > > > >>> Hello,
> > > > > >>>
> > > > > >>> Thanks for the notes both Dan and Eugene, and for taking the
> > > > > >>> time to do the presentation and answer our questions.
> > > > > >>>
> > > > > >>> I mentioned the ongoing work on dynamic scaling on Flink because
> > > > > >>> I suppose that it will address dynamic rebalancing eventually
> > > > > >>> (there are multiple changes going on for dynamic scaling).
> > > > > >>>
> > > > > >>> https://docs.google.com/document/d/1G1OS1z3xEBOrYD4wSu-LuBCyPUWyFd9l3T9WyssQ63w/edit#heading=h.2suhu1fjxmp4
> > > > > >>> https://lists.apache.org/list.html?d...@flink.apache.org:lte=1M:FLIP-8
> > > > > >>>
> > > > > >>> Anyway I am far from an expert on Flink, but probably the Flink
> > > > > >>> folks can give their opinion about this and refer to more
> > > > > >>> precise documents than the ones I mentioned.
> > > > > >>>
> > > > > >>> Thanks again,
> > > > > >>> Ismaël
> > > > > >>>
> > > > > >>> On Fri, Aug 19, 2016 at 8:52 PM, Jean-Baptiste Onofré <
> > > > j...@nanthrax.net
> > > > > >
> > > > > >>> wrote:
> > > > > >>>
> > > > > >>>> Great summary Eugene and Dan.
> > > > > >>>>
> > > > > >>>> And thanks again for the details, explanation, and discussion.
> > > > > >>>>
> > > > > >>>> Regards
> > > > > >>>> JB
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On 08/19/2016 08:16 PM, Eugene Kirpichov wrote:
> > > > > >>>>
> > > > > >>>>> Thanks for attending, everybody!
> > > > > >>>>>
> > > > > >>>>> Here are meeting notes (thanks Dan!).
> > > > > >>>>>
> > > > > >>>>> Q: Will SplittableDoFn enable better repartitioning of the
> > > > > >>>>> input/output data?
> > > > > >>>>> A: Not really; repartitioning is orthogonal to SDF.
> > > > > >>>>>
> > > > > >>>>> The current Source API suffers from lack of composition and
> > > > > >>>>> scalability because we treat sources too much as metadata, not
> > > > > >>>>> enough as data.
> > > > > >>>>>
> > > > > >>>>> Q (slide with transform expansion): who does the "magic"?
> > > > > >>>>> A: The runner. Checkpointing and dynamically splitting
> > > > > >>>>> restrictions will require collaboration with the runner.
> > > > > >>>>>
> > > > > >>>>> Q: How does the runner interact with the DoFn to control the
> > > > > >>>>> restrictions? Is it related to the centralized job tracker
> > > > > >>>>> etc.?
> > > > > >>>>> A: RestrictionTracker is a simple helper object that exists
> > > > > >>>>> purely on the worker while executing a single partition, and
> > > > > >>>>> interacts with the worker-harness part of the runner. Not to
> > > > > >>>>> be confused with the centralized job tracker (master) -
> > > > > >>>>> completely unrelated. The worker harness, of course, interacts
> > > > > >>>>> with the master in some relevant ways (e.g. the Dataflow
> > > > > >>>>> master can tell "you're a straggler, you should split").
> > > > > >>>>>
> > > > > >>>>> Q: Is this a new DoFn subclass, or how will this integrate
> > > > > >>>>> with the existing code?
> > > > > >>>>> A: It's a feature of reflection-based DoFn
> > > > > >>>>> (https://s.apache.org/a-new-dofn) - just another optional
> > > > > >>>>> parameter of type RestrictionTracker to processElement(),
> > > > > >>>>> which is dynamically bound via reflection, so fully
> > > > > >>>>> backward/forward compatible, and looks to users like a regular
> > > > > >>>>> DoFn.
> > > > > >>>>>
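
The reflection-based binding can be illustrated with Python's inspect
module (a sketch of the idea only; the Java SDK does this via DoFn
method-signature reflection, not this code):

```python
import inspect

def invoke_process_element(fn, element, tracker):
    """Calls a user's process function, passing the tracker only if the
    function declares a parameter for it - the analogue of binding an
    optional RestrictionTracker parameter via reflection."""
    params = inspect.signature(fn).parameters
    if "tracker" in params:
        return fn(element, tracker=tracker)
    return fn(element)

# A plain DoFn-style function: no tracker parameter.
def plain(element):
    return element + 1

# A splittable-style function: declares a tracker parameter.
def splittable(element, tracker):
    return (element, tracker)
```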
> > > > > >>>>> Q: Why is fractionClaimed a double?
> > > > > >>>>> A: It's a number in 0..1. Job-wide magic (e.g. autoscaling,
> > > > > >>>>> dynamic rebalancing) requires a uniform way to represent
> > > > > >>>>> progress through different sources.
> > > > > >>>>>
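
For a bounded offset range, that uniform progress number could be computed
like this (an illustrative sketch; the actual method name and signature in
the Java API may differ):

```python
def fraction_claimed(start, stop, last_claimed):
    """Progress through an offset range [start, stop) as a number in 0..1,
    the uniform representation that autoscaling and dynamic rebalancing
    rely on across different kinds of restrictions."""
    if last_claimed is None:
        return 0.0  # nothing claimed yet
    return (last_claimed + 1 - start) / (stop - start)
```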
> > > > > >>>>> Q: The Spark runner is microbatch-based, so this seems to map
> > > > > >>>>> well onto checkpoint/resume, right?
> > > > > >>>>> A: Yes; actually the Dataflow runner is, at a worker level,
> > > > > >>>>> also microbatch-based. The way SDF interacts with a runner
> > > > > >>>>> will be very similar to how a Bounded/UnboundedSource
> > > > > >>>>> interacts with a runner.
> > > > > >>>>>
> > > > > >>>>> Q: Using SDF, what would be the "packaging" of the IO?
> > > > > >>>>> A: Same as currently: package IO's as PTransforms, and their
> > > > > >>>>> implementation under the hood can be anything: Source, simple
> > > > > >>>>> ParDo's, SDF, etc. E.g. Datastore was recently refactored from
> > > > > >>>>> BoundedSource to ParDo (it ended up simpler and more
> > > > > >>>>> scalable), transparently to users.
> > > > > >>>>>
> > > > > >>>>> Q: What's the timeline; what to do with the IOs currently in
> > > > > >>>>> development?
> > > > > >>>>> A: The timeline is O(months). Keep doing what you're doing,
> > > > > >>>>> working on top of the Source APIs when necessary and simple
> > > > > >>>>> ParDo's otherwise.
> > > > > >>>>>
> > > > > >>>>> Q: What's the impact for the runner writers?
> > > > > >>>>> A: Tentatively, we expect that most of the code for running an
> > > > > >>>>> SDF will be common to runners, with some amount of per-runner
> > > > > >>>>> glue code, just like GBK/windowing/triggering. The impact on
> > > > > >>>>> the Dataflow runner is larger since it supports dynamic
> > > > > >>>>> rebalancing in batch mode and this is the hardest part, but
> > > > > >>>>> for other runners it shouldn't be too hard.
> > > > > >>>>>
> > > > > >>>>> JB: Talend has people who can help with this: e.g. help
> > > > > >>>>> integrate SDF into the Spark runner, refactor IOs, etc. Amit
> > > > > >>>>> is also willing to chat about supporting SDF in the Spark
> > > > > >>>>> runner.
> > > > > >>>>>
> > > > > >>>>> Ismael: There's a Flink proposal about dynamic rebalancing.
> > > > > >>>>> Ismael will send a link.
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> On Fri, Aug 19, 2016 at 7:21 AM Jean-Baptiste Onofré <
> > > > > j...@nanthrax.net
> > > > > >>>
> > > > > >>>>> wrote:
> > > > > >>>>>
> > > > > >>>>> Hi Eugene,
> > > > > >>>>>>
> > > > > >>>>>> thanks for the reminder.
> > > > > >>>>>>
> > > > > >>>>>> Just to prepare some topics for the call, please find some
> > > > > >>>>>> points:
> > > > > >>>>>>
> > > > > >>>>>> 1. Using SDF, what would be the "packaging" of the IO ? It
> > > > > >>>>>> sounds to me that we can keep the IO packaging style (using
> > > > > >>>>>> with* setters for the IO configuration) and replace
> > > > > >>>>>> PTransform, Source, Reader, ... directly with SDF. Correct ?
> > > > > >>>>>>
> > > > > >>>>>> 2. What's your plan in terms of release to include SDF ? We
> > > > > >>>>>> have several IOs in preparation and I wonder if it's worth
> > > > > >>>>>> starting to use the new SDF API or not.
> > > > > >>>>>>
> > > > > >>>>>> 3. What's the impact for the runner writers ? The runners
> > > > > >>>>>> will have to support SDF, which could be tricky depending on
> > > > > >>>>>> the execution engine. In the worst case, where a runner can't
> > > > > >>>>>> fully support SDF, does it mean that most of our IOs will be
> > > > > >>>>>> useless ?
> > > > > >>>>>>
> > > > > >>>>>> Just my dumb topics ;)
> > > > > >>>>>>
> > > > > >>>>>> Thanks,
> > > > > >>>>>> See you at 8am !
> > > > > >>>>>>
> > > > > >>>>>> Regards
> > > > > >>>>>> JB
> > > > > >>>>>>
> > > > > >>>>>> On 08/19/2016 02:29 AM, Eugene Kirpichov wrote:
> > > > > >>>>>>
> > > > > >>>>>>> Hello everybody,
> > > > > >>>>>>>
> > > > > >>>>>>> Just a reminder:
> > > > > >>>>>>>
> > > > > >>>>>>> The meeting is happening tomorrow - Friday Aug 19th, 8am-9am
> > > > > >>>>>>> PST; to join the call go to
> > > > > >>>>>>> https://hangouts.google.com/hangouts/_/google.com/splittabledofn
> > > > > >>>>>>> I intend to go over the proposed design and then have a
> > > > > >>>>>>> free-form discussion.
> > > > > >>>>>>>
> > > > > >>>>>>> Please have a skim through the proposal doc:
> > > > > >>>>>>> https://s.apache.org/splittable-do-fn
> > > > > >>>>>>> I also made some slides that are basically a trimmed-down
> > > > > >>>>>>> version of the doc to use as a guide when conducting the
> > > > > >>>>>>> meeting:
> > > > > >>>>>>> https://docs.google.com/presentation/d/1DfCp9VRLH-5Ap5nbz047Fiv3NfjrywJVSdef5dRY4aE/edit?usp=sharing
> > > > > >>>>>>>
> > > > > >>>>>>> I will post notes from the meeting on this thread afterwards.
> > > > > >>>>>>>
> > > > > >>>>>>> Thanks, looking forward.
> > > > > >>>>>>>
> > > > > >>>>>>> On Fri, Aug 12, 2016 at 5:35 PM Dan Halperin
> > > > > >>>>>>> <dhalp...@google.com.invalid
> > > > > >>>>>>>
> > > > > >>>>>>> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>>> This is pretty cool! I'll be there too. (unless the hangout
> > > > > >>>>>>>> gets too full -- if so, I'll drop out in favor of others who
> > > > > >>>>>>>> aren't lucky enough to get to talk to Eugene all the time.)
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Fri, Aug 12, 2016 at 4:03 PM, Andrew Psaltis <
> > > > > >>>>>>>>
> > > > > >>>>>>> psaltis.and...@gmail.com>
> > > > > >>>>>>
> > > > > >>>>>>> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>> +1 I'll join
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On Friday, August 12, 2016, Aparup Banerjee (apbanerj) <
> > > > > >>>>>>>>>
> > > > > >>>>>>>> apban...@cisco.com
> > > > > >>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>> wrote:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> + 1, me2
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> On 8/12/16, 9:27 AM, "Amit Sela" <amitsel...@gmail.com>
> > > > > >>>>>>>>>> wrote:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> +1 as in I'll join ;-)
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> On Fri, Aug 12, 2016, 19:14 Eugene Kirpichov
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>> <kirpic...@google.com.invalid
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Sounds good, thanks!
> > > > > >>>>>>>>>>>> Then Friday Aug 19th it is, 8am-9am PST,
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> https://staging.talkgadget.google.com/hangouts/_/google.com/splittabledofn
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>> On Thu, Aug 11, 2016 at 11:12 PM Jean-Baptiste Onofré <
> > > > > >>>>>>>>>>>> j...@nanthrax.net> wrote:
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Hi
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Unfortunately I will be in Ireland on August 15th.
> What
> > > > about
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Friday
> > > > > >>>>>>>>
> > > > > >>>>>>>>> 19th ?
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Regards
> > > > > >>>>>>>>>>>>> JB
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> On Aug 11, 2016, 23:22, at 23:22, Eugene Kirpichov
> > > > > >>>>>>>>>>>>> <kirpic...@google.com.INVALID> wrote:
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Hi JB,
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Sounds great, does the suggested time over
> > > videoconference
> > > > > >> work
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> for
> > > > > >>>>>>>>
> > > > > >>>>>>>>> you?
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> On Thu, Aug 11, 2016 at 11:59 AM Jean-Baptiste Onofré <
> > > > > >>>>>>>>>>>>>> j...@nanthrax.net> wrote:
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Hi Eugene
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> May we talk together next week ? I like the
> > proposal. I
> > > > > >> would
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> just
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>> need
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> some details for my understanding.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> Thanks
> > > > > >>>>>>>>>>>>>>> Regards
> > > > > >>>>>>>>>>>>>>> JB
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> On Aug 11, 2016, 19:46, at 19:46, Eugene Kirpichov
> > > > > >>>>>>>>>>>>>>> <kirpic...@google.com.INVALID> wrote:
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Hi JB,
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> What are your thoughts on this?
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> I'm also thinking of having a virtual meeting to
> > > explain
> > > > > >> more
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> about
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> this
> > > > > >>>>>>>>>>>>>>>> proposal if necessary, since I understand it is a
> > lot
> > > to
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> digest.
> > > > > >>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> How about: Monday Aug 15, 8am-9am Pacific time,
> over
> > > > > >>> Hangouts?
> > > > > >>>>>>>>>>>>>>>> (link:
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> https://staging.talkgadget.google.com/hangouts/_/google.com/splittabledofn
> > > > > >>>>>>>>>>> -
> > > > > >>>>>>>>>>>>>>>> I confirmed that it can be joined without being
> > logged
> > > > > >> into a
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> Google
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> account)
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Who'd be interested in attending, and does this
> > > > time/date
> > > > > >>> work
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> for
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>> people?
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> On Fri, Aug 5, 2016 at 10:48 AM Eugene Kirpichov <
> > > > > >>>>>>>>>>>>>>>> kirpic...@google.com> wrote:
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Hi JB, thanks for reading and for your comments!
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> It sounds like you are concerned about continued
> > > > > >>>>>>>>>>>>>>>>> support for existing IO's people have developed,
> > > > > >>>>>>>>>>>>>>>>> and about backward compatibility?
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> We do not need to remove the Source API, and all
> > > > > >>>>>>>>>>>>>>>>> existing Source-based connectors will continue to
> > > > > >>>>>>>>>>>>>>>>> work [though the document proposes at some point
> > > > > >>>>>>>>>>>>>>>>> to make Read.from(Source) translate to a wrapper
> > > > > >>>>>>>>>>>>>>>>> SDF under the hood, to exercise the feature more
> > > > > >>>>>>>>>>>>>>>>> and to make sure that it is strictly more
> > > > > >>>>>>>>>>>>>>>>> powerful - but this is an optional implementation
> > > > > >>>>>>>>>>>>>>>>> detail].
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Perhaps the document phrases this too strongly -
> > > > > >>>>>>>>>>>>>>>>> "replacing the Source API": a better phrasing
> > > > > >>>>>>>>>>>>>>>>> would be "introducing a new API so powerful and
> > > > > >>>>>>>>>>>>>>>>> easy-to-use that hopefully people will choose it
> > > > > >>>>>>>>>>>>>>>>> over the Source API all the time, even though they
> > > > > >>>>>>>>>>>>>>>>> don't have to" :) And we can discuss whether or
> > > > > >>>>>>>>>>>>>>>>> not to actually deprecate/remove the Source API at
> > > > > >>>>>>>>>>>>>>>>> some point down the road, once it becomes clear
> > > > > >>>>>>>>>>>>>>>>> whether this is the case or not.
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> To give more context: this proposal came out of
> > > > > >> discussions
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> within
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> the SDK
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> team over the past ~1.5 years, before the Beam
> > > project
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> existed,
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>> on
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> how to
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> make major improvements to the Source API;
> perhaps
> > it
> > > > > will
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> clarify
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> things
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> if I give a history of the ideas discussed:
> > > > > >>>>>>>>>>>>>>>>> - The first idea was to introduce a
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Read.from(PCollection<Source>)
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> transform while keeping the Source API intact - this,
> > given
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> appropriate
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> implementation, would solve most of the
> scalability
> > > and
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> composability
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> issues of IO's. Then most connectors would look
> like
> > :
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> ParDo<A,
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>> Source<B>>
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> + Read.from().
- Then we figured that the Source class is an unnecessary abstraction, as
it simply holds data. What if we only had a Reader<S, B> class, where S is
the source type and B the output type? Then connectors would be something
like: ParDo<A, S> + hypothetical Read.using(Reader<S, B>).
- Then somebody remarked that some of the features of Source are useful to
ParDo's as well: e.g. the ability to report progress when processing a very
heavy element, or the ability to produce very large output in parallel.
- The two previous bullets were already hinting that the Read.using()
primitive might not be so special: it just takes S and produces B. Isn't
that what a ParDo does, plus some source magic, minus the convenience of
c.output() vs. the start()/advance() state machine?
- At this point it became clear that we should explore unifying sources
and ParDo's, in particular: can we bring the magic of sources to ParDo's,
but without the limitations and coding inconveniences? And this is how
SplittableDoFn was born: bringing source magic to a DoFn by providing a
RangeTracker.
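To make the RangeTracker idea concrete, here is a minimal, hypothetical
sketch. This is not the API from the proposal document; the class and
method names are made up for illustration. The contract it models: a
tracker owns a half-open offset range, processing must claim each position
before producing output for it, and a failed claim means the remaining
range has been taken away and processing must stop.

```java
// Hypothetical illustration of a claim-based range tracker.
// Not the actual proposed Beam API; names are invented for this sketch.
class OffsetRangeTracker {
  private final long startOffset;  // inclusive
  private long stopOffset;         // exclusive; may shrink when split
  private long lastClaimed = -1L;

  OffsetRangeTracker(long startOffset, long stopOffset) {
    this.startOffset = startOffset;
    this.stopOffset = stopOffset;
  }

  long start() {
    return startOffset;
  }

  // Claim the offset before producing output for it. A return of false
  // means the rest of the range no longer belongs to this call, so the
  // DoFn must stop immediately without outputting for this offset.
  synchronized boolean tryClaim(long offset) {
    if (offset >= stopOffset) {
      return false;
    }
    lastClaimed = offset;
    return true;
  }

  // A runner could steal the unprocessed tail by shrinking the range;
  // everything at or beyond the returned stop offset would become a new
  // restriction to be processed elsewhere.
  synchronized long splitAt(long offset) {
    stopOffset = Math.max(offset, lastClaimed + 1);
    return stopOffset;
  }
}
```

Claiming and splitting race against each other, which is why both methods
are synchronized in this sketch: a split can never move the boundary below
an already-claimed offset.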
- Once the idea of "splittable DoFn's" was born, it became clear that it
is strictly more general than sources; at least in the respect that
sources have to produce output, while DoFn's don't: an SDF may very well
produce no output at all, and simply perform a side effect in a
parallel/resumable way.
- Then there were countless hours of discussions on unifying the
bounded/unbounded cases, on the particulars of RangeTracker APIs
reconciling parallelization and checkpointing, on what the relation
between SDF and DoFn should be, etc. They culminated in the current
proposal. The proposal comes at a time when a couple of key ingredients
are (almost) ready: NewDoFn to make SDF look like a regular DoFn, and the
State/Timers proposal to enable unbounded work per element.

To put it shortly:
- Yes, we will support existing Source connectors, and will support
writing new ones, possibly forever. There is no interference with current
users of Source.
- The new API is an attempt to improve the Source API, taken to its
logical limit, where it turns out that users' goals can be accomplished
more easily and more generically entirely within ParDo's.

Let me know what you think, and thanks again!

On Fri, Aug 5, 2016 at 2:39 AM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:
Hi Eugene,

Just a question: why is it in DoFn and not an improvement of Source?

If I understand correctly, it means that we will have to refactor all
existing IOs: basically, what you propose is to remove all Sources and
replace them with NewDoFn.

I'm concerned with this approach, especially in terms of timing: clearly,
IO is the area where we have to move forward in Beam, as it will allow new
users to start on their projects. So, we started to bring new IOs: Kafka,
JMS, Cassandra, MongoDB, JDBC, ... and some people started to learn the IO
API (Bounded/Unbounded source, etc).

I think it would make more sense to enhance the IO API (Source) instead of
introducing a NewDoFn.

What are your thoughts for IO writers like me? ;)

Regards
JB

On 08/04/2016 07:45 PM, Eugene Kirpichov wrote:
Hello Beam community,

We (myself, Daniel Mills and Robert Bradshaw) would like to propose
"Splittable DoFn" - a major generalization of DoFn, which allows
processing of a single element to be non-monolithic, i.e. checkpointable
and parallelizable, as well as doing an unbounded amount of work per
element.

This allows effectively replacing the current Bounded/UnboundedSource
APIs with DoFn's that are much easier to code, more scalable and
composable with the rest of the Beam programming model, and enables many
use cases that were previously difficult or impossible, as well as some
non-obvious new use cases.

This proposal has been mentioned before in JIRA [BEAM-65] and some Beam
meetings, and now the whole thing is written up in a document:

        https://s.apache.org/splittable-do-fn
Here are some things that become possible with Splittable DoFn:
- Efficiently read a filepattern matching millions of files
- Read a collection of files that are produced by an earlier step in the
pipeline (e.g. easily implement a connector to a storage system that can
export itself to files)
- Implement a Kafka reader by composing a "list partitions" DoFn with a
DoFn that simply polls a consumer and outputs new records in a while()
loop
- Implement a log tailer by composing a DoFn that incrementally returns
new files in a directory and a DoFn that tails a file
- Implement a parallel "count friends in common" algorithm (matrix
squaring) with good work balancing
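The composition pattern in these bullets (a monolithic source becomes two
ordinary chained steps) can be modeled in plain Java, with no Beam
dependency. Everything here is a toy stand-in: listPartitions and
readPartition fake what Kafka metadata and a partition read would return,
and flatMap stands in for chaining ParDos.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Toy model of the composition idea: instead of one monolithic "Kafka
// source", a step that expands a topic into partitions is chained with a
// step that reads each partition, the way two ParDos would be chained.
class ComposedRead {
  // Stand-in for a "list partitions" DoFn consulting Kafka metadata.
  static List<String> listPartitions(String topic) {
    return Arrays.asList(topic + "-0", topic + "-1");
  }

  // Stand-in for a per-partition reading DoFn polling a consumer.
  static List<String> readPartition(String partition) {
    return Arrays.asList(partition + ":recordA", partition + ":recordB");
  }

  // Stand-in for chaining: apply fn to every element, concatenate outputs.
  static <A, B> List<B> flatMap(List<A> in, Function<A, List<B>> fn) {
    List<B> out = new ArrayList<>();
    for (A a : in) {
      out.addAll(fn.apply(a));
    }
    return out;
  }
}
```

The point of the pattern is that each stage is independently
parallelizable: once partitions are listed, reads of different partitions
can run on different workers, which a monolithic source API has to arrange
internally.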
Here is the meaningful part of a hypothetical Kafka reader written
against this API:

    ProcessContinuation processElement(
        ProcessContext context, OffsetRangeTracker tracker) {
      try (KafkaConsumer<String, String> consumer =
          Kafka.subscribe(context.element().topic,
                          context.element().partition)) {
        consumer.seek(tracker.start());
        while (true) {
          ConsumerRecords<String, String> records = consumer.poll(100ms);
          if (records == null) return done();
          for (ConsumerRecord<String, String> record : records) {
            if (!tracker.tryClaim(record.offset())) {
              return resume().withFutureOutputWatermark(record.timestamp());
            }
            context.output(record);
          }
        }
      }
    }
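The done() / resume() return values in the snippet suggest a small value
type. Here is a guess at what such a ProcessContinuation might look like;
the snippet itself is explicitly hypothetical, and so is every name in
this sketch.

```java
// Illustrative sketch only, not a real Beam API: a value object telling
// the runner whether processElement should be called again for the rest
// of the restriction, and if so, a lower bound on future output times.
final class ProcessContinuation {
  private final boolean shouldResume;
  private final long futureOutputWatermarkMillis;

  private ProcessContinuation(boolean shouldResume, long watermarkMillis) {
    this.shouldResume = shouldResume;
    this.futureOutputWatermarkMillis = watermarkMillis;
  }

  // The restriction is fully processed; the runner need not call again.
  static ProcessContinuation done() {
    return new ProcessContinuation(false, Long.MIN_VALUE);
  }

  // More work remains (e.g. the tracker refused a claim because the
  // runner took away part of the range): call processElement again later.
  static ProcessContinuation resume() {
    return new ProcessContinuation(true, Long.MIN_VALUE);
  }

  // Promise that no future output will carry a timestamp below this bound.
  ProcessContinuation withFutureOutputWatermark(long watermarkMillis) {
    return new ProcessContinuation(shouldResume, watermarkMillis);
  }

  boolean shouldResume() {
    return shouldResume;
  }

  long futureOutputWatermarkMillis() {
    return futureOutputWatermarkMillis;
  }
}
```

The watermark bound is what lets a runner keep downstream event-time
processing honest while a read is suspended between resumptions.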
The document describes in detail the motivations behind this feature, the
basic idea and API, and open questions, and outlines an incremental
delivery plan.

The proposed API builds on the reflection-based new DoFn [new-do-fn] and
is loosely related to "State and Timers for DoFn" [beam-state].
