Hi everyone

Is there a repository where one can track all proposals, similar to what Flink 
does with its improvement proposals wiki [1]?

[1] https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals

Thanks
Ovidiu

> On 29 Aug 2016, at 12:01, Jean-Baptiste Onofré <[email protected]> wrote:
> 
> Hi Aljoscha,
> 
> Indeed, it's something we discussed during our call.
> 
> AFAIU, it's one function of the tracker. When the DoFn calls tryClaim on the 
> tracker (with an offset, partition id, or any other kind of tracked "ID") and 
> the claim is not possible, then we will update the watermark.
> 
> So the tracker is useful both to determine the "split" and to deal with the 
> watermark.
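> 
> A rough sketch of that control flow, reusing the tryClaim()/resume()/done()
> names from the proposal doc (illustrative only - the exact API is still being
> discussed, and pollSource()/Record here are made-up placeholders):
> 
>   ProcessContinuation processElement(ProcessContext c, OffsetRangeTracker tracker) {
>     for (Record record : pollSource(c.element())) {   // pollSource() is hypothetical
>       if (!tracker.tryClaim(record.offset())) {
>         // Claim refused: the runner has taken a checkpoint. Report how far we
>         // got via the watermark and ask to be resumed later.
>         return resume().withFutureOutputWatermark(record.timestamp());
>       }
>       c.output(record.value());
>     }
>     return done();   // nothing left to read for this restriction
>   }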
> 
> Regards
> JB
> 
> On 08/29/2016 11:55 AM, Aljoscha Krettek wrote:
>> Hi,
>> I have another question about this: currently, unbounded sources have
>> special logic for determining the watermark, and the system periodically
>> asks the sources for the current watermark. As I understood it, watermarks
>> are only "generated" at the sources. How will this work when sources are
>> implemented as a combination of DoFns and SplittableDoFns? Will
>> SplittableDoFns be asked for a watermark, and does this mean that watermarks
>> can then be "generated" at any operation?
>> 
>> Cheers,
>> Aljoscha
>> 
>> On Sun, 21 Aug 2016 at 22:34 Eugene Kirpichov <[email protected]>
>> wrote:
>> 
>>> Hi JB,
>>> 
>>> Yes, I'm assuming you're referring to the "magic" part on the transform
>>> expansion diagram. This is indeed runner-specific, and timers+state are
>>> likely the simplest way to do this for an SDF that does an unbounded
>>> amount of work.
>>> 
>>> On Sun, Aug 21, 2016 at 12:14 PM Jean-Baptiste Onofré <[email protected]>
>>> wrote:
>>> 
>>>> Anyway, from a runner perspective, we will have some kind of API (part of
>>>> the Runner API) to "orchestrate" the SDF, as we discussed during the call,
>>>> right?
>>>> 
>>>> Regards
>>>> JB
>>>> 
>>>> On 08/21/2016 07:24 PM, Eugene Kirpichov wrote:
>>>>> Hi Aljoscha,
>>>>> This is an excellent question! And the answer is, we don't need any new
>>>>> concepts like "SDF executor" and can rely on the per-key state and timers
>>>>> machinery that already exists in all runners because it's necessary to
>>>>> implement windowing/triggering properly.
>>>>> 
>>>>> Note that this is already somewhat addressed in the previously posted
>>>>> State and Timers proposal https://s.apache.org/beam-state , under
>>>>> "per-key workflows".
>>>>> 
>>>>> Think of it this way, using the Kafka example: we'll expand it into a
>>>>> transform:
>>>>> 
>>>>> (1) ParDo { topic -> (unique key, topic, partition, [0, inf)) for
>>>>>     partition in topic.listPartitions() }
>>>>> (2) GroupByKey
>>>>> (3) ParDo { key, topic, partition, R -> Kafka reader code in the
>>>>>     proposal/slides }
>>>>>   - R is the OffsetRange restriction, which in this case will always be
>>>>>     of the form [startOffset, inf).
>>>>>   - There'll be just 1 value per key, but we use GBK just to get access
>>>>>     to the per-key state/timers machinery. This may be runner-specific;
>>>>>     maybe some runners don't need a GBK to do that.
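>>>>> 
>>>>> To make that expansion concrete, here's a rough sketch of the composite
>>>>> transform (KafkaRecord, PartitionRestriction, ReadPartitionFn and
>>>>> listPartitions() are made-up names for illustration; the GBK is only
>>>>> there to get at the per-key state/timers machinery, as described above):
>>>>> 
>>>>>   // Beam SDK imports omitted for brevity.
>>>>>   PCollection<KafkaRecord> expandKafkaRead(PCollection<String> topics) {
>>>>>     return topics
>>>>>         // (1) For each topic, emit one (unique key, restriction) pair per partition.
>>>>>         .apply(ParDo.of(new DoFn<String, KV<String, PartitionRestriction>>() {
>>>>>           @ProcessElement
>>>>>           public void process(ProcessContext c) {
>>>>>             for (int p : listPartitions(c.element())) {   // hypothetical helper
>>>>>               c.output(KV.of(c.element() + "#" + p,
>>>>>                   new PartitionRestriction(c.element(), p, 0L /* startOffset */)));
>>>>>             }
>>>>>           }
>>>>>         }))
>>>>>         // (2) Group by the unique key purely to gain per-key state and timers.
>>>>>         .apply(GroupByKey.<String, PartitionRestriction>create())
>>>>>         // (3) The splittable "read this partition forever" DoFn from the
>>>>>         //     proposal/slides, consuming KV<key, Iterable<restriction>>.
>>>>>         .apply(ParDo.of(new ReadPartitionFn()));
>>>>>   }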
>>>>> 
>>>>> Now suppose the topic has two partitions, P1 and P2, and they get
>>>>> assigned unique keys K1, K2.
>>>>> Then the input to (3) will be a collection of: (K1, topic, P1, [0, inf)),
>>>>> (K2, topic, P2, [0, inf)).
>>>>> Suppose we have just 1 worker with just 1 thread. Now, how will this
>>>>> thread be able to produce elements from both P1 and P2? Here's how.
>>>>> 
>>>>> The thread will process (K1, topic, P1, [0, inf)), checkpoint after a
>>>>> certain time or after a certain number of elements are output (just like
>>>>> with the current UnboundedSource reading code), producing a residual
>>>>> restriction R1' (basically a new start offset), put R1' into the per-key
>>>>> state and set a timer T1 to resume.
>>>>> Then it will process (K2, topic, P2, [0, inf)), do the same, producing a
>>>>> residual restriction R2' and setting a timer T2 to resume.
>>>>> Then timer T1 will fire in the context of the key K1. The thread will
>>>>> call processElement again, this time supplying R1' as the restriction;
>>>>> the process repeats, and after a while it checkpoints and stores R1''
>>>>> into the state of K1.
>>>>> Then timer T2 will fire in the context of K2, run processElement for a
>>>>> while, set a new timer and store R2'' into the state of K2.
>>>>> Etc.
>>>>> If partition 1 goes away, the processElement call will return "do not
>>>>> resume", so a timer will not be set and instead the state associated with
>>>>> K1 will be GC'd.
>>>>> 
>>>>> So basically it's almost like cooperative thread scheduling: things run
>>>>> for a while, until the runner tells them to checkpoint; then they set a
>>>>> timer to resume themselves, the runner fires the timers, and the process
>>>>> repeats. And, again, this only requires things that runners can already
>>>>> do - state and timers - but no new concept of an SDF executor (and
>>>>> consequently no need to choose/tune how many you need).
>>>>> 
>>>>> Makes sense?
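>>>>> 
>>>>> In illustrative pseudocode (not a real Beam API - state(), timer(),
>>>>> hasState(), initialRestrictionOf() and contextFor() are stand-ins for
>>>>> whatever the State and Timers proposal ends up exposing, and this glue is
>>>>> provided by the runner/expansion, not written by the user):
>>>>> 
>>>>>   // Invoked for the initial element of a key and whenever that key's
>>>>>   // resume timer fires.
>>>>>   void runOnce(String key, KafkaReadFn fn) {
>>>>>     OffsetRange restriction = hasState(key)
>>>>>         ? state(key).readRestriction()      // residual from the last run: R1', R1''...
>>>>>         : initialRestrictionOf(key);        // [0, inf) for a fresh partition
>>>>>     OffsetRangeTracker tracker = new OffsetRangeTracker(restriction);
>>>>>     ProcessContinuation cont = fn.processElement(contextFor(key), tracker);
>>>>>     if (cont.shouldResume()) {
>>>>>       state(key).writeRestriction(tracker.checkpoint());  // store the residual restriction
>>>>>       timer(key).setRelativeToNow(cont.resumeDelay());    // timer to run this key again later
>>>>>     } else {
>>>>>       state(key).clear();   // "do not resume": let the per-key state be GC'd
>>>>>     }
>>>>>   }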
>>>>> 
>>>>> On Sat, Aug 20, 2016 at 9:46 AM Aljoscha Krettek <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> I have another question that I think wasn't addressed in the meeting. At
>>>>>> least it wasn't mentioned in the notes.
>>>>>> 
>>>>>> In the context of replacing sources by a combination of two SDFs, how do
>>>>>> you determine how many "SDF executor" instances you need downstream? For
>>>>>> the sake of argument, assume that both SDFs are executed with parallelism
>>>>>> 1 (or one per worker). Now, if you have a file source that reads from a
>>>>>> static set of files, the first SDF would emit the filenames while the
>>>>>> second SDF would receive the filenames and emit their contents. This
>>>>>> works well and the downstream SDF can process one filename after the
>>>>>> other. Now, think of something like a Kafka source. The first SDF would
>>>>>> emit the partitions (say 4 partitions, in this example) and the second
>>>>>> SDF would be responsible for reading from a topic and emitting elements.
>>>>>> Reading from one topic never finishes, so you can't process the topics in
>>>>>> series. I think you would need to have 4 downstream "SDF executor"
>>>>>> instances. The question now is: how do you determine whether you are in
>>>>>> the first or the second situation?
>>>>>> 
>>>>>> Probably I'm just overlooking something and this is already dealt with
>>>>>> somewhere... :-)
>>>>>> 
>>>>>> Cheers,
>>>>>> Aljoscha
>>>>>> 
>>>>>> On Fri, 19 Aug 2016 at 21:02 Ismaël Mejía <[email protected]> wrote:
>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> Thanks for the notes, both Dan and Eugene, and for taking the time to do
>>>>>>> the presentation and answer our questions.
>>>>>>> 
>>>>>>> I mentioned the ongoing work on dynamic scaling on Flink because I
>>>>>>> suppose that it will address dynamic rebalancing eventually (there are
>>>>>>> multiple changes going on for dynamic scaling).
>>>>>>> 
>>>>>>> https://docs.google.com/document/d/1G1OS1z3xEBOrYD4wSu-LuBCyPUWyFd9l3T9WyssQ63w/edit#heading=h.2suhu1fjxmp4
>>>>>>> https://lists.apache.org/[email protected]:lte=1M:FLIP-8
>>>>>>> 
>>>>>>> Anyway, I am far from an expert on Flink, but probably the Flink guys
>>>>>>> can give their opinion about this and point to more precise documents
>>>>>>> than the ones I mentioned.
>>>>>>> 
>>>>>>> Thanks again,
>>>>>>> Ismaël
>>>>>>> 
>>>>>>> On Fri, Aug 19, 2016 at 8:52 PM, Jean-Baptiste Onofré <[email protected]> wrote:
>>>>>>> 
>>>>>>>> Great summary Eugene and Dan.
>>>>>>>> 
>>>>>>>> And thanks again for the details, explanation, and discussion.
>>>>>>>> 
>>>>>>>> Regards
>>>>>>>> JB
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 08/19/2016 08:16 PM, Eugene Kirpichov wrote:
>>>>>>>> 
>>>>>>>>> Thanks for attending, everybody!
>>>>>>>>> 
>>>>>>>>> Here are meeting notes (thanks Dan!).
>>>>>>>>> 
>>>>>>>>> Q: Will SplittableDoFn enable better repartitioning of the
>>>>>>>>> input/output data?
>>>>>>>>> A: Not really; repartitioning is orthogonal to SDF.
>>>>>>>>> 
>>>>>>>>> The current Source API suffers from lack of composition and
>>>>>>>>> scalability because we treat sources too much as metadata, not enough
>>>>>>>>> as data.
>>>>>>>>> 
>>>>>>>>> Q (slide with transform expansion): who does the "magic"?
>>>>>>>>> A: The runner. Checkpointing and dynamically splitting restrictions
>>>>>>>>> will require collaboration with the runner.
>>>>>>>>> 
>>>>>>>>> Q: How does the runner interact with the DoFn to control the
>>>>>>>>> restrictions? Is it related to the centralized job tracker etc.?
>>>>>>>>> A: RestrictionTracker is a simple helper object that exists purely on
>>>>>>>>> the worker while executing a single partition, and interacts with the
>>>>>>>>> worker harness part of the runner. Not to be confused with the
>>>>>>>>> centralized job tracker (master) - completely unrelated. The worker
>>>>>>>>> harness, of course, interacts with the master in some relevant ways
>>>>>>>>> (e.g. the Dataflow master can tell it "you're a straggler, you should
>>>>>>>>> split").
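>>>>>>>>> 
>>>>>>>>> (To illustrate "simple helper object", a minimal, hypothetical
>>>>>>>>> OffsetRangeTracker along the lines sketched in the doc - purely local
>>>>>>>>> bookkeeping on the worker, nothing distributed; the real class will
>>>>>>>>> likely differ in details:)
>>>>>>>>> 
>>>>>>>>>   class OffsetRangeTracker {
>>>>>>>>>     private long start;        // first offset not yet processed
>>>>>>>>>     private long end;          // exclusive; Long.MAX_VALUE for [start, inf)
>>>>>>>>>     private long lastClaimed = -1;
>>>>>>>>> 
>>>>>>>>>     OffsetRangeTracker(long start, long end) {
>>>>>>>>>       this.start = start;
>>>>>>>>>       this.end = end;
>>>>>>>>>     }
>>>>>>>>> 
>>>>>>>>>     long start() { return start; }
>>>>>>>>> 
>>>>>>>>>     // Called by the DoFn before outputting the record at this offset;
>>>>>>>>>     // returns false once the (possibly dynamically shrunk) range is done.
>>>>>>>>>     synchronized boolean tryClaim(long offset) {
>>>>>>>>>       if (offset >= end) {
>>>>>>>>>         return false;
>>>>>>>>>       }
>>>>>>>>>       lastClaimed = offset;
>>>>>>>>>       return true;
>>>>>>>>>     }
>>>>>>>>> 
>>>>>>>>>     // Called by the worker harness to checkpoint or split: this tracker
>>>>>>>>>     // keeps [start, lastClaimed], and the residual [lastClaimed + 1, end)
>>>>>>>>>     // is handed back to the runner to be resumed later.
>>>>>>>>>     synchronized OffsetRangeTracker checkpoint() {
>>>>>>>>>       OffsetRangeTracker residual = new OffsetRangeTracker(lastClaimed + 1, end);
>>>>>>>>>       this.end = lastClaimed + 1;
>>>>>>>>>       return residual;
>>>>>>>>>     }
>>>>>>>>>   }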
>>>>>>>>> 
>>>>>>>>> Q: Is this a new DoFn subclass, or how will this integrate with the
>>>>>>>>> existing code?
>>>>>>>>> A: It's a feature of the reflection-based DoFn
>>>>>>>>> (https://s.apache.org/a-new-do-fn) - just another optional parameter
>>>>>>>>> of type RestrictionTracker to processElement(), which is dynamically
>>>>>>>>> bound via reflection, so it is fully backward/forward compatible and
>>>>>>>>> looks to users like a regular DoFn.
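>>>>>>>>> 
>>>>>>>>> (Roughly, the user-facing shape would be the following sketch - the
>>>>>>>>> element/output types here are invented, and the annotation and method
>>>>>>>>> names may still change as the new DoFn work lands:)
>>>>>>>>> 
>>>>>>>>>   class ReadPartitionFn extends DoFn<KafkaPartition, KafkaRecord> {
>>>>>>>>>     @ProcessElement
>>>>>>>>>     public ProcessContinuation process(ProcessContext c, OffsetRangeTracker tracker) {
>>>>>>>>>       // Same body as a regular DoFn, except that it calls
>>>>>>>>>       // tracker.tryClaim(...) before producing output and returns
>>>>>>>>>       // resume()/done() instead of void.
>>>>>>>>>       return done();
>>>>>>>>>     }
>>>>>>>>>   }
>>>>>>>>> 
>>>>>>>>>   // A non-splittable DoFn simply omits the tracker parameter; the
>>>>>>>>>   // extra argument is discovered via reflection, so existing DoFns
>>>>>>>>>   // are unaffected.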
>>>>>>>>> 
>>>>>>>>> Q: Why is fractionClaimed a double?
>>>>>>>>> A: It's a number in 0..1. Job-wide magic (e.g. autoscaling, dynamic
>>>>>>>>> rebalancing) requires a uniform way to represent progress through
>>>>>>>>> different sources.
>>>>>>>>> 
>>>>>>>>> Q: The Spark runner is microbatch-based, so this seems to map well
>>>>>>>>> onto checkpoint/resume, right?
>>>>>>>>> A: Yes; actually the Dataflow runner is, at a worker level, also
>>>>>>>>> microbatch-based. The way SDF interacts with a runner will be very
>>>>>>>>> similar to how a Bounded/UnboundedSource interacts with a runner.
>>>>>>>>> 
>>>>>>>>> Q: Using SDF, what would be the "packaging" of the IO?
>>>>>>>>> A: Same as currently: package IO's as PTransforms, and their
>>>>>>>>> implementation under the hood can be anything: Source, simple ParDo's,
>>>>>>>>> SDF, etc. E.g. Datastore was recently refactored from BoundedSource to
>>>>>>>>> ParDo (it ended up simpler and more scalable), transparently to users.
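>>>>>>>>> 
>>>>>>>>> (A sketch of what "same packaging as currently" means - FooIO,
>>>>>>>>> FooRecord and ReadFooFn are invented names, and only the structure
>>>>>>>>> matters: the with* setters and the PTransform stay, while the
>>>>>>>>> implementation behind the transform is free to change from Source to
>>>>>>>>> ParDo's/SDF without users noticing:)
>>>>>>>>> 
>>>>>>>>>   public class FooIO {
>>>>>>>>>     public static Read read() { return new Read(); }
>>>>>>>>> 
>>>>>>>>>     public static class Read extends PTransform<PBegin, PCollection<FooRecord>> {
>>>>>>>>>       private String uri;
>>>>>>>>> 
>>>>>>>>>       // Usual with* configuration setter, as in today's IOs.
>>>>>>>>>       public Read withUri(String uri) { this.uri = uri; return this; }
>>>>>>>>> 
>>>>>>>>>       @Override
>>>>>>>>>       public PCollection<FooRecord> expand(PBegin input) {
>>>>>>>>>         // Today this might wrap a Source; tomorrow it can be plain
>>>>>>>>>         // ParDo's or an SDF - the public API does not change.
>>>>>>>>>         return input
>>>>>>>>>             .apply(Create.of(uri))
>>>>>>>>>             .apply(ParDo.of(new ReadFooFn()));
>>>>>>>>>       }
>>>>>>>>>     }
>>>>>>>>>   }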
>>>>>>>>> 
>>>>>>>>> Q: What's the timeline; what to do with the IOs currently in
>>>>>>>>> development?
>>>>>>>>> A: Timeline is O(months). Keep doing what you're doing: work on top of
>>>>>>>>> the Source APIs when necessary and simple ParDo's otherwise.
>>>>>>>>> 
>>>>>>>>> Q: What's the impact for the runner writers?
>>>>>>>>> A: Tentatively, we expect that most of the code for running an SDF
>>>>>>>>> will be common to runners, with some amount of per-runner glue code,
>>>>>>>>> just like GBK/windowing/triggering. The impact on the Dataflow runner
>>>>>>>>> is larger since it supports dynamic rebalancing in batch mode and this
>>>>>>>>> is the hardest part, but for other runners it shouldn't be too hard.
>>>>>>>>> 
>>>>>>>>> JB: Talend has people who can help with this: e.g. help integrate into
>>>>>>>>> the Spark runner, refactor IOs, etc. Amit is also willing to chat
>>>>>>>>> about supporting SDF in the Spark runner.
>>>>>>>>> 
>>>>>>>>> Ismael: There's a Flink proposal about dynamic rebalancing. Ismael
>>>>>>>>> will send a link.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, Aug 19, 2016 at 7:21 AM Jean-Baptiste Onofré <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Eugene,
>>>>>>>>>> 
>>>>>>>>>> Thanks for the reminder.
>>>>>>>>>> 
>>>>>>>>>> Just to prepare some topics for the call, please find some points:
>>>>>>>>>> 
>>>>>>>>>> 1. Using SDF, what would be the "packaging" of the IO? It sounds to
>>>>>>>>>> me that we can keep the IO packaging style (using with* setters for
>>>>>>>>>> the IO configuration) and replace PTransform, Source, Reader, ...
>>>>>>>>>> directly with SDF. Correct?
>>>>>>>>>> 
>>>>>>>>>> 2. What's your plan in terms of release to include SDF? We have
>>>>>>>>>> several IOs in preparation and I wonder if it's worth starting to use
>>>>>>>>>> the new SDF API or not.
>>>>>>>>>> 
>>>>>>>>>> 3. What's the impact for the runner writers? The runners will have to
>>>>>>>>>> support SDF, and that could be tricky depending on the execution
>>>>>>>>>> engine. In the worst case, where a runner can't fully support SDF,
>>>>>>>>>> does it mean that most of our IOs will be useless?
>>>>>>>>>> 
>>>>>>>>>> Just my dumb topics ;)
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> See you at 8am!
>>>>>>>>>> 
>>>>>>>>>> Regards
>>>>>>>>>> JB
>>>>>>>>>> 
>>>>>>>>>> On 08/19/2016 02:29 AM, Eugene Kirpichov wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hello everybody,
>>>>>>>>>>> 
>>>>>>>>>>> Just a reminder:
>>>>>>>>>>> 
>>>>>>>>>>> The meeting is happening tomorrow - Friday Aug 19th, 8am-9am PST. To
>>>>>>>>>>> join the call, go to
>>>>>>>>>>> https://hangouts.google.com/hangouts/_/google.com/splittabledofn .
>>>>>>>>>>> I intend to go over the proposed design and then have a free-form
>>>>>>>>>>> discussion.
>>>>>>>>>>> 
>>>>>>>>>>> Please have a skim through the proposal doc:
>>>>>>>>>>> https://s.apache.org/splittable-do-fn
>>>>>>>>>>> I also made some slides that are basically a trimmed-down version of
>>>>>>>>>>> the doc, to use as a guide when conducting the meeting:
>>>>>>>>>>> https://docs.google.com/presentation/d/1DfCp9VRLH-5Ap5nbz047Fiv3NfjrywJVSdef5dRY4aE/edit?usp=sharing
>>>>>>>>>>> 
>>>>>>>>>>> I will post notes from the meeting on this thread afterwards.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks, looking forward.
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Aug 12, 2016 at 5:35 PM Dan Halperin <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> This is pretty cool! I'll be there too. (Unless the hangout gets
>>>>>>>>>>>> too full -- if so, I'll drop out in favor of others who aren't
>>>>>>>>>>>> lucky enough to get to talk to Eugene all the time.)
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Aug 12, 2016 at 4:03 PM, Andrew Psaltis <[email protected]> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> +1 I'll join
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Friday, August 12, 2016, Aparup Banerjee (apbanerj) <[email protected]> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> + 1, me2
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 8/12/16, 9:27 AM, "Amit Sela" <[email protected]> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> +1 as in I'll join ;-)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Fri, Aug 12, 2016, 19:14 Eugene Kirpichov <[email protected]> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Sounds good, thanks!
>>>>>>>>>>>>>>>> Then Friday Aug 19th it is, 8am-9am PST,
>>>>>>>>>>>>>>>> https://staging.talkgadget.google.com/hangouts/_/google.com/splittabledofn
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Thu, Aug 11, 2016 at 11:12 PM Jean-Baptiste Onofré <[email protected]> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Unfortunately I will be in Ireland on August 15th. What about
>>>>>>>>>>>>>>>>> Friday 19th?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>> JB
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Aug 11, 2016, 23:22, at 23:22, Eugene Kirpichov
>>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi JB,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Sounds great; does the suggested time over videoconference
>>>>>>>>>>>>>>>>>> work for you?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Thu, Aug 11, 2016 at 11:59 AM Jean-Baptiste Onofré
>>>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Eugene,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> May we talk together next week? I like the proposal. I would
>>>>>>>>>>>>>>>>>>> just need some details for my understanding.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>> JB
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Aug 11, 2016, 19:46, at 19:46, Eugene Kirpichov
>>>>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi JB,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> What are your thoughts on this?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I'm also thinking of having a virtual meeting to explain
>>>>>>>>>>>>>>>>>>>> more about this proposal if necessary, since I understand
>>>>>>>>>>>>>>>>>>>> it is a lot to digest.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> How about: Monday Aug 15, 8am-9am Pacific time, over
>>>>>>>>>>>>>>>>>>>> Hangouts? (link:
>>>>>>>>>>>>>>>>>>>> https://staging.talkgadget.google.com/hangouts/_/google.com/splittabledofn
>>>>>>>>>>>>>>>>>>>> - I confirmed that it can be joined without being logged
>>>>>>>>>>>>>>>>>>>> into a Google account)
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Who'd be interested in attending, and does this time/date
>>>>>>>>>>>>>>>>>>>> work for people?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Fri, Aug 5, 2016 at 10:48 AM Eugene Kirpichov
>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi JB, thanks for reading and for your comments!
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> It sounds like you are concerned about continued support
>>>>>>>>>>>>>>>>>>>>> for existing IO's people have developed, and about
>>>>>>>>>>>>>>>>>>>>> backward compatibility?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> We do not need to remove the Source API, and all existing
>>>>>>>>>>>>>>>>>>>>> Source-based connectors will continue to work [though the
>>>>>>>>>>>>>>>>>>>>> document proposes at some point to make Read.from(Source)
>>>>>>>>>>>>>>>>>>>>> translate to a wrapper SDF under the hood, to exercise the
>>>>>>>>>>>>>>>>>>>>> feature more and to make sure that it is strictly more
>>>>>>>>>>>>>>>>>>>>> powerful - but this is an optional implementation detail].
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Perhaps the document phrases this too strongly -
>>>>>>>>>>>>>>>>>>>>> "replacing the Source API": a better phrasing would be
>>>>>>>>>>>>>>>>>>>>> "introducing a new API so powerful and easy-to-use that
>>>>>>>>>>>>>>>>>>>>> hopefully people will choose it over the Source API all
>>>>>>>>>>>>>>>>>>>>> the time, even though they don't have to" :) And we can
>>>>>>>>>>>>>>>>>>>>> discuss whether or not to actually deprecate/remove the
>>>>>>>>>>>>>>>>>>>>> Source API at some point down the road, once it becomes
>>>>>>>>>>>>>>>>>>>>> clear whether this is the case or not.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> To give more context: this proposal came out of
>>>>>>>>>>>>>>>>>>>>> discussions within the SDK team over the past ~1.5 years,
>>>>>>>>>>>>>>>>>>>>> before the Beam project existed, on how to make major
>>>>>>>>>>>>>>>>>>>>> improvements to the Source API; perhaps it will clarify
>>>>>>>>>>>>>>>>>>>>> things if I give a history of the ideas discussed:
>>>>>>>>>>>>>>>>>>>>> - The first idea was to introduce a
>>>>>>>>>>>>>>>>>>>>>   Read.from(PCollection<Source>) transform while keeping
>>>>>>>>>>>>>>>>>>>>>   the Source API intact - this, given an appropriate
>>>>>>>>>>>>>>>>>>>>>   implementation, would solve most of the scalability and
>>>>>>>>>>>>>>>>>>>>>   composability issues of IO's. Then most connectors would
>>>>>>>>>>>>>>>>>>>>>   look like: ParDo<A, Source<B>> + Read.from().
>>>>>>>>>>>>>>>>>>>>> - Then we figured that the Source class is an unnecessary
>>>>>>>>>>>>>>>>>>>>>   abstraction, as it simply holds data. What if we only
>>>>>>>>>>>>>>>>>>>>>   had a Reader<S, B> class where S is the source type and
>>>>>>>>>>>>>>>>>>>>>   B the output type? Then connectors would be something
>>>>>>>>>>>>>>>>>>>>>   like: ParDo<A, S> + hypothetical Read.using(Reader<S, B>).
>>>>>>>>>>>>>>>>>>>>> - Then somebody remarked that some of the features of
>>>>>>>>>>>>>>>>>>>>>   Source are useful to ParDo's as well: e.g. the ability
>>>>>>>>>>>>>>>>>>>>>   to report progress when processing a very heavy element,
>>>>>>>>>>>>>>>>>>>>>   or the ability to produce very large output in parallel.
>>>>>>>>>>>>>>>>>>>>> - The two previous bullets were already hinting that the
>>>>>>>>>>>>>>>>>>>>>   Read.using() primitive might not be so special: it just
>>>>>>>>>>>>>>>>>>>>>   takes S and produces B: isn't that what a ParDo does,
>>>>>>>>>>>>>>>>>>>>>   plus some source magic, minus the convenience of
>>>>>>>>>>>>>>>>>>>>>   c.output() vs. the start/advance() state machine?
>>>>>>>>>>>>>>>>>>>>> - At this point it became clear that we should explore
>>>>>>>>>>>>>>>>>>>>>   unifying sources and ParDo's, in particular: can we
>>>>>>>>>>>>>>>>>>>>>   bring the magic of sources to ParDo's but without the
>>>>>>>>>>>>>>>>>>>>>   limitations and coding inconveniences? And this is how
>>>>>>>>>>>>>>>>>>>>>   SplittableDoFn was born: bringing source magic to a DoFn
>>>>>>>>>>>>>>>>>>>>>   by providing a RangeTracker.
>>>>>>>>>>>>>>>>>>>>> - Once the idea of "splittable DoFn's" was born, it became
>>>>>>>>>>>>>>>>>>>>>   clear that it is strictly more general than sources; at
>>>>>>>>>>>>>>>>>>>>>   least, in the respect that sources have to produce
>>>>>>>>>>>>>>>>>>>>>   output, while DoFn's don't: an SDF may very well produce
>>>>>>>>>>>>>>>>>>>>>   no output at all, and simply perform a side effect in a
>>>>>>>>>>>>>>>>>>>>>   parallel/resumable way.
>>>>>>>>>>>>>>>>>>>>> - Then there were countless hours of discussions on
>>>>>>>>>>>>>>>>>>>>>   unifying the bounded/unbounded cases, on the particulars
>>>>>>>>>>>>>>>>>>>>>   of RangeTracker APIs reconciling parallelization and
>>>>>>>>>>>>>>>>>>>>>   checkpointing, on what the relation between SDF and DoFn
>>>>>>>>>>>>>>>>>>>>>   should be, etc. They culminated in the current proposal.
>>>>>>>>>>>>>>>>>>>>>   The proposal comes at a time when a couple of key
>>>>>>>>>>>>>>>>>>>>>   ingredients are (almost) ready: NewDoFn to make SDF look
>>>>>>>>>>>>>>>>>>>>>   like a regular DoFn, and the State/Timers proposal to
>>>>>>>>>>>>>>>>>>>>>   enable unbounded work per element.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> To put it shortly:
>>>>>>>>>>>>>>>>>>>>> - Yes, we will support existing Source connectors, and
>>>>>>>>>>>>>>>>>>>>>   will support writing new ones, possibly forever. There
>>>>>>>>>>>>>>>>>>>>>   is no interference with current users of Source.
>>>>>>>>>>>>>>>>>>>>> - The new API is an attempt to improve the Source API,
>>>>>>>>>>>>>>>>>>>>>   taken to its logical limit, where it turns out that
>>>>>>>>>>>>>>>>>>>>>   users' goals can be accomplished more easily and more
>>>>>>>>>>>>>>>>>>>>>   generically entirely within ParDo's.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Let me know what you think, and thanks again!
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Fri, Aug 5, 2016 at 2:39 AM Jean-Baptiste Onofré
>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi Eugene,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Just a question: why is it in DoFn and not an improvement
>>>>>>>>>>>>>>>>>>>>>> of Source?
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> If I understand correctly, it means that we will have to
>>>>>>>>>>>>>>>>>>>>>> refactor all existing IOs: basically, what you propose is
>>>>>>>>>>>>>>>>>>>>>> to remove all Sources and replace them with NewDoFn.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I'm concerned with this approach, especially in terms of
>>>>>>>>>>>>>>>>>>>>>> timing: clearly, the IO is the area where we have to move
>>>>>>>>>>>>>>>>>>>>>> forward in Beam, as it will allow new users to start on
>>>>>>>>>>>>>>>>>>>>>> their projects.
>>>>>>>>>>>>>>>>>>>>>> So, we started to bring new IOs: Kafka, JMS, Cassandra,
>>>>>>>>>>>>>>>>>>>>>> MongoDB, JDBC, ... and some people started to learn the
>>>>>>>>>>>>>>>>>>>>>> IO API (Bounded/Unbounded source, etc).
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I think it would make more sense to enhance the IO API
>>>>>>>>>>>>>>>>>>>>>> (Source) instead of introducing a NewDoFn.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> What are your thoughts for IO writers like me? ;)
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>> JB
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On 08/04/2016 07:45 PM, Eugene Kirpichov wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hello Beam community,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> We (myself, Daniel Mills and Robert Bradshaw) would like
>>>>>>>>>>>>>>>>>>>>>>> to propose "Splittable DoFn" - a major generalization of
>>>>>>>>>>>>>>>>>>>>>>> DoFn, which allows processing of a single element to be
>>>>>>>>>>>>>>>>>>>>>>> non-monolithic, i.e. checkpointable and parallelizable,
>>>>>>>>>>>>>>>>>>>>>>> as well as doing an unbounded amount of work per element.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> This allows effectively replacing the current
>>>>>>>>>>>>>>>>>>>>>>> Bounded/UnboundedSource APIs with DoFn's that are much
>>>>>>>>>>>>>>>>>>>>>>> easier to code, more scalable and composable with the
>>>>>>>>>>>>>>>>>>>>>>> rest of the Beam programming model, and enables many use
>>>>>>>>>>>>>>>>>>>>>>> cases that were previously difficult or impossible, as
>>>>>>>>>>>>>>>>>>>>>>> well as some non-obvious new use cases.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> This proposal has been mentioned before in JIRA [BEAM-65]
>>>>>>>>>>>>>>>>>>>>>>> and some Beam meetings, and now the whole thing is
>>>>>>>>>>>>>>>>>>>>>>> written up in a document:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>         https://s.apache.org/splittable-do-fn
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Here are some things that become possible with
>>>>>>>>>>>>>>>>>>>>>>> Splittable DoFn:
>>>>>>>>>>>>>>>>>>>>>>> - Efficiently read a filepattern matching millions of
>>>>>>>>>>>>>>>>>>>>>>>   files
>>>>>>>>>>>>>>>>>>>>>>> - Read a collection of files that are produced by an
>>>>>>>>>>>>>>>>>>>>>>>   earlier step in the pipeline (e.g. easily implement a
>>>>>>>>>>>>>>>>>>>>>>>   connector to a storage system that can export itself
>>>>>>>>>>>>>>>>>>>>>>>   to files)
>>>>>>>>>>>>>>>>>>>>>>> - Implement a Kafka reader by composing a "list
>>>>>>>>>>>>>>>>>>>>>>>   partitions" DoFn with a DoFn that simply polls a
>>>>>>>>>>>>>>>>>>>>>>>   consumer and outputs new records in a while() loop
>>>>>>>>>>>>>>>>>>>>>>> - Implement a log tailer by composing a DoFn that
>>>>>>>>>>>>>>>>>>>>>>>   incrementally returns new files in a directory and a
>>>>>>>>>>>>>>>>>>>>>>>   DoFn that tails a file
>>>>>>>>>>>>>>>>>>>>>>> - Implement a parallel "count friends in common"
>>>>>>>>>>>>>>>>>>>>>>>   algorithm (matrix squaring) with good work balancing
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Here is the meaningful part of a hypothetical Kafka
>>>>>>>>>>>>>>>>>>>>>>> reader written against this API:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>   ProcessContinuation processElement(
>>>>>>>>>>>>>>>>>>>>>>>       ProcessContext context, OffsetRangeTracker tracker) {
>>>>>>>>>>>>>>>>>>>>>>>     try (KafkaConsumer<String, String> consumer =
>>>>>>>>>>>>>>>>>>>>>>>         Kafka.subscribe(context.element().topic,
>>>>>>>>>>>>>>>>>>>>>>>                         context.element().partition)) {
>>>>>>>>>>>>>>>>>>>>>>>       consumer.seek(tracker.start());
>>>>>>>>>>>>>>>>>>>>>>>       while (true) {
>>>>>>>>>>>>>>>>>>>>>>>         ConsumerRecords<String, String> records =
>>>>>>>>>>>>>>>>>>>>>>>             consumer.poll(100ms);
>>>>>>>>>>>>>>>>>>>>>>>         if (records == null) return done();
>>>>>>>>>>>>>>>>>>>>>>>         for (ConsumerRecord<String, String> record : records) {
>>>>>>>>>>>>>>>>>>>>>>>           if (!tracker.tryClaim(record.offset())) {
>>>>>>>>>>>>>>>>>>>>>>>             return resume().withFutureOutputWatermark(record.timestamp());
>>>>>>>>>>>>>>>>>>>>>>>           }
>>>>>>>>>>>>>>>>>>>>>>>           context.output(record);
>>>>>>>>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> The document describes in detail the motivations behind
>>>>>>>>>>>>>>>>>>>>>>> this feature, the basic idea and API, open questions,
>>>>>>>>>>>>>>>>>>>>>>> and outlines an incremental delivery plan.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> The proposed API builds on the reflection-based new DoFn
>>>>>>>>>>>>>>>>>>>>>>> [new-do-fn] and is loosely related to "State and Timers
>>>>>>>>>>>>>>>>>>>>>>> for DoFn" [beam-state].
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Please take a look and comment!
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> [BEAM-65] https://issues.apache.org/jira/browse/BEAM-65
>>>>>>>>>>>>>>>>>>>>>>> [new-do-fn] https://s.apache.org/a-new-do-fn
>>>>>>>>>>>>>>>>>>>>>>> [beam-state] https://s.apache.org/beam-state
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>>>>>>>>>>>>>>> [email protected] <javascript:;>
>>>>>>>>>>>>>>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>>>>>>>>>>>>>>> Talend - http://www.talend.com
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Subscribe to my book: Streaming Data <http://manning.com/psaltis>
>>>>>>>>>>>>> LinkedIn: <https://www.linkedin.com/pub/andrew-psaltis/1/17b/306>
>>>>>>>>>>>>> Twitter: @itmdata <http://twitter.com/intent/user?screen_name=itmdata>
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>>> [email protected]
>>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>>> Talend - http://www.talend.com
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> --
>>>>>>>> Jean-Baptiste Onofré
>>>>>>>> [email protected]
>>>>>>>> http://blog.nanthrax.net
>>>>>>>> Talend - http://www.talend.com
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> --
>>>> Jean-Baptiste Onofré
>>>> [email protected]
>>>> http://blog.nanthrax.net
>>>> Talend - http://www.talend.com
>>>> 
>>> 
>> 
> 
> -- 
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
> 
> 
