Amit - 8am is fine with me, let's do that. On Wed, Mar 15, 2017 at 6:00 AM Jean-Baptiste Onofré <[email protected]> wrote:
> Hi, > > Anyway, I hope it will result with some notes on the mailing list as it > could be > helpful. > > I'm not against a video call to move forward, but, from ma community > perspective, we should always provide minute notes on the mailing list. > > Unfortunately, next Friday, I will still be in China, so not possible to > join > (even if I would have like to participate :(). > > Regards > JB > > On 03/15/2017 07:45 PM, Amit Sela wrote: > > I have dinner at 9am.. which doesn't sound like a real thing if you > forget > > about timezones J > > How about 8am ? or something later like 12pm mid-day ? > > Apex can take the 9am time slot ;-) > > > > On Wed, Mar 15, 2017 at 4:28 AM Eugene Kirpichov > > <[email protected]> wrote: > > > >> Hi! Please feel free to join this call, but I think we'd be mostly > >> discussing how to do it in the Spark runner in particular; so we'll > >> probably need another call for Apex anyway. > >> > >> On Tue, Mar 14, 2017 at 6:54 PM Thomas Weise <[email protected]> wrote: > >> > >>> Hi Eugene, > >>> > >>> This would work for me also. Please let me know if you want to keep the > >>> Apex related discussion separate or want me to join this call. > >>> > >>> Thanks, > >>> Thomas > >>> > >>> > >>> On Tue, Mar 14, 2017 at 1:56 PM, Eugene Kirpichov < > >>> [email protected]> wrote: > >>> > >>>> Sure, Friday morning sounds good. How about 9am Friday PST, at > >> videocall > >>> by > >>>> link https://hangouts.google.com/hangouts/_/google.com/splittabledofn > >> ? > >>>> > >>>> On Mon, Mar 13, 2017 at 10:30 PM Amit Sela <[email protected]> > >> wrote: > >>>> > >>>>> PST mornings are better, because they are evening/nights for me. > >> Friday > >>>>> would work-out best for me. > >>>>> > >>>>> On Mon, Mar 13, 2017 at 11:46 PM Eugene Kirpichov > >>>>> <[email protected]> wrote: > >>>>> > >>>>>> Awesome!!! > >>>>>> > >>>>>> Amit - remind me your time zone? JB, do you want to join? > >>>>>> I'm free this week all afternoons (say after 2pm) in Pacific Time, > >>> and > >>>>>> mornings of Wed & Fri. We'll probably need half an hour to an hour. > >>>>>> > >>>>>> On Mon, Mar 13, 2017 at 1:29 PM Aljoscha Krettek < > >>> [email protected]> > >>>>>> wrote: > >>>>>> > >>>>>>> I whipped up a quick version for Flink that seems to work: > >>>>>>> https://github.com/apache/beam/pull/2235 > >>>>>>> > >>>>>>> There are still two failing tests, as described in the PR. > >>>>>>> > >>>>>>> On Mon, Mar 13, 2017, at 20:10, Amit Sela wrote: > >>>>>>>> +1 for a video call. I think it should be pretty straight > >> forward > >>>> for > >>>>>> the > >>>>>>>> Spark runner after the work on read from UnboundedSource and > >>> after > >>>>>>>> GroupAlsoByWindow, but from my experience such a call could > >> move > >>> us > >>>>>>>> forward > >>>>>>>> fast enough. > >>>>>>>> > >>>>>>>> On Mon, Mar 13, 2017, 20:37 Eugene Kirpichov < > >>> [email protected] > >>>>> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Hi all, > >>>>>>>>> > >>>>>>>>> Let us continue working on this. I am back from various > >> travels > >>>> and > >>>>>> am > >>>>>>>>> eager to help. > >>>>>>>>> > >>>>>>>>> Amit, JB - would you like to perhaps have a videocall to hash > >>>> this > >>>>>> out > >>>>>>> for > >>>>>>>>> the Spark runner? > >>>>>>>>> > >>>>>>>>> Aljoscha - are the necessary Flink changes done / or is the > >>> need > >>>>> for > >>>>>>> them > >>>>>>>>> obviated by using the (existing) runner-facing state/timer > >>> APIs? > >>>>>>> Should we > >>>>>>>>> have a videocall too? > >>>>>>>>> > >>>>>>>>> Thomas - what do you think about getting this into Apex > >> runner? > >>>>>>>>> > >>>>>>>>> (I think videocalls will allow to make rapid progress, but > >> it's > >>>>>>> probably a > >>>>>>>>> better idea to keep them separate since they'll involve a lot > >>> of > >>>>>>>>> runner-specific details) > >>>>>>>>> > >>>>>>>>> PS - The completion of this in Dataflow streaming runner is > >>>>> currently > >>>>>>>>> waiting only on having a small service-side change > >> implemented > >>>> and > >>>>>>> rolled > >>>>>>>>> out for termination of streaming jobs. > >>>>>>>>> > >>>>>>>>> On Wed, Feb 8, 2017 at 10:55 AM Kenneth Knowles < > >>> [email protected]> > >>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> I recommend proceeding with the runner-facing state & timer > >>> APIs; > >>>>>> they > >>>>>>> are > >>>>>>>>> lower-level and more appropriate for this. All runners > >> provide > >>>> them > >>>>>> or > >>>>>>> use > >>>>>>>>> runners/core implementations, as they are needed for > >>> triggering. > >>>>>>>>> > >>>>>>>>> On Wed, Feb 8, 2017 at 10:34 AM, Eugene Kirpichov < > >>>>>>> [email protected]> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> Thanks Aljoscha! > >>>>>>>>> > >>>>>>>>> Minor note: I'm not familiar with what level of support for > >>>> timers > >>>>>>> Flink > >>>>>>>>> currently has - however SDF in Direct and Dataflow runner > >>>> currently > >>>>>>> does > >>>>>>>>> not use the user-facing state/timer APIs - rather, it uses > >> the > >>>>>>>>> runner-facing APIs (StateInternals and TimerInternals) - > >>> perhaps > >>>>>> Flink > >>>>>>>>> already implements these. We may want to change this, but for > >>> now > >>>>>> it's > >>>>>>> good > >>>>>>>>> enough (besides, SDF uses watermark holds, which are not > >>>> supported > >>>>> by > >>>>>>> the > >>>>>>>>> user-facing state API yet). > >>>>>>>>> > >>>>>>>>> On Wed, Feb 8, 2017 at 10:19 AM Aljoscha Krettek < > >>>>>>>>> [email protected]> wrote: > >>>>>>>>> > >>>>>>>>> Thanks for the motivation, Eugene! :-) > >>>>>>>>> > >>>>>>>>> I've wanted to do this for a while now but was waiting for > >> the > >>>>> Flink > >>>>>>> 1.2 > >>>>>>>>> release (which happened this week)! There's some prerequisite > >>>> work > >>>>> to > >>>>>>> be > >>>>>>>>> done on the Flink runner: we'll move to the new timer > >>> interfaces > >>>>>>> introduced > >>>>>>>>> in Flink 1.2 and implement support for both the user facing > >>> state > >>>>> and > >>>>>>> timer > >>>>>>>>> APIs. This should make implementation of SDF easier. > >>>>>>>>> > >>>>>>>>> On Wed, Feb 8, 2017 at 7:06 PM, Eugene Kirpichov < > >>>>>> [email protected] > >>>>>>>> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> Thanks! Looking forward to this work. > >>>>>>>>> > >>>>>>>>> On Wed, Feb 8, 2017 at 3:50 AM Jean-Baptiste Onofré < > >>>>> [email protected] > >>>>>>> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> Thanks for the update Eugene. > >>>>>>>>> > >>>>>>>>> I will work on the spark runner with Amit. > >>>>>>>>> > >>>>>>>>> Regards > >>>>>>>>> JB > >>>>>>>>> > >>>>>>>>> On Feb 7, 2017, 19:12, at 19:12, Eugene Kirpichov > >>>>>>>>> <[email protected]> wrote: > >>>>>>>>>> Hello, > >>>>>>>>>> > >>>>>>>>>> I'm almost done adding support for Splittable DoFn > >>>>>>>>>> http://s.apache.org/splittable-do-fn to Dataflow streaming > >>>>> runner*, > >>>>>>> and > >>>>>>>>>> very excited about that. There's only 1 PR > >>>>>>>>>> <https://github.com/apache/beam/pull/1898> remaining, plus > >>>>> enabling > >>>>>>>>>> some > >>>>>>>>>> tests. > >>>>>>>>>> > >>>>>>>>>> * (batch runner is much harder because it's not yet quite > >>> clear > >>>> to > >>>>>> me > >>>>>>>>>> how > >>>>>>>>>> to properly implement liquid sharding > >>>>>>>>>> < > >>>>>>>>> > >>>>>>> > >>>>>> > >>>>> https://cloud.google.com/blog/big-data/2016/05/no-shard- > >>>> left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow > >>>>>>>>>> > >>>>>>>>>> with > >>>>>>>>>> SDF - and the current API is not ready for that yet) > >>>>>>>>>> > >>>>>>>>>> After implementing all the runner-agnostic parts of > >> Splittable > >>>>>> DoFn, I > >>>>>>>>>> found them quite easy to integrate into Dataflow streaming > >>>> runner, > >>>>>> and > >>>>>>>>>> I > >>>>>>>>>> think this means it should be easy to integrate into other > >>>> runners > >>>>>>> too. > >>>>>>>>>> > >>>>>>>>>> ====== Why it'd be cool ====== > >>>>>>>>>> The general benefits of SDF are well-described in the design > >>> doc > >>>>>>>>>> (linked > >>>>>>>>>> above). > >>>>>>>>>> As for right now - if we integrated SDF with all runners, > >> it'd > >>>>>> already > >>>>>>>>>> enable us to start greatly simplifying the code of existing > >>>>>> streaming > >>>>>>>>>> connectors (CountingInput, Kafka, Pubsub, JMS) and writing > >> new > >>>>>>>>>> connectors > >>>>>>>>>> (e.g. a really nice one to implement would be "directory > >>>> watcher", > >>>>>>> that > >>>>>>>>>> continuously returns new files in a directory). > >>>>>>>>>> > >>>>>>>>>> As a teaser, here's the complete implementation of an > >>> "unbounded > >>>>>>>>>> counter" I > >>>>>>>>>> used for my test of Dataflow runner integration: > >>>>>>>>>> > >>>>>>>>>> class CountFn extends DoFn<String, String> { > >>>>>>>>>> @ProcessElement > >>>>>>>>>> public ProcessContinuation process(ProcessContext c, > >>>>>>> OffsetRangeTracker > >>>>>>>>>> tracker) { > >>>>>>>>>> for (int i = tracker.currentRestriction().getFrom(); > >>>>>>>>>> tracker.tryClaim(i); ++i) c.output(i); > >>>>>>>>>> return resume(); > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> @GetInitialRestriction > >>>>>>>>>> public OffsetRange getInitialRange(String element) { > >>> return > >>>>> new > >>>>>>>>>> OffsetRange(0, Integer.MAX_VALUE); } > >>>>>>>>>> > >>>>>>>>>> @NewTracker > >>>>>>>>>> public OffsetRangeTracker newTracker(OffsetRange range) { > >>>>> return > >>>>>>> new > >>>>>>>>>> OffsetRangeTracker(range); } > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> ====== What I'm asking ====== > >>>>>>>>>> So, I'd like to ask for help integrating SDF into Spark, > >> Flink > >>>> and > >>>>>>> Apex > >>>>>>>>>> runners from people who are intimately familiar with them - > >>>>>>>>>> specifically, I > >>>>>>>>>> was hoping best-case I could nerd-snipe some of you into > >>> taking > >>>>> over > >>>>>>>>>> the > >>>>>>>>>> integration of SDF with your favorite runner ;) > >>>>>>>>>> > >>>>>>>>>> The proper set of people seems to be +Aljoscha Krettek > >>>>>>>>>> <[email protected]> +Maximilian Michels > >>>>>>>>>> <[email protected]> > >>>>>>>>>> [email protected] <[email protected]> +Amit Sela > >>>>>>>>>> <[email protected]> +Thomas > >>>>>>>>>> Weise unless I forgot somebody. > >>>>>>>>>> > >>>>>>>>>> Average-case, I was looking for runner-specific guidance on > >>> how > >>>> to > >>>>>> do > >>>>>>>>>> it > >>>>>>>>>> myself. > >>>>>>>>>> > >>>>>>>>>> ====== If you want to help ====== > >>>>>>>>>> If somebody decides to take this over, in my absence (I'll > >> be > >>>>> mostly > >>>>>>>>>> gone > >>>>>>>>>> for ~the next month)., the best people to ask for > >>> implementation > >>>>>>>>>> advice are +Kenn > >>>>>>>>>> Knowles <[email protected]> and +Daniel Mills < > >> [email protected] > >>>> > >>>> . > >>>>>>>>>> > >>>>>>>>>> For reference, here's how SDF is implemented in the direct > >>>> runner: > >>>>>>>>>> - Direct runner overrides > >>>>>>>>>> < > >>>>>>>>> > >>>>>>> > >>>>>> > >>>>> https://github.com/apache/beam/blob/0616245e654c60ae94cc2c188f857b > >>>> 74a62d9b24/runners/direct-java/src/main/java/org/apache/ > >>>> beam/runners/direct/ParDoMultiOverrideFactory.java > >>>>>>>>>> > >>>>>>>>>> ParDo.of() for a splittable DoFn and replaces it with > >>>>>> SplittableParDo > >>>>>>>>>> < > >>>>>>>>> > >>>>>>> > >>>>>> > >>>>> https://github.com/apache/beam/blob/master/runners/core- > >>>> java/src/main/java/org/apache/beam/runners/core/SplittableParDo.java > >>>>>>>>>> > >>>>>>>>>> (common > >>>>>>>>>> transform expansion) > >>>>>>>>>> - SplittableParDo uses two runner-specific primitive > >>> transforms: > >>>>>>>>>> "GBKIntoKeyedWorkItems" and "SplittableProcessElements". > >>> Direct > >>>>>> runner > >>>>>>>>>> overrides the first one like this > >>>>>>>>>> < > >>>>>>>>> > >>>>>>> > >>>>>> > >>>>> https://github.com/apache/beam/blob/cc28f0cb4c44169f933475ae29a325 > >>>> 99024d3a1f/runners/direct-java/src/main/java/org/apache/ > >>>> beam/runners/direct/DirectGBKIntoKeyedWorkItemsOverrideFactory.java > >>>>>>>>>> , > >>>>>>>>>> and directly implements evaluation of the second one like > >> this > >>>>>>>>>> < > >>>>>>>>> > >>>>>>> > >>>>>> > >>>>> https://github.com/apache/beam/blob/cc28f0cb4c44169f933475ae29a325 > >>>> 99024d3a1f/runners/direct-java/src/main/java/org/apache/ > >>>> beam/runners/direct/SplittableProcessElementsEvaluatorFactory.java > >>>>>>>>>> , > >>>>>>>>>> using runner hooks introduced in this PR > >>>>>>>>>> <https://github.com/apache/beam/pull/1824>. At the core of > >>> the > >>>>>> hooks > >>>>>>> is > >>>>>>>>>> "ProcessFn" which is like a regular DoFn but has to be > >>> prepared > >>>> at > >>>>>>>>>> runtime > >>>>>>>>>> with some hooks (state, timers, and runner access to > >>>>>>>>>> RestrictionTracker) > >>>>>>>>>> before you invoke it. I added a convenience implementation > >> of > >>>> the > >>>>>> hook > >>>>>>>>>> mimicking behavior of UnboundedSource. > >>>>>>>>>> - The relevant runner-agnostic tests are in > >> SplittableDoFnTest > >>>>>>>>>> < > >>>>>>>>> > >>>>>>> > >>>>>> > >>>>> https://github.com/apache/beam/blob/cc28f0cb4c44169f933475ae29a325 > >>>> 99024d3a1f/sdks/java/core/src/test/java/org/apache/beam/sdk/ > >>>> transforms/SplittableDoFnTest.java > >>>>>>>>>> > >>>>>>>>>> . > >>>>>>>>>> > >>>>>>>>>> That's all it takes, really - the runner has to implement > >>> these > >>>>> two > >>>>>>>>>> transforms. When I looked at Spark and Flink runners, it was > >>> not > >>>>>> quite > >>>>>>>>>> clear to me how to implement the GBKIntoKeyedWorkItems > >>>> transform, > >>>>>> e.g. > >>>>>>>>>> Spark runner currently doesn't use KeyedWorkItem at all - > >> but > >>> it > >>>>>> seems > >>>>>>>>>> definitely possible. > >>>>>>>>>> > >>>>>>>>>> Thanks! > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> -- > >>>>>>>>> Data Artisans GmbH | Stresemannstr. 121A | 10963 Berlin > >>>>>>>>> > >>>>>>>>> [email protected] > >>>>>>>>> +49-(0)30-55599146 <+49%2030%2055599146> <+49%2030%2055599146> > >> <+49%2030%2055599146> > >>> <+49%2030%2055599146> <+49%2030%2055599146> > >>>>> <+49%2030%2055599146> > >>>>>>>>> > >>>>>>>>> Registered at Amtsgericht Charlottenburg - HRB 158244 B > >>>>>>>>> Managing Directors: Kostas Tzoumas, Stephan Ewen > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > > -- > Jean-Baptiste Onofré > [email protected] > http://blog.nanthrax.net > Talend - http://www.talend.com >
