Thanks to everyone who filled out the Doodle poll. The most popular time was Friday, Sept 14th from 11am-noon PST. I'll send out a calendar invite and meeting link early next week.

I have received a lot of feedback on the document and have addressed some parts of it, including:
* clarifying terminology
* processing skew due to some restrictions having their watermarks much further behind than others, affecting scheduling of bundles by runners
* external throttling & I/O wait overhead reporting to make sure we don't overscale

Areas that still need additional feedback and details are:
* reporting progress around the work that is done and is active
* more examples
* unbounded restrictions being caused by an unbounded number of splits of existing unbounded restrictions (infinite work growth)
* whether we should be reporting this information at the PTransform level or at the bundle level

On Wed, Sep 5, 2018 at 1:53 PM Lukasz Cwik wrote:
> Thanks to all those who have shown interest in this topic through the
> questions they have asked on the doc already, and to those interested in
> having this discussion. I have set up this Doodle to allow people to provide
> their availability:
> https://doodle.com/poll/nrw7w84255xnfwqy
>
> I'll send out the chosen time based upon people's availability and a
> Hangout link by end of day Friday, so please mark your availability using
> the link above.
>
> The agenda of the meeting will be as follows:
> * Overview of the proposal
> * Enumerate and discuss/answer questions brought up in the meeting
>
> Note that all questions and any discussions/answers provided will be added
> to the doc for those who are unable to attend.
>
> On Fri, Aug 31, 2018 at 9:47 AM Jean-Baptiste Onofré wrote:
>
>> +1
>>
>> Regards
>> JB
>> On Aug 31, 2018, at 18:22, Lukasz Cwik wrote:
>>>
>>> That is possible. I'll take people's date/time suggestions and create a
>>> simple online poll with them.
>>>
>>> On Fri, Aug 31, 2018 at 2:22 AM Robert Bradshaw wrote:
>>>
>>>> Thanks for taking this up. I added some comments to the doc. A
>>>> European-friendly time for discussion would be great.
>>>>
>>>> On Fri, Aug 31, 2018 at 3:14 AM Lukasz Cwik wrote:
>>>>
>>>>> I came up with a proposal[1] for a progress model based solely on
>>>>> the backlog, in which splits are based upon the remaining backlog we
>>>>> want the SDK to split at. I also give recommendations to runner authors as
>>>>> to how an autoscaling system could work based upon the measured backlog.
>>>>> Many past discussions around progress reporting and splitting have
>>>>> been about finding an optimal solution. After reading a lot of
>>>>> information about work stealing, I don't believe there is a general
>>>>> solution, and it really is up to SplittableDoFns to be well behaved. I did
>>>>> not do much work in classifying what a well-behaved SplittableDoFn is,
>>>>> though. Much of this work builds off ideas that Eugene had documented in
>>>>> the past[2].
>>>>>
>>>>> I could use the community's wide knowledge of different I/Os to see if
>>>>> computing the backlog is practical in the way that I'm suggesting and to
>>>>> gather people's feedback.
>>>>>
>>>>> If there is a lot of interest, I would like to hold a community video
>>>>> conference between Sept 10th and 14th about this topic. Please reply with
>>>>> your availability by Sept 6th if you're interested.
>>>>>
>>>>> 1: https://s.apache.org/beam-bundles-backlog-splitting
>>>>> 2: https://s.apache.org/beam-breaking-fusion
>>>>>
>>>>> On Mon, Aug 13, 2018 at 10:21 AM Jean-Baptiste Onofré wrote:
>>>>>
>>>>>> Awesome!
>>>>>>
>>>>>> Thanks Luke!
>>>>>>
>>>>>> I plan to work with you and others on this one.
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>> On Aug 13, 2018, at 19:14, Lukasz Cwik wrote:
>>>>>>>
>>>>>>> I wanted to reach out to say that I will be continuing from where
>>>>>>> Eugene left off with SplittableDoFn. I know that many of you have done
>>>>>>> a bunch of work with IOs and/or runner integration for SplittableDoFn,
>>>>>>> and I would appreciate your help in advancing this awesome idea. If you
>>>>>>> have questions or things you want to get reviewed related to
>>>>>>> SplittableDoFn, feel free to send them my way or include me on anything
>>>>>>> SplittableDoFn related.
>>>>>>>
>>>>>>> I was part of several discussions with Eugene, and I think the
>>>>>>> biggest outstanding design portion is to figure out how dynamic work
>>>>>>> rebalancing would play out with the portability APIs. This includes
>>>>>>> reporting of progress from within a bundle. I know that Eugene had
>>>>>>> shared some documents in this regard, but the position / split models
>>>>>>> didn't work too cleanly in a unified sense for bounded and unbounded
>>>>>>> SplittableDoFns. It will likely take me a while to gather my thoughts,
>>>>>>> but I could use your expertise as to how compatible these ideas are
>>>>>>> with respect to IOs and runners (Flink/Spark/Dataflow/Samza/Apex/...),
>>>>>>> and obviously help during implementation.
>>>>>>>
