Nice one Alex ! Thanks Regards JB
On 02/10/2018 23:19, Alex Van Boxel wrote: > Don't want to crash the tech discussion here, but... I just gave a > session at the Beam Summit about Splittable DoFn's as a users > perspective (from things I could gather from the documentation and > experimentation). Her is the slides deck, maybe it could be > useful: > https://docs.google.com/presentation/d/1dSc6oKh5pZItQPB_QiUyEoLT2TebMnj-pmdGipkVFPk/edit?usp=sharing > (quite > proud of the animations though ;-) > > _/ > _/ Alex Van Boxel > > > On Thu, Sep 27, 2018 at 12:04 AM Lukasz Cwik <[email protected] > <mailto:[email protected]>> wrote: > > Reuven, just inside the restriction tracker itself which is scoped > per executing SplittableDoFn. A user could incorrectly write the > synchronization since they are currently responsible for writing it > though. > > On Wed, Sep 26, 2018 at 2:51 PM Reuven Lax <[email protected] > <mailto:[email protected]>> wrote: > > is synchronization over an entire work item, or just inside > restriction tracker? my concern is that some runners (especially > streaming runners) might have hundreds or thousands of parallel > work items being processed for the same SDF (for different > keys), and I'm afraid of creating lock-contention bottlenecks. > > On Fri, Sep 21, 2018 at 3:42 PM Lukasz Cwik <[email protected] > <mailto:[email protected]>> wrote: > > The synchronization is related to Java thread safety since > there is likely to be concurrent access needed to a > restriction tracker to properly handle accessing the backlog > and splitting concurrently from when the users DoFn is > executing and updating the restriction tracker. This is > similar to the Java thread safety needed in BoundedSource > and UnboundedSource for fraction consumed, backlog bytes, > and splitting. > > On Fri, Sep 21, 2018 at 2:38 PM Reuven Lax <[email protected] > <mailto:[email protected]>> wrote: > > Can you give details on what the synchronization is per? > Is it per key, or global to each worker? > > On Fri, Sep 21, 2018 at 2:10 PM Lukasz Cwik > <[email protected] <mailto:[email protected]>> wrote: > > As I was looking at the SplittableDoFn API while > working towards making a proposal for how the > backlog/splitting API could look, I found some sharp > edges that could be improved. > > I noticed that: > 1) We require users to write thread safe code, this > is something that we haven't asked of users when > writing a DoFn. > 2) We "internal" methods within the > RestrictionTracker that are not meant to be used by > the runner. > > I can fix these issues by giving the user a > forwarding restriction tracker[1] that provides an > appropriate level of synchronization as needed and > also provides the necessary observation hooks to see > when a claim failed or succeeded. > > This requires a change to our experimental API since > we need to pass a RestrictionTracker to the > @ProcessElement method instead of a sub-type of > RestrictionTracker. > @ProcessElement > processElement(ProcessContext context, > OffsetRangeTracker tracker) { ... } > becomes: > @ProcessElement > processElement(ProcessContext context, > RestrictionTracker<OffsetRange, Long> tracker) { ... } > > This provides an additional benefit that it prevents > users from working around the RestrictionTracker > APIs and potentially making underlying changes to > the tracker outside of the tryClaim call. > > Full implementation is available within this PR[2] > and was wondering what people thought. > > 1: > https://github.com/apache/beam/pull/6467/files#diff-ed95abb6bc30a9ed07faef5c3fea93f0R72 > 2: https://github.com/apache/beam/pull/6467 > > > On Mon, Sep 17, 2018 at 12:45 PM Lukasz Cwik > <[email protected] <mailto:[email protected]>> wrote: > > The changes to the API have not been proposed > yet. So far it has all been about what is the > representation and why. > > For splitting, the current idea has been about > using the backlog as a way of telling the > SplittableDoFn where to split, so it would be in > terms of whatever the SDK decided to report. > The runner always chooses a number for backlog > that is relative to the SDKs reported backlog. > It would be upto the SDK to round/clamp the > number given by the Runner to represent > something meaningful for itself. > For example if the backlog that the SDK was > reporting was bytes remaining in a file such as > 500, then the Runner could provide some value > like 212.2 which the SDK would then round to 212. > If the backlog that the SDK was reporting 57 > pubsub messages, then the Runner could provide a > value like 300 which would mean to read 57 > values and then another 243 as part of the > current restriction. > > I believe that BoundedSource/UnboundedSource > will have wrappers added that provide a basic > SplittableDoFn implementation so existing IOs > should be migrated over without API changes. > > On Mon, Sep 17, 2018 at 1:11 AM Ismaël Mejía > <[email protected] <mailto:[email protected]>> > wrote: > > Thanks a lot Luke for bringing this back to > the mailing list and Ryan for taking > the notes. > > I would like to know if there was some > discussion, or if you guys have given > some thought to the required changes in the > SDK (API) part. What will be the > equivalent of `splitAtFraction` and what > should IO authors do to support it.. > > On Sat, Sep 15, 2018 at 1:52 AM Lukasz Cwik > <[email protected] <mailto:[email protected]>> > wrote: > > > > Thanks to everyone who joined and for the > questions asked. > > > > Ryan graciously collected notes of the > discussion: > > https://docs.google.com/document/d/1kjJLGIiNAGvDiUCMEtQbw8tyOXESvwGeGZLL-0M06fQ/edit?usp=sharing > > > > The summary was that bringing > BoundedSource/UnboundedSource into using a > unified backlog-reporting mechanism with > optional other signals that Dataflow has > found useful (such as is the remaining > restriction splittable (yes, no, unknown)). > Other runners can use or not. SDFs should > report backlog and watermark as minimum bar. > The backlog should use an arbitrary > precision float such as Java BigDecimal to > prevent issues where limited precision > removes the ability to compute delta > efficiently. > > > > > > > > On Wed, Sep 12, 2018 at 3:54 PM Lukasz > Cwik <[email protected] > <mailto:[email protected]>> wrote: > >> > >> Here is the link to join the discussion: > https://meet.google.com/idc-japs-hwf > >> Remember that it is this Friday Sept 14th > from 11am-noon PST. > >> > >> > >> > >> On Mon, Sep 10, 2018 at 7:30 AM > Maximilian Michels <[email protected] > <mailto:[email protected]>> wrote: > >>> > >>> Thanks for moving forward with this, Lukasz! > >>> > >>> Unfortunately, can't make it on Friday > but I'll sync with somebody on > >>> the call (e.g. Ryan) about your discussion. > >>> > >>> On 08.09.18 02:00, Lukasz Cwik wrote: > >>> > Thanks for everyone who wanted to fill > out the doodle poll. The most > >>> > popular time was Friday Sept 14th from > 11am-noon PST. I'll send out a > >>> > calendar invite and meeting link early > next week. > >>> > > >>> > I have received a lot of feedback on > the document and have addressed > >>> > some parts of it including: > >>> > * clarifying terminology > >>> > * processing skew due to some > restrictions having their watermarks much > >>> > further behind then others affecting > scheduling of bundles by runners > >>> > * external throttling & I/O wait > overhead reporting to make sure we > >>> > don't overscale > >>> > > >>> > Areas that still need additional > feedback and details are: > >>> > * reporting progress around the work > that is done and is active > >>> > * more examples > >>> > * unbounded restrictions being caused > by an unbounded number of splits > >>> > of existing unbounded restrictions > (infinite work growth) > >>> > * whether we should be reporting this > information at the PTransform > >>> > level or at the bundle level > >>> > > >>> > > >>> > > >>> > On Wed, Sep 5, 2018 at 1:53 PM Lukasz > Cwik <[email protected] <mailto:[email protected]> > >>> > <mailto:[email protected] > <mailto:[email protected]>>> wrote: > >>> > > >>> > Thanks to all those who have > provided interest in this topic by the > >>> > questions they have asked on the > doc already and for those > >>> > interested in having this > discussion. I have setup this doodle to > >>> > allow people to provide their > availability: > >>> > > https://doodle.com/poll/nrw7w84255xnfwqy > >>> > > >>> > I'll send out the chosen time > based upon peoples availability and a > >>> > Hangout link by end of day Friday > so please mark your availability > >>> > using the link above. > >>> > > >>> > The agenda of the meeting will be > as follows: > >>> > * Overview of the proposal > >>> > * Enumerate and discuss/answer > questions brought up in the meeting > >>> > > >>> > Note that all questions and any > discussions/answers provided will be > >>> > added to the doc for those who are > unable to attend. > >>> > > >>> > On Fri, Aug 31, 2018 at 9:47 AM > Jean-Baptiste Onofré > >>> > <[email protected] > <mailto:[email protected]> > <mailto:[email protected] > <mailto:[email protected]>>> wrote: > >>> > > >>> > +1 > >>> > > >>> > Regards > >>> > JB > >>> > Le 31 août 2018, à 18:22, > Lukasz Cwik <[email protected] > <mailto:[email protected]> > >>> > <mailto:[email protected] > <mailto:[email protected]>>> a écrit: > >>> > > >>> > That is possible, I'll > take people's date/time suggestions > >>> > and create a simple online > poll with them. > >>> > > >>> > On Fri, Aug 31, 2018 at > 2:22 AM Robert Bradshaw > >>> > <[email protected] > <mailto:[email protected]> > <mailto:[email protected] > <mailto:[email protected]>>> wrote: > >>> > > >>> > Thanks for taking this > up. I added some comments to the > >>> > doc. A > European-friendly time for discussion would > >>> > be great. > >>> > > >>> > On Fri, Aug 31, 2018 > at 3:14 AM Lukasz Cwik > >>> > <[email protected] > <mailto:[email protected]> > <mailto:[email protected] > <mailto:[email protected]>>> wrote: > >>> > > >>> > I came up with a > proposal[1] for a progress model > >>> > solely based off > of the backlog and that splits > >>> > should be based > upon the remaining backlog we want > >>> > the SDK to split > at. I also give recommendations to > >>> > runner authors as > to how an autoscaling system could > >>> > work based upon > the measured backlog. A lot of > >>> > discussions around > progress reporting and splitting > >>> > in the past has > always been around finding an > >>> > optimal solution, > after reading a lot of information > >>> > about work > stealing, I don't believe there is a > >>> > general solution > and it really is upto > >>> > SplittableDoFns to > be well behaved. I did not do > >>> > much work in > classifying what a well behaved > >>> > SplittableDoFn is > though. Much of this work builds > >>> > off ideas that > Eugene had documented in the past[2]. > >>> > > >>> > I could use the > communities wide knowledge of > >>> > different I/Os to > see if computing the backlog is > >>> > practical in the > way that I'm suggesting and to > >>> > gather people's > feedback. > >>> > > >>> > If there is a lot > of interest, I would like to hold > >>> > a community video > conference between Sept 10th and > >>> > 14th about this > topic. Please reply with your > >>> > availability by > Sept 6th if your interested. > >>> > > >>> > 1: > > https://s.apache.org/beam-bundles-backlog-splitting > >>> > 2: > https://s.apache.org/beam-breaking-fusion > >>> > > >>> > On Mon, Aug 13, > 2018 at 10:21 AM Jean-Baptiste > >>> > Onofré > <[email protected] <mailto:[email protected]> > <mailto:[email protected] > <mailto:[email protected]>>> wrote: > >>> > > >>> > Awesome ! > >>> > > >>> > Thanks Luke ! > >>> > > >>> > I plan to work > with you and others on this one. > >>> > > >>> > Regards > >>> > JB > >>> > Le 13 août > 2018, à 19:14, Lukasz Cwik > >>> > > <[email protected] <mailto:[email protected]> > <mailto:[email protected] > <mailto:[email protected]>>> a > >>> > écrit: > >>> > > >>> > I wanted > to reach out that I will be > >>> > continuing > from where Eugene left off with > >>> > > SplittableDoFn. I know that many of you have > >>> > done a > bunch of work with IOs and/or runner > >>> > > integration for SplittableDoFn and would > >>> > appreciate > your help in advancing this > >>> > awesome > idea. If you have questions or > >>> > things you > want to get reviewed related to > >>> > > SplittableDoFn, feel free to send them my > >>> > way or > include me on anything SplittableDoFn > >>> > related. > >>> > > >>> > I was part > of several discussions with > >>> > Eugene and > I think the biggest outstanding > >>> > design > portion is to figure out how dynamic > >>> > work > rebalancing would play out with the > >>> > > portability APIs. This includes reporting of > >>> > progress > from within a bundle. I know that > >>> > Eugene had > shared some documents in this > >>> > regard but > the position / split models > >>> > didn't > work too cleanly in a unified sense > >>> > for > bounded and unbounded SplittableDoFns. > >>> > It will > likely take me awhile to gather my > >>> > thoughts > but could use your expertise as to > >>> > how > compatible these ideas are with respect > >>> > to to IOs > and runners > >>> > > Flink/Spark/Dataflow/Samza/Apex/... and > >>> > obviously > help during implementation. > >>> > > -- Jean-Baptiste Onofré [email protected] http://blog.nanthrax.net Talend - http://www.talend.com
