Yes, but we need at least Mongo 4.0 to make it production ready. I wouldn't let anyone work with anything less because you can't checkpoint). I'm waiting till our test cluster is 4.0 to continue on this.
_/ _/ Alex Van Boxel On Wed, Oct 3, 2018 at 9:43 AM Ismaël Mejía <[email protected]> wrote: > Hello Axel, Thanks for sharing, really interesting quest story, we > really need more like this (kudos for the animations too). > Are you planning to contribute the continous SDF based version of the > mongo connector into Beam upstream (once ready)? > > > On Wed, Oct 3, 2018 at 7:07 AM Jean-Baptiste Onofré <[email protected]> > wrote: > > > > Nice one Alex ! > > > > Thanks > > Regards > > JB > > > > On 02/10/2018 23:19, Alex Van Boxel wrote: > > > Don't want to crash the tech discussion here, but... I just gave a > > > session at the Beam Summit about Splittable DoFn's as a users > > > perspective (from things I could gather from the documentation and > > > experimentation). Her is the slides deck, maybe it could be > > > useful: > https://docs.google.com/presentation/d/1dSc6oKh5pZItQPB_QiUyEoLT2TebMnj-pmdGipkVFPk/edit?usp=sharing > (quite > > > proud of the animations though ;-) > > > > > > _/ > > > _/ Alex Van Boxel > > > > > > > > > On Thu, Sep 27, 2018 at 12:04 AM Lukasz Cwik <[email protected] > > > <mailto:[email protected]>> wrote: > > > > > > Reuven, just inside the restriction tracker itself which is scoped > > > per executing SplittableDoFn. A user could incorrectly write the > > > synchronization since they are currently responsible for writing it > > > though. > > > > > > On Wed, Sep 26, 2018 at 2:51 PM Reuven Lax <[email protected] > > > <mailto:[email protected]>> wrote: > > > > > > is synchronization over an entire work item, or just inside > > > restriction tracker? my concern is that some runners > (especially > > > streaming runners) might have hundreds or thousands of parallel > > > work items being processed for the same SDF (for different > > > keys), and I'm afraid of creating lock-contention bottlenecks. > > > > > > On Fri, Sep 21, 2018 at 3:42 PM Lukasz Cwik <[email protected] > > > <mailto:[email protected]>> wrote: > > > > > > The synchronization is related to Java thread safety since > > > there is likely to be concurrent access needed to a > > > restriction tracker to properly handle accessing the > backlog > > > and splitting concurrently from when the users DoFn is > > > executing and updating the restriction tracker. This is > > > similar to the Java thread safety needed in BoundedSource > > > and UnboundedSource for fraction consumed, backlog bytes, > > > and splitting. > > > > > > On Fri, Sep 21, 2018 at 2:38 PM Reuven Lax < > [email protected] > > > <mailto:[email protected]>> wrote: > > > > > > Can you give details on what the synchronization is > per? > > > Is it per key, or global to each worker? > > > > > > On Fri, Sep 21, 2018 at 2:10 PM Lukasz Cwik > > > <[email protected] <mailto:[email protected]>> wrote: > > > > > > As I was looking at the SplittableDoFn API while > > > working towards making a proposal for how the > > > backlog/splitting API could look, I found some > sharp > > > edges that could be improved. > > > > > > I noticed that: > > > 1) We require users to write thread safe code, this > > > is something that we haven't asked of users when > > > writing a DoFn. > > > 2) We "internal" methods within the > > > RestrictionTracker that are not meant to be used by > > > the runner. > > > > > > I can fix these issues by giving the user a > > > forwarding restriction tracker[1] that provides an > > > appropriate level of synchronization as needed and > > > also provides the necessary observation hooks to > see > > > when a claim failed or succeeded. > > > > > > This requires a change to our experimental API > since > > > we need to pass a RestrictionTracker to the > > > @ProcessElement method instead of a sub-type of > > > RestrictionTracker. > > > @ProcessElement > > > processElement(ProcessContext context, > > > OffsetRangeTracker tracker) { ... } > > > becomes: > > > @ProcessElement > > > processElement(ProcessContext context, > > > RestrictionTracker<OffsetRange, Long> tracker) { > ... } > > > > > > This provides an additional benefit that it > prevents > > > users from working around the RestrictionTracker > > > APIs and potentially making underlying changes to > > > the tracker outside of the tryClaim call. > > > > > > Full implementation is available within this PR[2] > > > and was wondering what people thought. > > > > > > 1: > https://github.com/apache/beam/pull/6467/files#diff-ed95abb6bc30a9ed07faef5c3fea93f0R72 > > > 2: https://github.com/apache/beam/pull/6467 > > > > > > > > > On Mon, Sep 17, 2018 at 12:45 PM Lukasz Cwik > > > <[email protected] <mailto:[email protected]>> > wrote: > > > > > > The changes to the API have not been proposed > > > yet. So far it has all been about what is the > > > representation and why. > > > > > > For splitting, the current idea has been about > > > using the backlog as a way of telling the > > > SplittableDoFn where to split, so it would be > in > > > terms of whatever the SDK decided to report. > > > The runner always chooses a number for backlog > > > that is relative to the SDKs reported backlog. > > > It would be upto the SDK to round/clamp the > > > number given by the Runner to represent > > > something meaningful for itself. > > > For example if the backlog that the SDK was > > > reporting was bytes remaining in a file such as > > > 500, then the Runner could provide some value > > > like 212.2 which the SDK would then round to > 212. > > > If the backlog that the SDK was reporting 57 > > > pubsub messages, then the Runner could provide > a > > > value like 300 which would mean to read 57 > > > values and then another 243 as part of the > > > current restriction. > > > > > > I believe that BoundedSource/UnboundedSource > > > will have wrappers added that provide a basic > > > SplittableDoFn implementation so existing IOs > > > should be migrated over without API changes. > > > > > > On Mon, Sep 17, 2018 at 1:11 AM Ismaël Mejía > > > <[email protected] <mailto:[email protected]>> > > > wrote: > > > > > > Thanks a lot Luke for bringing this back to > > > the mailing list and Ryan for taking > > > the notes. > > > > > > I would like to know if there was some > > > discussion, or if you guys have given > > > some thought to the required changes in the > > > SDK (API) part. What will be the > > > equivalent of `splitAtFraction` and what > > > should IO authors do to support it.. > > > > > > On Sat, Sep 15, 2018 at 1:52 AM Lukasz Cwik > > > <[email protected] <mailto:[email protected] > >> > > > wrote: > > > > > > > > Thanks to everyone who joined and for the > > > questions asked. > > > > > > > > Ryan graciously collected notes of the > > > discussion: > > > > https://docs.google.com/document/d/1kjJLGIiNAGvDiUCMEtQbw8tyOXESvwGeGZLL-0M06fQ/edit?usp=sharing > > > > > > > > The summary was that bringing > > > BoundedSource/UnboundedSource into using a > > > unified backlog-reporting mechanism with > > > optional other signals that Dataflow has > > > found useful (such as is the remaining > > > restriction splittable (yes, no, unknown)). > > > Other runners can use or not. SDFs should > > > report backlog and watermark as minimum > bar. > > > The backlog should use an arbitrary > > > precision float such as Java BigDecimal to > > > prevent issues where limited precision > > > removes the ability to compute delta > > > efficiently. > > > > > > > > > > > > > > > > On Wed, Sep 12, 2018 at 3:54 PM Lukasz > > > Cwik <[email protected] > > > <mailto:[email protected]>> wrote: > > > >> > > > >> Here is the link to join the discussion: > > > https://meet.google.com/idc-japs-hwf > > > >> Remember that it is this Friday Sept > 14th > > > from 11am-noon PST. > > > >> > > > >> > > > >> > > > >> On Mon, Sep 10, 2018 at 7:30 AM > > > Maximilian Michels <[email protected] > > > <mailto:[email protected]>> wrote: > > > >>> > > > >>> Thanks for moving forward with this, > Lukasz! > > > >>> > > > >>> Unfortunately, can't make it on Friday > > > but I'll sync with somebody on > > > >>> the call (e.g. Ryan) about your > discussion. > > > >>> > > > >>> On 08.09.18 02:00, Lukasz Cwik wrote: > > > >>> > Thanks for everyone who wanted to > fill > > > out the doodle poll. The most > > > >>> > popular time was Friday Sept 14th > from > > > 11am-noon PST. I'll send out a > > > >>> > calendar invite and meeting link > early > > > next week. > > > >>> > > > > >>> > I have received a lot of feedback on > > > the document and have addressed > > > >>> > some parts of it including: > > > >>> > * clarifying terminology > > > >>> > * processing skew due to some > > > restrictions having their watermarks much > > > >>> > further behind then others affecting > > > scheduling of bundles by runners > > > >>> > * external throttling & I/O wait > > > overhead reporting to make sure we > > > >>> > don't overscale > > > >>> > > > > >>> > Areas that still need additional > > > feedback and details are: > > > >>> > * reporting progress around the work > > > that is done and is active > > > >>> > * more examples > > > >>> > * unbounded restrictions being caused > > > by an unbounded number of splits > > > >>> > of existing unbounded restrictions > > > (infinite work growth) > > > >>> > * whether we should be reporting this > > > information at the PTransform > > > >>> > level or at the bundle level > > > >>> > > > > >>> > > > > >>> > > > > >>> > On Wed, Sep 5, 2018 at 1:53 PM Lukasz > > > Cwik <[email protected] <mailto: > [email protected]> > > > >>> > <mailto:[email protected] > > > <mailto:[email protected]>>> wrote: > > > >>> > > > > >>> > Thanks to all those who have > > > provided interest in this topic by the > > > >>> > questions they have asked on the > > > doc already and for those > > > >>> > interested in having this > > > discussion. I have setup this doodle to > > > >>> > allow people to provide their > > > availability: > > > >>> > > > > https://doodle.com/poll/nrw7w84255xnfwqy > > > >>> > > > > >>> > I'll send out the chosen time > > > based upon peoples availability and a > > > >>> > Hangout link by end of day Friday > > > so please mark your availability > > > >>> > using the link above. > > > >>> > > > > >>> > The agenda of the meeting will be > > > as follows: > > > >>> > * Overview of the proposal > > > >>> > * Enumerate and discuss/answer > > > questions brought up in the meeting > > > >>> > > > > >>> > Note that all questions and any > > > discussions/answers provided will be > > > >>> > added to the doc for those who > are > > > unable to attend. > > > >>> > > > > >>> > On Fri, Aug 31, 2018 at 9:47 AM > > > Jean-Baptiste Onofré > > > >>> > <[email protected] > > > <mailto:[email protected]> > > > <mailto:[email protected] > > > <mailto:[email protected]>>> wrote: > > > >>> > > > > >>> > +1 > > > >>> > > > > >>> > Regards > > > >>> > JB > > > >>> > Le 31 août 2018, à 18:22, > > > Lukasz Cwik <[email protected] > > > <mailto:[email protected]> > > > >>> > <mailto:[email protected] > > > <mailto:[email protected]>>> a écrit: > > > >>> > > > > >>> > That is possible, I'll > > > take people's date/time suggestions > > > >>> > and create a simple > online > > > poll with them. > > > >>> > > > > >>> > On Fri, Aug 31, 2018 at > > > 2:22 AM Robert Bradshaw > > > >>> > <[email protected] > > > <mailto:[email protected]> > > > <mailto:[email protected] > > > <mailto:[email protected]>>> wrote: > > > >>> > > > > >>> > Thanks for taking > this > > > up. I added some comments to the > > > >>> > doc. A > > > European-friendly time for discussion would > > > >>> > be great. > > > >>> > > > > >>> > On Fri, Aug 31, 2018 > > > at 3:14 AM Lukasz Cwik > > > >>> > <[email protected] > > > <mailto:[email protected]> > > > <mailto:[email protected] > > > <mailto:[email protected]>>> wrote: > > > >>> > > > > >>> > I came up with a > > > proposal[1] for a progress model > > > >>> > solely based off > > > of the backlog and that splits > > > >>> > should be based > > > upon the remaining backlog we want > > > >>> > the SDK to split > > > at. I also give recommendations to > > > >>> > runner authors as > > > to how an autoscaling system could > > > >>> > work based upon > > > the measured backlog. A lot of > > > >>> > discussions > around > > > progress reporting and splitting > > > >>> > in the past has > > > always been around finding an > > > >>> > optimal solution, > > > after reading a lot of information > > > >>> > about work > > > stealing, I don't believe there is a > > > >>> > general solution > > > and it really is upto > > > >>> > SplittableDoFns > to > > > be well behaved. I did not do > > > >>> > much work in > > > classifying what a well behaved > > > >>> > SplittableDoFn is > > > though. Much of this work builds > > > >>> > off ideas that > > > Eugene had documented in the past[2]. > > > >>> > > > > >>> > I could use the > > > communities wide knowledge of > > > >>> > different I/Os to > > > see if computing the backlog is > > > >>> > practical in the > > > way that I'm suggesting and to > > > >>> > gather people's > > > feedback. > > > >>> > > > > >>> > If there is a lot > > > of interest, I would like to hold > > > >>> > a community video > > > conference between Sept 10th and > > > >>> > 14th about this > > > topic. Please reply with your > > > >>> > availability by > > > Sept 6th if your interested. > > > >>> > > > > >>> > 1: > > > > https://s.apache.org/beam-bundles-backlog-splitting > > > >>> > 2: > > > https://s.apache.org/beam-breaking-fusion > > > >>> > > > > >>> > On Mon, Aug 13, > > > 2018 at 10:21 AM Jean-Baptiste > > > >>> > Onofré > > > <[email protected] <mailto:[email protected]> > > > <mailto:[email protected] > > > <mailto:[email protected]>>> wrote: > > > >>> > > > > >>> > Awesome ! > > > >>> > > > > >>> > Thanks Luke ! > > > >>> > > > > >>> > I plan to > work > > > with you and others on this one. > > > >>> > > > > >>> > Regards > > > >>> > JB > > > >>> > Le 13 août > > > 2018, à 19:14, Lukasz Cwik > > > >>> > > > > <[email protected] <mailto: > [email protected]> > > > <mailto:[email protected] > > > <mailto:[email protected]>>> a > > > >>> > écrit: > > > >>> > > > > >>> > I wanted > > > to reach out that I will be > > > >>> > > continuing > > > from where Eugene left off with > > > >>> > > > > SplittableDoFn. I know that many of you > have > > > >>> > done a > > > bunch of work with IOs and/or runner > > > >>> > > > > integration for SplittableDoFn and would > > > >>> > > appreciate > > > your help in advancing this > > > >>> > awesome > > > idea. If you have questions or > > > >>> > things > you > > > want to get reviewed related to > > > >>> > > > > SplittableDoFn, feel free to send them my > > > >>> > way or > > > include me on anything SplittableDoFn > > > >>> > related. > > > >>> > > > > >>> > I was > part > > > of several discussions with > > > >>> > Eugene > and > > > I think the biggest outstanding > > > >>> > design > > > portion is to figure out how dynamic > > > >>> > work > > > rebalancing would play out with the > > > >>> > > > > portability APIs. This includes reporting > of > > > >>> > progress > > > from within a bundle. I know that > > > >>> > Eugene > had > > > shared some documents in this > > > >>> > regard > but > > > the position / split models > > > >>> > didn't > > > work too cleanly in a unified sense > > > >>> > for > > > bounded and unbounded SplittableDoFns. > > > >>> > It will > > > likely take me awhile to gather my > > > >>> > thoughts > > > but could use your expertise as to > > > >>> > how > > > compatible these ideas are with respect > > > >>> > to to IOs > > > and runners > > > >>> > > > > Flink/Spark/Dataflow/Samza/Apex/... and > > > >>> > obviously > > > help during implementation. > > > >>> > > > > > > > > -- > > Jean-Baptiste Onofré > > [email protected] > > http://blog.nanthrax.net > > Talend - http://www.talend.com >
