On Tue, 11 Aug 2020 at 11:07, Julien Pivotto <[email protected]> wrote:
> On 11 Aug 11:05, Brian Brazil wrote: > > On Tue, 11 Aug 2020 at 04:09, Callum Styan <[email protected]> > wrote: > > > > > I'm hesitant to add anything that significantly increases the network > > > bandwidth usage or remote write while at the same time not giving > users a > > > way to tune the usage to their needs. > > > > > > I agree with Brian that we don't want the protocol itself to become > > > stateful by introducing something like negotiation. I'd also prefer > not to > > > introduce multiple ways to do things, though I'm hoping we can find a > way > > > to accommodate your use case while not ballooning average users network > > > egress bill. > > > > > > I am fine with forcing the consuming end to be somewhat stateful like > in > > > the case of Josh's PR where all metadata is sent periodically and must > be > > > stored by the remote storage system. > > > > > > > > > > > > Overall I'd like to see some numbers regarding current network > bandwidth > > > of remote write, remote write with metadata via Josh's PR, and remote > write > > > with sending metadata for every series in a remote write payload. > > > > > > > I agree, I noticed that in Rob's PR and had the same thought. > > Remote bandwidth are likely to affect only people using remote write. > > Getting a view on the on-disk size of the WAL would be great too, as > that will affect everyone. > I'm not worried about that, it's only really on series creation so won't be noticed unless you have really high levels of churn. Brian > > > > > Brian > > > > > > > > > > Rob, I'll review your PR tomorrow but it looks like Julien and Brian > may > > > already have that covered. > > > > > > On Sun, Aug 9, 2020 at 9:36 PM Rob Skillington <[email protected]> > > > wrote: > > > > > >> Update: The PR now sends the fields over remote write from the WAL and > > >> metadata > > >> is also updated in the WAL when any field changes. > > >> > > >> Now opened the PR against the primary repo: > > >> https://github.com/prometheus/prometheus/pull/7771 > > >> > > >> I have tested this end-to-end with a modified M3 branch: > > >> https://github.com/m3db/m3/compare/r/test-prometheus-metadata > > >> > {... "msg":"received > > >> series","labels":"{__name__="prometheus_rule_group_... > > >> > > iterations_total",instance="localhost:9090",job="prometheus01",role=... > > >> > "remote"}","type":"counter","unit":"","help":"The total number of > > >> scheduled... > > >> > rule group evaluations, whether executed or missed."} > > >> > > >> Tests still haven't been updated. Please any feedback on the approach > / > > >> data structures would be greatly appreciated. > > >> > > >> Would be good to know what others thoughts are on next steps. > > >> > > >> On Sat, Aug 8, 2020 at 11:21 AM Rob Skillington <[email protected]> > > >> wrote: > > >> > > >>> Here's a draft PR that builds that propagates metadata to the WAL and > > >>> the WAL > > >>> reader can read it back: > > >>> https://github.com/robskillington/prometheus/pull/1/files > > >>> > > >>> Would like a little bit of feedback before on the datatypes and > > >>> structure going > > >>> further if folks are open to that. > > >>> > > >>> There's a few things not happening: > > >>> - Remote write queue manager does not use or send these extra fields > yet. > > >>> - Head does not reset the "metadata" slice (not sure where "series" > > >>> slice is > > >>> reset in the head for pending series writes to WAL, want to do in > same > > >>> place). > > >>> - Metadata is not re-written on change yet. > > >>> - Tests. > > >>> > > >>> > > >>> On Sat, Aug 8, 2020 at 9:37 AM Rob Skillington <[email protected]> > > >>> wrote: > > >>> > > >>>> Sounds good, I've updated the proposal with details on places in > which > > >>>> changes > > >>>> are required given the new approach: > > >>>> > > >>>> > https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit# > > >>>> > > >>>> > > >>>> On Fri, Aug 7, 2020 at 2:09 PM Brian Brazil < > > >>>> [email protected]> wrote: > > >>>> > > >>>>> On Fri, 7 Aug 2020 at 15:48, Rob Skillington <[email protected]> > > >>>>> wrote: > > >>>>> > > >>>>>> True - I mean this could also be a blacklist by config perhaps, > so if > > >>>>>> you > > >>>>>> really don't want to have increased egress you can optionally turn > > >>>>>> off sending > > >>>>>> the TYPE, HELP, UNIT or send them at different frequencies via > > >>>>>> config. We could > > >>>>>> package some sensible defaults so folks don't need to update their > > >>>>>> config. > > >>>>>> > > >>>>>> The main intention is to enable these added features and make it > > >>>>>> possible for > > >>>>>> various consumers to be able to adjust some of these parameters if > > >>>>>> required > > >>>>>> since backends can be so different in their implementation. For > M3 I > > >>>>>> would be > > >>>>>> totally fine with the extra egress that should be mitigated fairly > > >>>>>> considerably > > >>>>>> by Snappy and the fact that HELP is common across certain metric > > >>>>>> families and > > >>>>>> receiving it every single Remote Write request. > > >>>>>> > > >>>>> > > >>>>> That's really a micro-optimisation. If you are that worried about > > >>>>> bandwidth you'd run a sidecar specific to your remote backend that > was > > >>>>> stateful and far more efficient overall. Sending the full label > names and > > >>>>> values on every request is going to be far more than the overhead > of > > >>>>> metadata on top of that, so I don't see a need as it stands for > any of this > > >>>>> to be configurable. > > >>>>> > > >>>>> Brian > > >>>>> > > >>>>> > > >>>>>> > > >>>>>> On Fri, Aug 7, 2020 at 3:56 AM Brian Brazil < > > >>>>>> [email protected]> wrote: > > >>>>>> > > >>>>>>> On Thu, 6 Aug 2020 at 22:58, Rob Skillington < > [email protected]> > > >>>>>>> wrote: > > >>>>>>> > > >>>>>>>> Hey Björn, > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> Thanks for the detailed response. I've had a few back and > forths on > > >>>>>>>> this with > > >>>>>>>> Brian and Chris over IRC and CNCF Slack now too. > > >>>>>>>> > > >>>>>>>> I agree that fundamentally it seems naive to idealistically > model > > >>>>>>>> this around > > >>>>>>>> per metric name. It needs to be per series given what may happen > > >>>>>>>> w.r.t. > > >>>>>>>> collision across targets, etc. > > >>>>>>>> > > >>>>>>>> Perhaps we can separate these discussions apart into two > > >>>>>>>> considerations: > > >>>>>>>> > > >>>>>>>> 1) Modeling of the data such that it is kept around for > > >>>>>>>> transmission (primarily > > >>>>>>>> we're focused on WAL here). > > >>>>>>>> > > >>>>>>>> 2) Transmission (and of which you allude to has many areas for > > >>>>>>>> improvement). > > >>>>>>>> > > >>>>>>>> For (1) - it seems like this needs to be done per time series, > > >>>>>>>> thankfully we > > >>>>>>>> actually already have modeled this to be stored per series data > > >>>>>>>> just once in a > > >>>>>>>> single WAL file. I will write up my proposal here, but it will > > >>>>>>>> surmount to > > >>>>>>>> essentially encoding the HELP, UNIT and TYPE to the WAL per > series > > >>>>>>>> similar to > > >>>>>>>> how labels for a series are encoded once per series in the WAL. > > >>>>>>>> Since this > > >>>>>>>> optimization is in place, there's already a huge dampening > effect > > >>>>>>>> on how > > >>>>>>>> expensive it is to write out data about a series (e.g. labels). > We > > >>>>>>>> can always > > >>>>>>>> go and collect a sample WAL file and measure how much extra size > > >>>>>>>> with/without > > >>>>>>>> HELP, UNIT and TYPE this would add, but it seems like it won't > > >>>>>>>> fundamentally > > >>>>>>>> change the order of magnitude in terms of "information about a > > >>>>>>>> timeseries > > >>>>>>>> storage size" vs "datapoints about a timeseries storage size". > One > > >>>>>>>> extra change > > >>>>>>>> would be re-encoding the series into the WAL if the HELP changed > > >>>>>>>> for that > > >>>>>>>> series, just so that when HELP does change it can be up to date > > >>>>>>>> from the view > > >>>>>>>> of whoever is reading the WAL (i.e. the Remote Write loop). > Since > > >>>>>>>> this entry > > >>>>>>>> needs to be loaded into memory for Remote Write today anyway, > with > > >>>>>>>> string > > >>>>>>>> interning as suggested by Chris, it won't change the memory > profile > > >>>>>>>> algorithmically of a Prometheus with Remote Write enabled. There > > >>>>>>>> will be some > > >>>>>>>> overhead that at most would likely be similar to the label data, > > >>>>>>>> but we aren't > > >>>>>>>> altering data structures (so won't change big-O magnitude of > memory > > >>>>>>>> being used), > > >>>>>>>> we're adding fields to existing data structures that exist and > > >>>>>>>> string interning > > >>>>>>>> should actually make it much less onerous since there is a large > > >>>>>>>> duplicative > > >>>>>>>> effect with HELP among time series. > > >>>>>>>> > > >>>>>>>> For (2) - now we have basically TYPE, HELP and UNIT all > available > > >>>>>>>> for > > >>>>>>>> transmission if we wanted to send it with every single > datapoint. > > >>>>>>>> While I think > > >>>>>>>> we should definitely examine HPACK like compression features as > you > > >>>>>>>> mentioned > > >>>>>>>> Björn, I think we should think more about separating that kind > of > > >>>>>>>> work into a > > >>>>>>>> Milestone 2 where this is considered. > > >>>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>>> For the time being it's very plausible > > >>>>>>>> we could do some negotiation of the receiving Remote Write > endpoint > > >>>>>>>> by sending > > >>>>>>>> a "GET" to the remote write endpoint and seeing if it responds > with > > >>>>>>>> a > > >>>>>>>> "capabilities + preferences" response, and if the endpoint > > >>>>>>>> specifies that it > > >>>>>>>> would like to receive metadata all the time on every single > request > > >>>>>>>> and let > > >>>>>>>> Snappy take care of keeping size not ballooning too much, or if > it > > >>>>>>>> would like > > >>>>>>>> TYPE on every single datapoint, and HELP and UNIT every > > >>>>>>>> DESIRED_SECONDS or so. > > >>>>>>>> To enable a "send HELP every 10 minutes" feature we would have > to > > >>>>>>>> add to the > > >>>>>>>> datastructure that holds the LABELS, TYPE, HELP and UNIT for > each > > >>>>>>>> series a > > >>>>>>>> "last sent" timestamp to know when to resend to that backend, > but > > >>>>>>>> that seems > > >>>>>>>> entirely plausible and would not use more than 4 extra bytes. > > >>>>>>>> > > >>>>>>> > > >>>>>>> Negotiation is fundamentally stateful, as the process that > receives > > >>>>>>> the first request may be a very different one from the one that > receives > > >>>>>>> the second - such as if an upgrade is in progress. Remote write > is intended > > >>>>>>> to be a very simple thing that's easy to implement on the > receiver end and > > >>>>>>> is a send-only request-based protocol, so request-time > negotiation is > > >>>>>>> basically out. Any negotiation needs to happen via the config > file, and > > >>>>>>> even then it'd be better if nothing ever needed to be > configured. Getting > > >>>>>>> all the users of a remote write to change their config file or > restart all > > >>>>>>> their Prometheus servers is not an easy task after all. > > >>>>>>> > > >>>>>>> Brian > > >>>>>>> > > >>>>>>> > > >>>>>>>> > > >>>>>>>> These thoughts are based on the discussion I've had and the > > >>>>>>>> thoughts on this > > >>>>>>>> thread. What's the feedback on this before I go ahead and > > >>>>>>>> re-iterate the design > > >>>>>>>> to more closely map to what I'm suggesting here? > > >>>>>>>> > > >>>>>>>> Best, > > >>>>>>>> Rob > > >>>>>>>> > > >>>>>>>> On Thu, Aug 6, 2020 at 2:01 PM Bjoern Rabenstein < > > >>>>>>>> [email protected]> wrote: > > >>>>>>>> > > >>>>>>>>> On 03.08.20 03:04, Rob Skillington wrote: > > >>>>>>>>> > Ok - I have a proposal which could be broken up into two > pieces, > > >>>>>>>>> first > > >>>>>>>>> > delivering TYPE per datapoint, the second consistently and > > >>>>>>>>> reliably HELP and > > >>>>>>>>> > UNIT once per unique metric name: > > >>>>>>>>> > > > >>>>>>>>> > https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo > > >>>>>>>>> > /edit#heading=h.bik9uwphqy3g > > >>>>>>>>> > > >>>>>>>>> Thanks for the doc. I have commented on it, but while doing > so, I > > >>>>>>>>> felt > > >>>>>>>>> the urge to comment more generally, which would not fit well > into > > >>>>>>>>> the > > >>>>>>>>> margin of a Google doc. My thoughts are also a bit out of > scope of > > >>>>>>>>> Rob's design doc and more about the general topic of remote > write > > >>>>>>>>> and > > >>>>>>>>> the equally general topic of metadata (about which we have an > > >>>>>>>>> ongoing > > >>>>>>>>> discussion among the Prometheus developers). > > >>>>>>>>> > > >>>>>>>>> Disclaimer: I don't know the remote-write protocol very well. > My > > >>>>>>>>> hope > > >>>>>>>>> here is that my somewhat distant perspective is of some value > as it > > >>>>>>>>> allows to take a step back. However, I might just miss crucial > > >>>>>>>>> details > > >>>>>>>>> that completely invalidate my thoughts. We'll see... > > >>>>>>>>> > > >>>>>>>>> I do care a lot about metadata, though. (And ironically, the > reason > > >>>>>>>>> why I declared remote write "somebody else's problem" is that > I've > > >>>>>>>>> always disliked how it fundamentally ignores metadata.) > > >>>>>>>>> > > >>>>>>>>> Rob's document embraces the fact that metadata can change over > > >>>>>>>>> time, > > >>>>>>>>> but it assumes that at any given time, there is only one set of > > >>>>>>>>> metadata per unique metric name. It takes into account that > there > > >>>>>>>>> can > > >>>>>>>>> be drift, but it considers them an irregularity that will only > > >>>>>>>>> happen > > >>>>>>>>> occasionally and iron out over time. > > >>>>>>>>> > > >>>>>>>>> In practice, however, metadata can be legitimately and > deliberately > > >>>>>>>>> different for different time series of the same name. > > >>>>>>>>> Instrumentation > > >>>>>>>>> libraries and even the exposition format inherently require one > > >>>>>>>>> set of > > >>>>>>>>> metadata per metric name, but this is all only enforced (and > meant > > >>>>>>>>> to > > >>>>>>>>> be enforced) _per target_. Once the samples are ingested (or > even > > >>>>>>>>> sent > > >>>>>>>>> onwards via remote write), they have no notion of what target > they > > >>>>>>>>> came from. Furthermore, samples created by rule evaluation > don't > > >>>>>>>>> have > > >>>>>>>>> an originating target in the first place. (Which raises the > > >>>>>>>>> question > > >>>>>>>>> of metadata for recording rules, which is another can of worms > I'd > > >>>>>>>>> like to open eventually...) > > >>>>>>>>> > > >>>>>>>>> (There is also the technical difficulty that the WAL has no > notion > > >>>>>>>>> of > > >>>>>>>>> bundling or referencing all the series with the same metric > name. > > >>>>>>>>> That > > >>>>>>>>> was commented about in the doc but is not my focus here.) > > >>>>>>>>> > > >>>>>>>>> Rob's doc sees TYPE as special because it is so cheap to just > add > > >>>>>>>>> to > > >>>>>>>>> every data point. That's correct, but it's giving me an itch: > > >>>>>>>>> Should > > >>>>>>>>> we really create different ways of handling metadata, > depending on > > >>>>>>>>> its > > >>>>>>>>> expected size? > > >>>>>>>>> > > >>>>>>>>> Compare this with labels. There is no upper limit to their > number > > >>>>>>>>> or > > >>>>>>>>> size. Still, we have no plan of treating "large" labels > differently > > >>>>>>>>> from "short" labels. > > >>>>>>>>> > > >>>>>>>>> On top of that, we have by now gained the insight that > metadata is > > >>>>>>>>> changing over time and essentially has to be tracked per > series. > > >>>>>>>>> > > >>>>>>>>> Or in other words: From a pure storage perspective, metadata > > >>>>>>>>> behaves > > >>>>>>>>> exactly the same as labels! (There are certainly huge > differences > > >>>>>>>>> semantically, but those only manifest themselves on the query > > >>>>>>>>> level, > > >>>>>>>>> i.e. how you treat it in PromQL etc.) > > >>>>>>>>> > > >>>>>>>>> (This is not exactly a new insight. This is more or less what I > > >>>>>>>>> said > > >>>>>>>>> during the 2016 dev summit, when we first discussed remote > write. > > >>>>>>>>> But > > >>>>>>>>> I don't want to dwell on "told you so" moments... :o) > > >>>>>>>>> > > >>>>>>>>> There is a good reason why we don't just add metadata as > "pseudo > > >>>>>>>>> labels": As discussed a lot in the various design docs > including > > >>>>>>>>> Rob's > > >>>>>>>>> one, it would blow up the data size significantly because HELP > > >>>>>>>>> strings > > >>>>>>>>> tend to be relatively long. > > >>>>>>>>> > > >>>>>>>>> And that's the point where I would like to take a step back: > We are > > >>>>>>>>> discussing to essentially treat something that is structurally > the > > >>>>>>>>> same thing in three different ways: Way 1 for labels as we know > > >>>>>>>>> them. Way 2 for "small" metadata. Way 3 for "big" metadata. > > >>>>>>>>> > > >>>>>>>>> However, while labels tend to be shorter than HELP strings, > there > > >>>>>>>>> is > > >>>>>>>>> the occasional use case with long or many labels. (Infamously, > at > > >>>>>>>>> SoundCloud, a binary accidentally put a whole HTML page into a > > >>>>>>>>> label. That wasn't a use case, it was a bug, but the Prometheus > > >>>>>>>>> server > > >>>>>>>>> ingesting that was just chugging along as if nothing special > had > > >>>>>>>>> happened. It looked weird in the expression browser, > though...) I'm > > >>>>>>>>> sure any vendor offering Prometheus remote storage as a service > > >>>>>>>>> will > > >>>>>>>>> have a customer or two that use excessively long label names. > If we > > >>>>>>>>> have to deal with that, why not bite the bullet and treat > metadata > > >>>>>>>>> in > > >>>>>>>>> the same way as labels in general? Or to phrase it in another > way: > > >>>>>>>>> Any > > >>>>>>>>> solution for "big" metadata could be used for labels, too, to > > >>>>>>>>> alleviate the pain with excessively long label names. > > >>>>>>>>> > > >>>>>>>>> Or most succintly: A robust and really good solution for > > >>>>>>>>> "big" metadata in remote write will make remote write much more > > >>>>>>>>> efficient if applied to labels, too. > > >>>>>>>>> > > >>>>>>>>> Imagine an NALSD tech interview question that boils down to > "design > > >>>>>>>>> Prometheus remote write". I bet that most of the better > candidates > > >>>>>>>>> will recognize that most of the payload will consist of series > > >>>>>>>>> indentifiers (call them labels or whatever) and they will > suggest > > >>>>>>>>> to > > >>>>>>>>> first transmit some kind of index and from then on only > transmit > > >>>>>>>>> short > > >>>>>>>>> series IDs. The best candidates will then find out about all > the > > >>>>>>>>> problems with that: How to keep the protocol stateless, how to > > >>>>>>>>> re-sync > > >>>>>>>>> the index, how to update it if new series arrive etc. Those are > > >>>>>>>>> certainly all good reasons why remote write as we know it does > not > > >>>>>>>>> transfer an index of series IDs. > > >>>>>>>>> > > >>>>>>>>> However, my point here is that we are now discussing exactly > those > > >>>>>>>>> problems when we talk about metadata transmission. Let's solve > > >>>>>>>>> those > > >>>>>>>>> problems and apply them to remote write in general! > > >>>>>>>>> > > >>>>>>>>> Some thoughts about that: > > >>>>>>>>> > > >>>>>>>>> Current remote write essentially transfers all labels for > _every_ > > >>>>>>>>> sample. This works reasonably well. Even if metadata blows up > the > > >>>>>>>>> data > > >>>>>>>>> size by 5x or 10x, transfering the whole index of metadata and > > >>>>>>>>> labels > > >>>>>>>>> should remain feasible as long as we do it less frequently than > > >>>>>>>>> once > > >>>>>>>>> every 10 samples. It's something that could be done each time a > > >>>>>>>>> remote-write receiver connects. From then on, we "only" have to > > >>>>>>>>> track > > >>>>>>>>> when new series (or series with new metadata) show up and > transfer > > >>>>>>>>> those. (I know it's not trivial, but we are already discussing > > >>>>>>>>> possible solutions in the various design docs.) Whenever a > > >>>>>>>>> remote-write receiver gets out of sync for some reason, it can > > >>>>>>>>> simply > > >>>>>>>>> cut the connection and start with a complete re-sync again. As > > >>>>>>>>> long as > > >>>>>>>>> that doesn't happen more often than once every 10 samples, we > still > > >>>>>>>>> have a net gain. Combining this with sharding is another > challenge, > > >>>>>>>>> but it doesn't appear unsolveable. > > >>>>>>>>> > > >>>>>>>>> -- > > >>>>>>>>> Björn Rabenstein > > >>>>>>>>> [PGP-ID] 0x851C3DA17D748D03 > > >>>>>>>>> [email] [email protected] > > >>>>>>>>> > > >>>>>>>> -- > > >>>>>>>> You received this message because you are subscribed to the > Google > > >>>>>>>> Groups "Prometheus Developers" group. > > >>>>>>>> To unsubscribe from this group and stop receiving emails from > it, > > >>>>>>>> send an email to > [email protected] > > >>>>>>>> . > > >>>>>>>> To view this discussion on the web visit > > >>>>>>>> > https://groups.google.com/d/msgid/prometheus-developers/CABakzZaQGfVK5OAfKRP2nxBnp168GML5r_ok_f%3DyVeUdC6e2EQ%40mail.gmail.com > > >>>>>>>> < > https://groups.google.com/d/msgid/prometheus-developers/CABakzZaQGfVK5OAfKRP2nxBnp168GML5r_ok_f%3DyVeUdC6e2EQ%40mail.gmail.com?utm_medium=email&utm_source=footer > > > > >>>>>>>> . > > >>>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> -- > > >>>>>>> Brian Brazil > > >>>>>>> www.robustperception.io > > >>>>>>> > > >>>>>> > > >>>>> > > >>>>> -- > > >>>>> Brian Brazil > > >>>>> www.robustperception.io > > >>>>> > > >>>> -- > > >> You received this message because you are subscribed to the Google > Groups > > >> "Prometheus Developers" group. > > >> To unsubscribe from this group and stop receiving emails from it, > send an > > >> email to [email protected]. > > >> To view this discussion on the web visit > > >> > https://groups.google.com/d/msgid/prometheus-developers/CABakzZb%2BX-ErewAKEyg54_FVRmTSypbnNFmR-8ZayfU_WiTMFw%40mail.gmail.com > > >> < > https://groups.google.com/d/msgid/prometheus-developers/CABakzZb%2BX-ErewAKEyg54_FVRmTSypbnNFmR-8ZayfU_WiTMFw%40mail.gmail.com?utm_medium=email&utm_source=footer > > > > >> . > > >> > > > > > > > -- > > Brian Brazil > > www.robustperception.io > > > > -- > > You received this message because you are subscribed to the Google > Groups "Prometheus Developers" group. > > To unsubscribe from this group and stop receiving emails from it, send > an email to [email protected]. > > To view this discussion on the web visit > https://groups.google.com/d/msgid/prometheus-developers/CAHJKeLouK0PKQMpmuWibEs3%3DDyrEXfN%2BbiUygfak4S_h0k30pw%40mail.gmail.com > . > > -- > Julien Pivotto > @roidelapluie > -- Brian Brazil www.robustperception.io -- You received this message because you are subscribed to the Google Groups "Prometheus Developers" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/CAHJKeLrJwxnMOs%3Dyr_0rnzhWJCZEA-CxGnG1pEuUOGP8eRmSUA%40mail.gmail.com.

