On Tue, 11 Aug 2020 at 04:09, Callum Styan <[email protected]> wrote:
> I'm hesitant to add anything that significantly increases the network
> bandwidth usage of remote write while at the same time not giving users a
> way to tune the usage to their needs.
>
> I agree with Brian that we don't want the protocol itself to become
> stateful by introducing something like negotiation. I'd also prefer not to
> introduce multiple ways to do things, though I'm hoping we can find a way
> to accommodate your use case while not ballooning average users' network
> egress bill.
>
> I am fine with forcing the consuming end to be somewhat stateful like in
> the case of Josh's PR where all metadata is sent periodically and must be
> stored by the remote storage system.
>
> Overall I'd like to see some numbers regarding current network bandwidth
> of remote write, remote write with metadata via Josh's PR, and remote write
> with sending metadata for every series in a remote write payload.
>

I agree, I noticed that in Rob's PR and had the same thought.

Brian

>
> Rob, I'll review your PR tomorrow but it looks like Julien and Brian may
> already have that covered.
>
> On Sun, Aug 9, 2020 at 9:36 PM Rob Skillington <[email protected]>
> wrote:
>
>> Update: The PR now sends the fields over remote write from the WAL and
>> metadata is also updated in the WAL when any field changes.
>>
>> Now opened the PR against the primary repo:
>> https://github.com/prometheus/prometheus/pull/7771
>>
>> I have tested this end-to-end with a modified M3 branch:
>> https://github.com/m3db/m3/compare/r/test-prometheus-metadata
>> > {... "msg":"received series","labels":"{__name__="prometheus_rule_group_...
>> > iterations_total",instance="localhost:9090",job="prometheus01",role=...
>> > "remote"}","type":"counter","unit":"","help":"The total number of scheduled...
>> > rule group evaluations, whether executed or missed."}
>>
>> Tests still haven't been updated. Any feedback on the approach /
>> data structures would be greatly appreciated.
>>
>> Would be good to know what others' thoughts are on next steps.
>>
>> On Sat, Aug 8, 2020 at 11:21 AM Rob Skillington <[email protected]>
>> wrote:
>>
>>> Here's a draft PR that propagates metadata to the WAL and the WAL
>>> reader can read it back:
>>> https://github.com/robskillington/prometheus/pull/1/files
>>>
>>> Would like a little bit of feedback on the datatypes and structure
>>> before going further if folks are open to that.
>>>
>>> There's a few things not happening:
>>> - Remote write queue manager does not use or send these extra fields yet.
>>> - Head does not reset the "metadata" slice (not sure where "series"
>>>   slice is reset in the head for pending series writes to WAL, want to
>>>   do in same place).
>>> - Metadata is not re-written on change yet.
>>> - Tests.
>>>
>>>
>>> On Sat, Aug 8, 2020 at 9:37 AM Rob Skillington <[email protected]>
>>> wrote:
>>>
>>>> Sounds good, I've updated the proposal with details on places in which
>>>> changes are required given the new approach:
>>>>
>>>> https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit#
>>>>
>>>>
>>>> On Fri, Aug 7, 2020 at 2:09 PM Brian Brazil <
>>>> [email protected]> wrote:
>>>>
>>>>> On Fri, 7 Aug 2020 at 15:48, Rob Skillington <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> True - I mean this could also be a blacklist by config perhaps, so if
>>>>>> you really don't want to have increased egress you can optionally turn
>>>>>> off sending the TYPE, HELP, UNIT or send them at different frequencies
>>>>>> via config. We could package some sensible defaults so folks don't need
>>>>>> to update their config.
>>>>>>
>>>>>> The main intention is to enable these added features and make it
>>>>>> possible for various consumers to be able to adjust some of these
>>>>>> parameters if required since backends can be so different in their
>>>>>> implementation.
>>>>>> For M3 I would be totally fine with the extra egress, which should be
>>>>>> mitigated fairly considerably by Snappy and the fact that HELP is
>>>>>> common across certain metric families, and with receiving it on every
>>>>>> single Remote Write request.
>>>>>>
>>>>>
>>>>> That's really a micro-optimisation. If you are that worried about
>>>>> bandwidth you'd run a sidecar specific to your remote backend that was
>>>>> stateful and far more efficient overall. Sending the full label names and
>>>>> values on every request is going to be far more than the overhead of
>>>>> metadata on top of that, so I don't see a need as it stands for any of
>>>>> this to be configurable.
>>>>>
>>>>> Brian
>>>>>
>>>>>
>>>>>>
>>>>>> On Fri, Aug 7, 2020 at 3:56 AM Brian Brazil <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> On Thu, 6 Aug 2020 at 22:58, Rob Skillington <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Björn,
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for the detailed response. I've had a few back and forths on
>>>>>>>> this with Brian and Chris over IRC and CNCF Slack now too.
>>>>>>>>
>>>>>>>> I agree that fundamentally it seems naive to idealistically model
>>>>>>>> this around per metric name. It needs to be per series given what may
>>>>>>>> happen w.r.t. collision across targets, etc.
>>>>>>>>
>>>>>>>> Perhaps we can separate these discussions into two considerations:
>>>>>>>>
>>>>>>>> 1) Modeling of the data such that it is kept around for
>>>>>>>> transmission (primarily we're focused on WAL here).
>>>>>>>>
>>>>>>>> 2) Transmission (which, as you allude to, has many areas for
>>>>>>>> improvement).
>>>>>>>>
>>>>>>>> For (1) - it seems like this needs to be done per time series,
>>>>>>>> thankfully we actually already have modeled this to be stored per
>>>>>>>> series data just once in a single WAL file.
>>>>>>>> I will write up my proposal here, but it will amount to
>>>>>>>> essentially encoding the HELP, UNIT and TYPE to the WAL per series,
>>>>>>>> similar to how labels for a series are encoded once per series in the
>>>>>>>> WAL. Since this optimization is in place, there's already a huge
>>>>>>>> dampening effect on how expensive it is to write out data about a
>>>>>>>> series (e.g. labels). We can always go and collect a sample WAL file
>>>>>>>> and measure how much extra size with/without HELP, UNIT and TYPE this
>>>>>>>> would add, but it seems like it won't fundamentally change the order
>>>>>>>> of magnitude in terms of "information about a timeseries storage
>>>>>>>> size" vs "datapoints about a timeseries storage size". One extra
>>>>>>>> change would be re-encoding the series into the WAL if the HELP
>>>>>>>> changed for that series, just so that when HELP does change it can be
>>>>>>>> up to date from the view of whoever is reading the WAL (i.e. the
>>>>>>>> Remote Write loop). Since this entry needs to be loaded into memory
>>>>>>>> for Remote Write today anyway, with string interning as suggested by
>>>>>>>> Chris, it won't change the memory profile algorithmically of a
>>>>>>>> Prometheus with Remote Write enabled. There will be some overhead
>>>>>>>> that at most would likely be similar to the label data, but we aren't
>>>>>>>> altering data structures (so won't change big-O magnitude of memory
>>>>>>>> being used), we're adding fields to existing data structures, and
>>>>>>>> string interning should actually make it much less onerous since
>>>>>>>> there is a large duplicative effect with HELP among time series.
>>>>>>>>
>>>>>>>> For (2) - now we have basically TYPE, HELP and UNIT all available for
>>>>>>>> transmission if we wanted to send it with every single datapoint.
>>>>>>>> While I think we should definitely examine HPACK like compression
>>>>>>>> features as you mentioned Björn, I think we should think more about
>>>>>>>> separating that kind of work into a Milestone 2 where this is
>>>>>>>> considered.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> For the time being it's very plausible we could do some negotiation
>>>>>>>> of the receiving Remote Write endpoint by sending a "GET" to the
>>>>>>>> remote write endpoint and seeing if it responds with a "capabilities
>>>>>>>> + preferences" response, and if the endpoint specifies that it would
>>>>>>>> like to receive metadata all the time on every single request and let
>>>>>>>> Snappy take care of keeping size not ballooning too much, or if it
>>>>>>>> would like TYPE on every single datapoint, and HELP and UNIT every
>>>>>>>> DESIRED_SECONDS or so. To enable a "send HELP every 10 minutes"
>>>>>>>> feature we would have to add to the datastructure that holds the
>>>>>>>> LABELS, TYPE, HELP and UNIT for each series a "last sent" timestamp
>>>>>>>> to know when to resend to that backend, but that seems entirely
>>>>>>>> plausible and would not use more than 4 extra bytes.
>>>>>>>>
>>>>>>>
>>>>>>> Negotiation is fundamentally stateful, as the process that receives
>>>>>>> the first request may be a very different one from the one that
>>>>>>> receives the second - such as if an upgrade is in progress. Remote
>>>>>>> write is intended to be a very simple thing that's easy to implement
>>>>>>> on the receiver end and is a send-only request-based protocol, so
>>>>>>> request-time negotiation is basically out. Any negotiation needs to
>>>>>>> happen via the config file, and even then it'd be better if nothing
>>>>>>> ever needed to be configured.
>>>>>>> Getting all the users of a remote write to change their config file
>>>>>>> or restart all their Prometheus servers is not an easy task after all.
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> These thoughts are based on the discussion I've had and the
>>>>>>>> thoughts on this thread. What's the feedback on this before I go
>>>>>>>> ahead and re-iterate the design to more closely map to what I'm
>>>>>>>> suggesting here?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Rob
>>>>>>>>
>>>>>>>> On Thu, Aug 6, 2020 at 2:01 PM Bjoern Rabenstein <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> On 03.08.20 03:04, Rob Skillington wrote:
>>>>>>>>> > Ok - I have a proposal which could be broken up into two pieces,
>>>>>>>>> > first delivering TYPE per datapoint, the second consistently and
>>>>>>>>> > reliably HELP and UNIT once per unique metric name:
>>>>>>>>> > https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit#heading=h.bik9uwphqy3g
>>>>>>>>>
>>>>>>>>> Thanks for the doc. I have commented on it, but while doing so, I
>>>>>>>>> felt the urge to comment more generally, which would not fit well
>>>>>>>>> into the margin of a Google doc. My thoughts are also a bit out of
>>>>>>>>> scope of Rob's design doc and more about the general topic of remote
>>>>>>>>> write and the equally general topic of metadata (about which we have
>>>>>>>>> an ongoing discussion among the Prometheus developers).
>>>>>>>>>
>>>>>>>>> Disclaimer: I don't know the remote-write protocol very well. My
>>>>>>>>> hope here is that my somewhat distant perspective is of some value
>>>>>>>>> as it allows taking a step back. However, I might just miss crucial
>>>>>>>>> details that completely invalidate my thoughts. We'll see...
>>>>>>>>>
>>>>>>>>> I do care a lot about metadata, though.
>>>>>>>>> (And ironically, the reason why I declared remote write "somebody
>>>>>>>>> else's problem" is that I've always disliked how it fundamentally
>>>>>>>>> ignores metadata.)
>>>>>>>>>
>>>>>>>>> Rob's document embraces the fact that metadata can change over time,
>>>>>>>>> but it assumes that at any given time, there is only one set of
>>>>>>>>> metadata per unique metric name. It takes into account that there
>>>>>>>>> can be drift, but it considers it an irregularity that will only
>>>>>>>>> happen occasionally and iron out over time.
>>>>>>>>>
>>>>>>>>> In practice, however, metadata can be legitimately and deliberately
>>>>>>>>> different for different time series of the same name.
>>>>>>>>> Instrumentation libraries and even the exposition format inherently
>>>>>>>>> require one set of metadata per metric name, but this is all only
>>>>>>>>> enforced (and meant to be enforced) _per target_. Once the samples
>>>>>>>>> are ingested (or even sent onwards via remote write), they have no
>>>>>>>>> notion of what target they came from. Furthermore, samples created
>>>>>>>>> by rule evaluation don't have an originating target in the first
>>>>>>>>> place. (Which raises the question of metadata for recording rules,
>>>>>>>>> which is another can of worms I'd like to open eventually...)
>>>>>>>>>
>>>>>>>>> (There is also the technical difficulty that the WAL has no notion
>>>>>>>>> of bundling or referencing all the series with the same metric name.
>>>>>>>>> That was commented about in the doc but is not my focus here.)
>>>>>>>>>
>>>>>>>>> Rob's doc sees TYPE as special because it is so cheap to just add to
>>>>>>>>> every data point. That's correct, but it's giving me an itch: Should
>>>>>>>>> we really create different ways of handling metadata, depending on
>>>>>>>>> its expected size?
>>>>>>>>>
>>>>>>>>> Compare this with labels. There is no upper limit to their number or
>>>>>>>>> size. Still, we have no plan of treating "large" labels differently
>>>>>>>>> from "short" labels.
>>>>>>>>>
>>>>>>>>> On top of that, we have by now gained the insight that metadata is
>>>>>>>>> changing over time and essentially has to be tracked per series.
>>>>>>>>>
>>>>>>>>> Or in other words: From a pure storage perspective, metadata behaves
>>>>>>>>> exactly the same as labels! (There are certainly huge differences
>>>>>>>>> semantically, but those only manifest themselves on the query level,
>>>>>>>>> i.e. how you treat it in PromQL etc.)
>>>>>>>>>
>>>>>>>>> (This is not exactly a new insight. This is more or less what I said
>>>>>>>>> during the 2016 dev summit, when we first discussed remote write.
>>>>>>>>> But I don't want to dwell on "told you so" moments... :o)
>>>>>>>>>
>>>>>>>>> There is a good reason why we don't just add metadata as "pseudo
>>>>>>>>> labels": As discussed a lot in the various design docs including
>>>>>>>>> Rob's one, it would blow up the data size significantly because HELP
>>>>>>>>> strings tend to be relatively long.
>>>>>>>>>
>>>>>>>>> And that's the point where I would like to take a step back: We are
>>>>>>>>> discussing to essentially treat something that is structurally the
>>>>>>>>> same thing in three different ways: Way 1 for labels as we know
>>>>>>>>> them. Way 2 for "small" metadata. Way 3 for "big" metadata.
>>>>>>>>>
>>>>>>>>> However, while labels tend to be shorter than HELP strings, there is
>>>>>>>>> the occasional use case with long or many labels. (Infamously, at
>>>>>>>>> SoundCloud, a binary accidentally put a whole HTML page into a
>>>>>>>>> label. That wasn't a use case, it was a bug, but the Prometheus
>>>>>>>>> server ingesting that was just chugging along as if nothing special
>>>>>>>>> had happened.
>>>>>>>>> It looked weird in the expression browser, though...) I'm sure any
>>>>>>>>> vendor offering Prometheus remote storage as a service will have a
>>>>>>>>> customer or two that use excessively long label names. If we have to
>>>>>>>>> deal with that, why not bite the bullet and treat metadata in the
>>>>>>>>> same way as labels in general? Or to phrase it in another way: Any
>>>>>>>>> solution for "big" metadata could be used for labels, too, to
>>>>>>>>> alleviate the pain with excessively long label names.
>>>>>>>>>
>>>>>>>>> Or most succinctly: A robust and really good solution for "big"
>>>>>>>>> metadata in remote write will make remote write much more efficient
>>>>>>>>> if applied to labels, too.
>>>>>>>>>
>>>>>>>>> Imagine an NALSD tech interview question that boils down to "design
>>>>>>>>> Prometheus remote write". I bet that most of the better candidates
>>>>>>>>> will recognize that most of the payload will consist of series
>>>>>>>>> identifiers (call them labels or whatever) and they will suggest to
>>>>>>>>> first transmit some kind of index and from then on only transmit
>>>>>>>>> short series IDs. The best candidates will then find out about all
>>>>>>>>> the problems with that: How to keep the protocol stateless, how to
>>>>>>>>> re-sync the index, how to update it if new series arrive etc. Those
>>>>>>>>> are certainly all good reasons why remote write as we know it does
>>>>>>>>> not transfer an index of series IDs.
>>>>>>>>>
>>>>>>>>> However, my point here is that we are now discussing exactly those
>>>>>>>>> problems when we talk about metadata transmission. Let's solve those
>>>>>>>>> problems and apply the solutions to remote write in general!
>>>>>>>>>
>>>>>>>>> Some thoughts about that:
>>>>>>>>>
>>>>>>>>> Current remote write essentially transfers all labels for _every_
>>>>>>>>> sample. This works reasonably well.
>>>>>>>>> Even if metadata blows up the data size by 5x or 10x, transferring
>>>>>>>>> the whole index of metadata and labels should remain feasible as
>>>>>>>>> long as we do it less frequently than once every 10 samples. It's
>>>>>>>>> something that could be done each time a remote-write receiver
>>>>>>>>> connects. From then on, we "only" have to track when new series (or
>>>>>>>>> series with new metadata) show up and transfer those. (I know it's
>>>>>>>>> not trivial, but we are already discussing possible solutions in the
>>>>>>>>> various design docs.) Whenever a remote-write receiver gets out of
>>>>>>>>> sync for some reason, it can simply cut the connection and start
>>>>>>>>> with a complete re-sync again. As long as that doesn't happen more
>>>>>>>>> often than once every 10 samples, we still have a net gain.
>>>>>>>>> Combining this with sharding is another challenge, but it doesn't
>>>>>>>>> appear unsolvable.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Björn Rabenstein
>>>>>>>>> [PGP-ID] 0x851C3DA17D748D03
>>>>>>>>> [email] [email protected]
>>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "Prometheus Developers" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to [email protected].
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/prometheus-developers/CABakzZaQGfVK5OAfKRP2nxBnp168GML5r_ok_f%3DyVeUdC6e2EQ%40mail.gmail.com
>>>>>>>> <https://groups.google.com/d/msgid/prometheus-developers/CABakzZaQGfVK5OAfKRP2nxBnp168GML5r_ok_f%3DyVeUdC6e2EQ%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Brian Brazil
>>>>>>> www.robustperception.io
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Brian Brazil
>>>>> www.robustperception.io
>>>>>
>>>>
>

--
Brian Brazil
www.robustperception.io

--
You received this message because you are subscribed to the Google Groups
"Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-developers/CAHJKeLouK0PKQMpmuWibEs3%3DDyrEXfN%2BbiUygfak4S_h0k30pw%40mail.gmail.com.
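
For anyone skimming the thread, here is a rough sketch of the per-series
"last sent" timestamp idea from Rob's message quoted above ("send HELP every
10 minutes"). It is only an illustration under stated assumptions: the names
SeriesMeta, should_send_metadata and plan_sends are hypothetical and are not
taken from Prometheus or from any of the PRs, times are plain Unix seconds,
and the sketch says nothing about the 4-byte size estimate or about how the
field would be stored next to the labels.

```python
from dataclasses import dataclass


@dataclass
class SeriesMeta:
    """Hypothetical per-series record; the names are illustrative only."""
    type: str
    help: str
    unit: str
    last_sent: int = 0  # Unix seconds of the last metadata send; 0 = never


def should_send_metadata(meta: SeriesMeta, now: int, interval: int) -> bool:
    """Return True if HELP and UNIT are due for a resend at time `now`."""
    return meta.last_sent == 0 or now - meta.last_sent >= interval


def plan_sends(meta: SeriesMeta, request_times, interval: int):
    """Decide, for each request time, whether metadata goes along with it."""
    decisions = []
    for now in request_times:
        send = should_send_metadata(meta, now, interval)
        if send:
            # Remember when metadata last went out for this series.
            meta.last_sent = now
        decisions.append(send)
    return decisions


if __name__ == "__main__":
    meta = SeriesMeta(type="counter", help="example help text", unit="")
    # Requests every 5 minutes with a 10-minute resend interval.
    print(plan_sends(meta, [1000, 1300, 1600, 1900], interval=600))
```

With requests every 5 minutes and a 10-minute interval, metadata would ride
along on every second request. Note that this keeps the state on the sending
side only, so it does not need the request-time negotiation that Brian
argues against.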

