Here's a draft PR that propagates metadata to the WAL so the WAL reader can
read it back: https://github.com/robskillington/prometheus/pull/1/files

I'd like a bit of feedback on the datatypes and structure before going
further, if folks are open to that. A few things are not happening yet:

- The remote write queue manager does not use or send these extra fields yet.
- The head does not reset the "metadata" slice (not sure where the "series"
  slice is reset in the head for pending series writes to the WAL; I want to
  do it in the same place).
- Metadata is not re-written on change yet.
- Tests.

On Sat, Aug 8, 2020 at 9:37 AM Rob Skillington <[email protected]> wrote:

> Sounds good, I've updated the proposal with details on the places in which
> changes are required given the new approach:
>
> https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit#
>
> On Fri, Aug 7, 2020 at 2:09 PM Brian Brazil
> <[email protected]> wrote:
>
>> On Fri, 7 Aug 2020 at 15:48, Rob Skillington <[email protected]> wrote:
>>
>>> True - I mean this could also be a blacklist by config perhaps, so if
>>> you really don't want to have increased egress you can optionally turn
>>> off sending TYPE, HELP, and UNIT, or send them at different frequencies
>>> via config. We could package some sensible defaults so folks don't need
>>> to update their config.
>>>
>>> The main intention is to enable these added features and make it
>>> possible for various consumers to adjust some of these parameters if
>>> required, since backends can be so different in their implementations.
>>> For M3 I would be totally fine with the extra egress, which should be
>>> mitigated fairly considerably by Snappy and the fact that HELP is common
>>> across certain metric families, even though it is received on every
>>> single Remote Write request.
>>
>> That's really a micro-optimisation. If you are that worried about
>> bandwidth you'd run a sidecar specific to your remote backend that was
>> stateful and far more efficient overall.
>> Sending the full label names and values on every request is going to be
>> far more than the overhead of metadata on top of that, so I don't see a
>> need as it stands for any of this to be configurable.
>>
>> Brian
>>
>>> On Fri, Aug 7, 2020 at 3:56 AM Brian Brazil
>>> <[email protected]> wrote:
>>>
>>>> On Thu, 6 Aug 2020 at 22:58, Rob Skillington <[email protected]> wrote:
>>>>
>>>>> Hey Björn,
>>>>>
>>>>> Thanks for the detailed response. I've had a few back and forths on
>>>>> this with Brian and Chris over IRC and CNCF Slack now too.
>>>>>
>>>>> I agree that fundamentally it seems naive to idealistically model
>>>>> this around per metric name. It needs to be per series, given what
>>>>> may happen w.r.t. collisions across targets, etc.
>>>>>
>>>>> Perhaps we can separate this discussion into two considerations:
>>>>>
>>>>> 1) Modeling of the data such that it is kept around for transmission
>>>>> (primarily we're focused on the WAL here).
>>>>>
>>>>> 2) Transmission (which, as you allude to, has many areas for
>>>>> improvement).
>>>>>
>>>>> For (1) - it seems like this needs to be done per time series;
>>>>> thankfully we have already modeled per-series data to be stored just
>>>>> once in a single WAL file. I will write up my proposal here, but it
>>>>> amounts to essentially encoding the HELP, UNIT and TYPE into the WAL
>>>>> per series, similar to how labels for a series are encoded once per
>>>>> series in the WAL. Since this optimization is in place, there's
>>>>> already a huge dampening effect on how expensive it is to write out
>>>>> data about a series (e.g. labels).
>>>>> We can always go and collect a sample WAL file and measure how much
>>>>> extra size HELP, UNIT and TYPE would add, but it seems like it won't
>>>>> fundamentally change the order of magnitude of "information about a
>>>>> timeseries storage size" vs "datapoints about a timeseries storage
>>>>> size". One extra change would be re-encoding the series into the WAL
>>>>> if the HELP changed for that series, just so that when HELP does
>>>>> change it is up to date from the view of whoever is reading the WAL
>>>>> (i.e. the Remote Write loop). Since this entry needs to be loaded
>>>>> into memory for Remote Write today anyway, with string interning as
>>>>> suggested by Chris it won't change the memory profile of a Prometheus
>>>>> with Remote Write enabled algorithmically. There will be some
>>>>> overhead, at most likely similar to the label data, but we aren't
>>>>> altering data structures (so we won't change the big-O magnitude of
>>>>> memory used); we're adding fields to existing data structures, and
>>>>> string interning should make it much less onerous since there is a
>>>>> large duplicative effect with HELP among time series.
>>>>>
>>>>> For (2) - now we have TYPE, HELP and UNIT all available for
>>>>> transmission if we wanted to send them with every single datapoint.
>>>>> While I think we should definitely examine HPACK-like compression
>>>>> features as you mentioned, Björn, I think we should separate that
>>>>> kind of work into a Milestone 2 where this is considered.
>>>>> For the time being it's very plausible we could do some negotiation
>>>>> with the receiving Remote Write endpoint by sending a "GET" to the
>>>>> remote write endpoint and seeing if it responds with a "capabilities
>>>>> + preferences" response: whether the endpoint would like to receive
>>>>> metadata all the time on every single request and let Snappy take
>>>>> care of keeping the size from ballooning too much, or whether it
>>>>> would like TYPE on every single datapoint, and HELP and UNIT every
>>>>> DESIRED_SECONDS or so. To enable a "send HELP every 10 minutes"
>>>>> feature we would have to add a "last sent" timestamp to the data
>>>>> structure that holds the LABELS, TYPE, HELP and UNIT for each series,
>>>>> to know when to resend to that backend, but that seems entirely
>>>>> plausible and would not use more than 4 extra bytes.
>>>>
>>>> Negotiation is fundamentally stateful, as the process that receives
>>>> the first request may be a very different one from the one that
>>>> receives the second - such as if an upgrade is in progress. Remote
>>>> write is intended to be a very simple thing that's easy to implement
>>>> on the receiver end and is a send-only request-based protocol, so
>>>> request-time negotiation is basically out. Any negotiation needs to
>>>> happen via the config file, and even then it'd be better if nothing
>>>> ever needed to be configured. Getting all the users of a remote write
>>>> endpoint to change their config files or restart all their Prometheus
>>>> servers is not an easy task, after all.
>>>>
>>>> Brian
>>>>
>>>>> These thoughts are based on the discussion I've had and the thoughts
>>>>> on this thread. What's the feedback on this before I go ahead and
>>>>> re-iterate the design to more closely map to what I'm suggesting
>>>>> here?
>>>>> Best,
>>>>> Rob
>>>>>
>>>>> On Thu, Aug 6, 2020 at 2:01 PM Bjoern Rabenstein <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> On 03.08.20 03:04, Rob Skillington wrote:
>>>>>> > Ok - I have a proposal which could be broken up into two pieces,
>>>>>> > first delivering TYPE per datapoint, the second consistently and
>>>>>> > reliably HELP and UNIT once per unique metric name:
>>>>>> > https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit#heading=h.bik9uwphqy3g
>>>>>>
>>>>>> Thanks for the doc. I have commented on it, but while doing so, I
>>>>>> felt the urge to comment more generally, which would not fit well
>>>>>> into the margin of a Google doc. My thoughts are also a bit out of
>>>>>> scope of Rob's design doc and more about the general topic of remote
>>>>>> write and the equally general topic of metadata (about which we have
>>>>>> an ongoing discussion among the Prometheus developers).
>>>>>>
>>>>>> Disclaimer: I don't know the remote-write protocol very well. My
>>>>>> hope here is that my somewhat distant perspective is of some value,
>>>>>> as it allows me to take a step back. However, I might just miss
>>>>>> crucial details that completely invalidate my thoughts. We'll see...
>>>>>>
>>>>>> I do care a lot about metadata, though. (And ironically, the reason
>>>>>> why I declared remote write "somebody else's problem" is that I've
>>>>>> always disliked how it fundamentally ignores metadata.)
>>>>>>
>>>>>> Rob's document embraces the fact that metadata can change over time,
>>>>>> but it assumes that at any given time, there is only one set of
>>>>>> metadata per unique metric name. It takes into account that there
>>>>>> can be drift, but it considers that an irregularity that will only
>>>>>> happen occasionally and iron itself out over time.
>>>>>>
>>>>>> In practice, however, metadata can be legitimately and deliberately
>>>>>> different for different time series of the same name.
>>>>>> Instrumentation libraries and even the exposition format inherently
>>>>>> require one set of metadata per metric name, but this is all only
>>>>>> enforced (and meant to be enforced) _per target_. Once the samples
>>>>>> are ingested (or even sent onwards via remote write), they have no
>>>>>> notion of what target they came from. Furthermore, samples created
>>>>>> by rule evaluation don't have an originating target in the first
>>>>>> place. (Which raises the question of metadata for recording rules,
>>>>>> which is another can of worms I'd like to open eventually...)
>>>>>>
>>>>>> (There is also the technical difficulty that the WAL has no notion
>>>>>> of bundling or referencing all the series with the same metric name.
>>>>>> That was commented about in the doc but is not my focus here.)
>>>>>>
>>>>>> Rob's doc sees TYPE as special because it is so cheap to just add to
>>>>>> every data point. That's correct, but it's giving me an itch: Should
>>>>>> we really create different ways of handling metadata, depending on
>>>>>> its expected size?
>>>>>>
>>>>>> Compare this with labels. There is no upper limit to their number or
>>>>>> size. Still, we have no plan of treating "large" labels differently
>>>>>> from "short" labels.
>>>>>>
>>>>>> On top of that, we have by now gained the insight that metadata
>>>>>> changes over time and essentially has to be tracked per series.
>>>>>>
>>>>>> Or in other words: From a pure storage perspective, metadata behaves
>>>>>> exactly the same as labels! (There are certainly huge differences
>>>>>> semantically, but those only manifest themselves at the query level,
>>>>>> i.e. how you treat it in PromQL etc.)
>>>>>>
>>>>>> (This is not exactly a new insight. It is more or less what I said
>>>>>> during the 2016 dev summit, when we first discussed remote write.
>>>>>> But I don't want to dwell on "told you so" moments... :o)
>>>>>>
>>>>>> There is a good reason why we don't just add metadata as "pseudo
>>>>>> labels": As discussed a lot in the various design docs, including
>>>>>> Rob's, it would blow up the data size significantly, because HELP
>>>>>> strings tend to be relatively long.
>>>>>>
>>>>>> And that's the point where I would like to take a step back: We are
>>>>>> discussing essentially treating something that is structurally the
>>>>>> same thing in three different ways: Way 1 for labels as we know
>>>>>> them. Way 2 for "small" metadata. Way 3 for "big" metadata.
>>>>>>
>>>>>> However, while labels tend to be shorter than HELP strings, there is
>>>>>> the occasional use case with long or many labels. (Infamously, at
>>>>>> SoundCloud, a binary accidentally put a whole HTML page into a
>>>>>> label. That wasn't a use case, it was a bug, but the Prometheus
>>>>>> server ingesting it just chugged along as if nothing special had
>>>>>> happened. It looked weird in the expression browser, though...) I'm
>>>>>> sure any vendor offering Prometheus remote storage as a service will
>>>>>> have a customer or two that use excessively long label names. If we
>>>>>> have to deal with that, why not bite the bullet and treat metadata
>>>>>> the same way as labels in general? Or to phrase it another way: Any
>>>>>> solution for "big" metadata could be used for labels, too, to
>>>>>> alleviate the pain with excessively long label names.
>>>>>>
>>>>>> Or most succinctly: A robust and really good solution for "big"
>>>>>> metadata in remote write will make remote write much more efficient
>>>>>> if applied to labels, too.
>>>>>>
>>>>>> Imagine an NALSD tech interview question that boils down to "design
>>>>>> Prometheus remote write".
>>>>>> I bet that most of the better candidates will recognize that most
>>>>>> of the payload will consist of series identifiers (call them labels
>>>>>> or whatever), and they will suggest to first transmit some kind of
>>>>>> index and from then on only transmit short series IDs. The best
>>>>>> candidates will then find out about all the problems with that: How
>>>>>> to keep the protocol stateless, how to re-sync the index, how to
>>>>>> update it if new series arrive, etc. Those are certainly all good
>>>>>> reasons why remote write as we know it does not transfer an index of
>>>>>> series IDs.
>>>>>>
>>>>>> However, my point here is that we are now discussing exactly those
>>>>>> problems when we talk about metadata transmission. Let's solve those
>>>>>> problems and apply them to remote write in general!
>>>>>>
>>>>>> Some thoughts about that:
>>>>>>
>>>>>> Current remote write essentially transfers all labels for _every_
>>>>>> sample. This works reasonably well. Even if metadata blows up the
>>>>>> data size by 5x or 10x, transferring the whole index of metadata and
>>>>>> labels should remain feasible as long as we do it less frequently
>>>>>> than once every 10 samples. It's something that could be done each
>>>>>> time a remote-write receiver connects. From then on, we "only" have
>>>>>> to track when new series (or series with new metadata) show up and
>>>>>> transfer those. (I know it's not trivial, but we are already
>>>>>> discussing possible solutions in the various design docs.) Whenever
>>>>>> a remote-write receiver gets out of sync for some reason, it can
>>>>>> simply cut the connection and start with a complete re-sync again.
>>>>>> As long as that doesn't happen more often than once every 10
>>>>>> samples, we still have a net gain. Combining this with sharding is
>>>>>> another challenge, but it doesn't appear unsolvable.
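The sender side of the "index once, then short series IDs" scheme described
above can be sketched as below. All names are invented for illustration:
assign a small integer ID per label set, send the full label set the first
time an ID is used, and replay everything after a re-sync.

```go
package main

import "fmt"

// indexedSender sketches the index-based transmission scheme: a small
// integer ID per label set, with full labels sent only on first use.
// On loss of sync the receiver cuts the connection and resync() forces
// a full replay. Illustrative names only; not a real implementation.
type indexedSender struct {
	ids  map[string]uint64 // serialized label set -> ID
	next uint64
}

// ref returns the ID for a series and reports whether the full label
// set must still accompany this sample (the receiver hasn't seen it).
func (s *indexedSender) ref(labelSet string) (id uint64, mustSendLabels bool) {
	if id, ok := s.ids[labelSet]; ok {
		return id, false
	}
	id = s.next
	s.next++
	s.ids[labelSet] = id
	return id, true
}

// resync drops all state, forcing full label sets on subsequent sends,
// e.g. after the receiver dropped the connection.
func (s *indexedSender) resync() {
	s.ids = make(map[string]uint64)
	s.next = 0
}

func main() {
	s := &indexedSender{ids: make(map[string]uint64)}
	_, first := s.ref(`{__name__="up",job="api"}`)
	id, again := s.ref(`{__name__="up",job="api"}`)
	fmt.Println(first, again, id) // prints: true false 0
}
```

The hard parts Björn lists (statelessness, re-sync, sharding) live outside
this sketch: the mapping here is per-connection state, which is exactly
what makes the approach a departure from today's stateless protocol.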
>>>>>> --
>>>>>> Björn Rabenstein
>>>>>> [PGP-ID] 0x851C3DA17D748D03
>>>>>> [email] [email protected]
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Prometheus Developers" group.
>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>> send an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/prometheus-developers/CABakzZaQGfVK5OAfKRP2nxBnp168GML5r_ok_f%3DyVeUdC6e2EQ%40mail.gmail.com
>>>>
>>>> --
>>>> Brian Brazil
>>>> www.robustperception.io
>>
>> --
>> Brian Brazil
>> www.robustperception.io

--
You received this message because you are subscribed to the Google Groups
"Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-developers/CABakzZbH6Ghod3AWhmE4H_m%3D%2BAepifQstVgt1JbPZD67x4UCTA%40mail.gmail.com.