Sounds good, I've updated the proposal with details on places in which changes are required given the new approach: https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit#
On Fri, Aug 7, 2020 at 2:09 PM Brian Brazil <[email protected]> wrote:

> On Fri, 7 Aug 2020 at 15:48, Rob Skillington <[email protected]> wrote:
>
>> True - I mean this could also be a blacklist by config perhaps, so if
>> you really don't want to have increased egress you can optionally turn
>> off sending the TYPE, HELP, UNIT or send them at different frequencies
>> via config. We could package some sensible defaults so folks don't
>> need to update their config.
>>
>> The main intention is to enable these added features and make it
>> possible for various consumers to adjust some of these parameters if
>> required, since backends can be so different in their implementation.
>> For M3 I would be totally fine with the extra egress, which should be
>> mitigated fairly considerably by Snappy and the fact that HELP is
>> common across certain metric families, and with receiving it on every
>> single Remote Write request.
>>
>
> That's really a micro-optimisation. If you are that worried about
> bandwidth you'd run a sidecar specific to your remote backend that was
> stateful and far more efficient overall. Sending the full label names
> and values on every request is going to be far more than the overhead
> of metadata on top of that, so I don't see a need as it stands for any
> of this to be configurable.
>
> Brian
>
>> On Fri, Aug 7, 2020 at 3:56 AM Brian Brazil <[email protected]> wrote:
>>
>>> On Thu, 6 Aug 2020 at 22:58, Rob Skillington <[email protected]> wrote:
>>>
>>>> Hey Björn,
>>>>
>>>> Thanks for the detailed response. I've had a few back and forths on
>>>> this with Brian and Chris over IRC and CNCF Slack now too.
>>>>
>>>> I agree that fundamentally it seems naive to idealistically model
>>>> this around per metric name. It needs to be per series given what
>>>> may happen w.r.t. collision across targets, etc.
>>>>
>>>> Perhaps we can separate these discussions into two considerations:
>>>>
>>>> 1) Modeling of the data such that it is kept around for transmission
>>>> (primarily we're focused on WAL here).
>>>>
>>>> 2) Transmission (which, as you allude to, has many areas for
>>>> improvement).
>>>>
>>>> For (1) - it seems like this needs to be done per time series;
>>>> thankfully we have actually already modeled per-series data to be
>>>> stored just once in a single WAL file. I will write up my proposal
>>>> here, but it will amount to essentially encoding the HELP, UNIT and
>>>> TYPE to the WAL per series, similar to how labels for a series are
>>>> encoded once per series in the WAL. Since this optimization is in
>>>> place, there's already a huge dampening effect on how expensive it
>>>> is to write out data about a series (e.g. labels). We can always go
>>>> and collect a sample WAL file and measure how much extra size
>>>> with/without HELP, UNIT and TYPE this would add, but it seems like
>>>> it won't fundamentally change the order of magnitude in terms of
>>>> "information about a timeseries storage size" vs "datapoints about a
>>>> timeseries storage size". One extra change would be re-encoding the
>>>> series into the WAL if the HELP changed for that series, just so
>>>> that when HELP does change it can be up to date from the view of
>>>> whoever is reading the WAL (i.e. the Remote Write loop). Since this
>>>> entry needs to be loaded into memory for Remote Write today anyway,
>>>> with string interning as suggested by Chris, it won't change the
>>>> memory profile algorithmically of a Prometheus with Remote Write
>>>> enabled.
>>>> There will be some overhead that at most would likely be similar to
>>>> the label data, but we aren't altering data structures (so won't
>>>> change big-O magnitude of memory being used), we're adding fields to
>>>> existing data structures, and string interning should actually make
>>>> it much less onerous since there is a large duplicative effect with
>>>> HELP among time series.
>>>>
>>>> For (2) - now we have basically TYPE, HELP and UNIT all available
>>>> for transmission if we wanted to send it with every single
>>>> datapoint. While I think we should definitely examine HPACK-like
>>>> compression features as you mentioned Björn, I think we should think
>>>> more about separating that kind of work into a Milestone 2 where
>>>> this is considered.
>>>>
>>>
>>>> For the time being it's very plausible we could do some negotiation
>>>> with the receiving Remote Write endpoint by sending a "GET" to the
>>>> remote write endpoint and seeing if it responds with a
>>>> "capabilities + preferences" response, specifying whether it would
>>>> like to receive metadata all the time on every single request and
>>>> let Snappy take care of keeping size from ballooning too much, or
>>>> whether it would like TYPE on every single datapoint, and HELP and
>>>> UNIT every DESIRED_SECONDS or so. To enable a "send HELP every 10
>>>> minutes" feature we would have to add to the data structure that
>>>> holds the LABELS, TYPE, HELP and UNIT for each series a "last sent"
>>>> timestamp to know when to resend to that backend, but that seems
>>>> entirely plausible and would not use more than 4 extra bytes.
>>>>
>>>
>>> Negotiation is fundamentally stateful, as the process that receives
>>> the first request may be a very different one from the one that
>>> receives the second - such as if an upgrade is in progress.
>>> Remote write is intended to be a very simple thing that's easy to
>>> implement on the receiver end and is a send-only request-based
>>> protocol, so request-time negotiation is basically out. Any
>>> negotiation needs to happen via the config file, and even then it'd
>>> be better if nothing ever needed to be configured. Getting all the
>>> users of a remote write to change their config file or restart all
>>> their Prometheus servers is not an easy task after all.
>>>
>>> Brian
>>>
>>>> These thoughts are based on the discussion I've had and the thoughts
>>>> on this thread. What's the feedback on this before I go ahead and
>>>> re-iterate the design to more closely map to what I'm suggesting
>>>> here?
>>>>
>>>> Best,
>>>> Rob
>>>>
>>>> On Thu, Aug 6, 2020 at 2:01 PM Bjoern Rabenstein <[email protected]> wrote:
>>>>
>>>>> On 03.08.20 03:04, Rob Skillington wrote:
>>>>> > Ok - I have a proposal which could be broken up into two pieces,
>>>>> > first delivering TYPE per datapoint, the second consistently and
>>>>> > reliably HELP and UNIT once per unique metric name:
>>>>> > https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit#heading=h.bik9uwphqy3g
>>>>>
>>>>> Thanks for the doc. I have commented on it, but while doing so, I
>>>>> felt the urge to comment more generally, which would not fit well
>>>>> into the margin of a Google doc. My thoughts are also a bit out of
>>>>> scope of Rob's design doc and more about the general topic of
>>>>> remote write and the equally general topic of metadata (about which
>>>>> we have an ongoing discussion among the Prometheus developers).
>>>>>
>>>>> Disclaimer: I don't know the remote-write protocol very well. My
>>>>> hope here is that my somewhat distant perspective is of some value
>>>>> as it allows me to take a step back. However, I might just miss
>>>>> crucial details that completely invalidate my thoughts. We'll
>>>>> see...
>>>>>
>>>>> I do care a lot about metadata, though. (And ironically, the reason
>>>>> why I declared remote write "somebody else's problem" is that I've
>>>>> always disliked how it fundamentally ignores metadata.)
>>>>>
>>>>> Rob's document embraces the fact that metadata can change over
>>>>> time, but it assumes that at any given time, there is only one set
>>>>> of metadata per unique metric name. It takes into account that
>>>>> there can be drift, but it considers it an irregularity that will
>>>>> only happen occasionally and iron out over time.
>>>>>
>>>>> In practice, however, metadata can be legitimately and deliberately
>>>>> different for different time series of the same name.
>>>>> Instrumentation libraries and even the exposition format inherently
>>>>> require one set of metadata per metric name, but this is all only
>>>>> enforced (and meant to be enforced) _per target_. Once the samples
>>>>> are ingested (or even sent onwards via remote write), they have no
>>>>> notion of what target they came from. Furthermore, samples created
>>>>> by rule evaluation don't have an originating target in the first
>>>>> place. (Which raises the question of metadata for recording rules,
>>>>> which is another can of worms I'd like to open eventually...)
>>>>>
>>>>> (There is also the technical difficulty that the WAL has no notion
>>>>> of bundling or referencing all the series with the same metric
>>>>> name. That was commented about in the doc but is not my focus
>>>>> here.)
>>>>>
>>>>> Rob's doc sees TYPE as special because it is so cheap to just add
>>>>> to every data point. That's correct, but it's giving me an itch:
>>>>> Should we really create different ways of handling metadata,
>>>>> depending on its expected size?
>>>>>
>>>>> Compare this with labels. There is no upper limit to their number
>>>>> or size. Still, we have no plan of treating "large" labels
>>>>> differently from "short" labels.
>>>>>
>>>>> On top of that, we have by now gained the insight that metadata is
>>>>> changing over time and essentially has to be tracked per series.
>>>>>
>>>>> Or in other words: From a pure storage perspective, metadata
>>>>> behaves exactly the same as labels! (There are certainly huge
>>>>> differences semantically, but those only manifest themselves on the
>>>>> query level, i.e. how you treat it in PromQL etc.)
>>>>>
>>>>> (This is not exactly a new insight. This is more or less what I
>>>>> said during the 2016 dev summit, when we first discussed remote
>>>>> write. But I don't want to dwell on "told you so" moments... :o)
>>>>>
>>>>> There is a good reason why we don't just add metadata as "pseudo
>>>>> labels": As discussed a lot in the various design docs including
>>>>> Rob's one, it would blow up the data size significantly because
>>>>> HELP strings tend to be relatively long.
>>>>>
>>>>> And that's the point where I would like to take a step back: We are
>>>>> discussing essentially treating something that is structurally the
>>>>> same thing in three different ways: Way 1 for labels as we know
>>>>> them. Way 2 for "small" metadata. Way 3 for "big" metadata.
>>>>>
>>>>> However, while labels tend to be shorter than HELP strings, there
>>>>> is the occasional use case with long or many labels. (Infamously,
>>>>> at SoundCloud, a binary accidentally put a whole HTML page into a
>>>>> label. That wasn't a use case, it was a bug, but the Prometheus
>>>>> server ingesting that was just chugging along as if nothing special
>>>>> had happened. It looked weird in the expression browser, though...)
>>>>> I'm sure any vendor offering Prometheus remote storage as a service
>>>>> will have a customer or two that use excessively long label names.
>>>>> If we have to deal with that, why not bite the bullet and treat
>>>>> metadata in the same way as labels in general?
>>>>> Or to phrase it in another way: Any solution for "big" metadata
>>>>> could be used for labels, too, to alleviate the pain with
>>>>> excessively long label names.
>>>>>
>>>>> Or most succinctly: A robust and really good solution for "big"
>>>>> metadata in remote write will make remote write much more efficient
>>>>> if applied to labels, too.
>>>>>
>>>>> Imagine an NALSD tech interview question that boils down to "design
>>>>> Prometheus remote write". I bet that most of the better candidates
>>>>> will recognize that most of the payload will consist of series
>>>>> identifiers (call them labels or whatever) and they will suggest to
>>>>> first transmit some kind of index and from then on only transmit
>>>>> short series IDs. The best candidates will then find out about all
>>>>> the problems with that: How to keep the protocol stateless, how to
>>>>> re-sync the index, how to update it if new series arrive etc. Those
>>>>> are certainly all good reasons why remote write as we know it does
>>>>> not transfer an index of series IDs.
>>>>>
>>>>> However, my point here is that we are now discussing exactly those
>>>>> problems when we talk about metadata transmission. Let's solve
>>>>> those problems and apply the solutions to remote write in general!
>>>>>
>>>>> Some thoughts about that:
>>>>>
>>>>> Current remote write essentially transfers all labels for _every_
>>>>> sample. This works reasonably well. Even if metadata blows up the
>>>>> data size by 5x or 10x, transferring the whole index of metadata
>>>>> and labels should remain feasible as long as we do it less
>>>>> frequently than once every 10 samples. It's something that could be
>>>>> done each time a remote-write receiver connects. From then on, we
>>>>> "only" have to track when new series (or series with new metadata)
>>>>> show up and transfer those. (I know it's not trivial, but we are
>>>>> already discussing possible solutions in the various design docs.)
>>>>> Whenever a remote-write receiver gets out of sync for some reason,
>>>>> it can simply cut the connection and start with a complete re-sync
>>>>> again. As long as that doesn't happen more often than once every 10
>>>>> samples, we still have a net gain. Combining this with sharding is
>>>>> another challenge, but it doesn't appear unsolvable.
>>>>>
>>>>> --
>>>>> Björn Rabenstein
>>>>> [PGP-ID] 0x851C3DA17D748D03
>>>>> [email] [email protected]
>>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Prometheus Developers" group.
>>>> To unsubscribe from this group and stop receiving emails from it,
>>>> send an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/prometheus-developers/CABakzZaQGfVK5OAfKRP2nxBnp168GML5r_ok_f%3DyVeUdC6e2EQ%40mail.gmail.com
>>>>
>>>
>>> --
>>> Brian Brazil
>>> www.robustperception.io
>>>
>
> --
> Brian Brazil
> www.robustperception.io

--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/CABakzZZMJeZwJGVy%2B92gWUjLFeU_g9CasUqJ9i8qd%3D9dwWkMTg%40mail.gmail.com.
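
For readers following the thread, the per-series "last sent" timestamp idea quoted above (TYPE with every datapoint, HELP and UNIT only every so often, metadata refreshed when HELP changes, strings interned) could look roughly like the sketch below. This is an illustration only, not Prometheus code: `SeriesMeta`, `MetadataTracker`, `resend_interval_s` and the dict-shaped payload are all hypothetical names, and the resend interval is assumed to come from the config file rather than from request-time negotiation, in line with Brian's objection.

```python
import sys
from dataclasses import dataclass


@dataclass
class SeriesMeta:
    """Hypothetical per-series record: labels plus TYPE, HELP and UNIT."""
    labels: tuple
    mtype: str
    help_text: str
    unit: str
    # Time HELP/UNIT were last sent; -inf means "never sent yet".
    last_sent: float = float("-inf")


class MetadataTracker:
    """Decides which metadata accompanies a sample for a given series."""

    def __init__(self, resend_interval_s):
        # Assumed to be read from configuration, not negotiated.
        self.resend_interval_s = resend_interval_s
        self._series = {}

    def upsert(self, ref, labels, mtype, help_text, unit):
        # Interning lets series that share a HELP string share one object.
        help_text = sys.intern(help_text)
        unit = sys.intern(unit)
        old = self._series.get(ref)
        changed = old is None or (old.mtype, old.help_text, old.unit) != (
            mtype, help_text, unit)
        if changed:
            # A new record resets last_sent, so updated metadata goes out
            # with the next sample for this series.
            self._series[ref] = SeriesMeta(tuple(labels), mtype, help_text, unit)

    def metadata_to_send(self, ref, now):
        meta = self._series[ref]
        if now - meta.last_sent >= self.resend_interval_s:
            meta.last_sent = now
            return {"type": meta.mtype, "help": meta.help_text, "unit": meta.unit}
        # Between resends only the cheap TYPE field is attached.
        return {"type": meta.mtype}


tracker = MetadataTracker(resend_interval_s=600)
labels = (("__name__", "http_requests_total"),)
tracker.upsert(1, labels, "counter", "Total HTTP requests.", "")
first = tracker.metadata_to_send(1, now=0.0)     # full metadata
second = tracker.metadata_to_send(1, now=300.0)  # TYPE only
third = tracker.metadata_to_send(1, now=600.0)   # full metadata again
```

Keeping the timestamp next to the labels mirrors the suggestion of adding a field to the existing per-series structure rather than introducing a new one. It does nothing about the statelessness and re-sync concerns raised elsewhere in the thread.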

