Here's a draft PR that propagates metadata to the WAL so the WAL reader can
read it back: https://github.com/robskillington/prometheus/pull/1/files

I'd like a bit of feedback on the datatypes and structure before going
further, if folks are open to that. A few things are not happening yet:

- The remote write queue manager does not use or send these extra fields yet.
- The head does not reset the "metadata" slice (not sure where the "series"
  slice is reset in the head for pending series writes to the WAL; I want to
  do it in the same place).
- Metadata is not re-written on change yet.
- Tests.

On Sat, Aug 8, 2020 at 9:37 AM Rob Skillington <[email protected]> wrote:

> Sounds good, I've updated the proposal with details on the places in which
> changes are required given the new approach:
>
> https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit#
>
> On Fri, Aug 7, 2020 at 2:09 PM Brian Brazil
> <[email protected]> wrote:
>
>> On Fri, 7 Aug 2020 at 15:48, Rob Skillington <[email protected]> wrote:
>>
>>> True - I mean this could also be a blacklist by config perhaps, so if
>>> you really don't want to have increased egress you can optionally turn
>>> off sending TYPE, HELP, and UNIT, or send them at different frequencies
>>> via config. We could package some sensible defaults so folks don't need
>>> to update their config.
>>>
>>> The main intention is to enable these added features and make it
>>> possible for various consumers to adjust some of these parameters if
>>> required, since backends can be so different in their implementations.
>>> For M3 I would be totally fine with the extra egress, which should be
>>> mitigated fairly considerably by Snappy and the fact that HELP is common
>>> across certain metric families, even though it is received on every
>>> single Remote Write request.
>>
>> That's really a micro-optimisation. If you are that worried about
>> bandwidth you'd run a sidecar specific to your remote backend that was
>> stateful and far more efficient overall.
>> Sending the full label names and values on every request is going to be
>> far more than the overhead of metadata on top of that, so I don't see a
>> need as it stands for any of this to be configurable.
>>
>> Brian
>>
>>> On Fri, Aug 7, 2020 at 3:56 AM Brian Brazil
>>> <[email protected]> wrote:
>>>
>>>> On Thu, 6 Aug 2020 at 22:58, Rob Skillington <[email protected]> wrote:
>>>>
>>>>> Hey Björn,
>>>>>
>>>>> Thanks for the detailed response. I've had a few back and forths on
>>>>> this with Brian and Chris over IRC and CNCF Slack now too.
>>>>>
>>>>> I agree that fundamentally it seems naive to idealistically model
>>>>> this around per metric name. It needs to be per series, given what
>>>>> may happen w.r.t. collisions across targets, etc.
>>>>>
>>>>> Perhaps we can separate this discussion into two considerations:
>>>>>
>>>>> 1) Modeling of the data such that it is kept around for transmission
>>>>> (primarily we're focused on the WAL here).
>>>>>
>>>>> 2) Transmission (which, as you allude to, has many areas for
>>>>> improvement).
>>>>>
>>>>> For (1) - it seems like this needs to be done per time series;
>>>>> thankfully we have already modeled per-series data to be stored just
>>>>> once in a single WAL file. I will write up my proposal here, but it
>>>>> amounts to essentially encoding the HELP, UNIT and TYPE into the WAL
>>>>> per series, similar to how labels for a series are encoded once per
>>>>> series in the WAL. Since this optimization is in place, there's
>>>>> already a huge dampening effect on how expensive it is to write out
>>>>> data about a series (e.g. labels).
>>>>> We can always go and collect a sample WAL file and measure how much
>>>>> extra size HELP, UNIT and TYPE would add, but it seems like it won't
>>>>> fundamentally change the order of magnitude of "information about a
>>>>> timeseries storage size" vs "datapoints about a timeseries storage
>>>>> size". One extra change would be re-encoding the series into the WAL
>>>>> if the HELP changed for that series, just so that when HELP does
>>>>> change it is up to date from the view of whoever is reading the WAL
>>>>> (i.e. the Remote Write loop). Since this entry needs to be loaded
>>>>> into memory for Remote Write today anyway, with string interning as
>>>>> suggested by Chris it won't change the memory profile of a Prometheus
>>>>> with Remote Write enabled algorithmically. There will be some
>>>>> overhead, at most likely similar to the label data, but we aren't
>>>>> altering data structures (so we won't change the big-O magnitude of
>>>>> memory used); we're adding fields to existing data structures, and
>>>>> string interning should make it much less onerous since there is a
>>>>> large duplicative effect with HELP among time series.
>>>>>
>>>>> For (2) - now we have TYPE, HELP and UNIT all available for
>>>>> transmission if we wanted to send them with every single datapoint.
>>>>> While I think we should definitely examine HPACK-like compression
>>>>> features as you mentioned, Björn, I think we should separate that
>>>>> kind of work into a Milestone 2 where this is considered.
>>>>> For the time being it's very plausible we could do some negotiation
>>>>> with the receiving Remote Write endpoint by sending a "GET" to the
>>>>> remote write endpoint and seeing if it responds with a "capabilities
>>>>> + preferences" response: whether the endpoint would like to receive
>>>>> metadata all the time on every single request and let Snappy take
>>>>> care of keeping the size from ballooning too much, or whether it
>>>>> would like TYPE on every single datapoint, and HELP and UNIT every
>>>>> DESIRED_SECONDS or so. To enable a "send HELP every 10 minutes"
>>>>> feature we would have to add a "last sent" timestamp to the data
>>>>> structure that holds the LABELS, TYPE, HELP and UNIT for each series,
>>>>> to know when to resend to that backend, but that seems entirely
>>>>> plausible and would not use more than 4 extra bytes.
>>>>
>>>> Negotiation is fundamentally stateful, as the process that receives
>>>> the first request may be a very different one from the one that
>>>> receives the second - such as if an upgrade is in progress. Remote
>>>> write is intended to be a very simple thing that's easy to implement
>>>> on the receiver end and is a send-only request-based protocol, so
>>>> request-time negotiation is basically out. Any negotiation needs to
>>>> happen via the config file, and even then it'd be better if nothing
>>>> ever needed to be configured. Getting all the users of a remote write
>>>> endpoint to change their config files or restart all their Prometheus
>>>> servers is not an easy task, after all.
>>>>
>>>> Brian
>>>>
>>>>> These thoughts are based on the discussion I've had and the thoughts
>>>>> on this thread. What's the feedback on this before I go ahead and
>>>>> re-iterate the design to more closely map to what I'm suggesting
>>>>> here?
>>>>> Best,
>>>>> Rob
>>>>>
>>>>> On Thu, Aug 6, 2020 at 2:01 PM Bjoern Rabenstein <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> On 03.08.20 03:04, Rob Skillington wrote:
>>>>>> > Ok - I have a proposal which could be broken up into two pieces,
>>>>>> > first delivering TYPE per datapoint, the second consistently and
>>>>>> > reliably HELP and UNIT once per unique metric name:
>>>>>> > https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit#heading=h.bik9uwphqy3g
>>>>>>
>>>>>> Thanks for the doc. I have commented on it, but while doing so, I
>>>>>> felt the urge to comment more generally, which would not fit well
>>>>>> into the margin of a Google doc. My thoughts are also a bit out of
>>>>>> scope of Rob's design doc and more about the general topic of remote
>>>>>> write and the equally general topic of metadata (about which we have
>>>>>> an ongoing discussion among the Prometheus developers).
>>>>>>
>>>>>> Disclaimer: I don't know the remote-write protocol very well. My
>>>>>> hope here is that my somewhat distant perspective is of some value,
>>>>>> as it allows me to take a step back. However, I might just miss
>>>>>> crucial details that completely invalidate my thoughts. We'll see...
>>>>>>
>>>>>> I do care a lot about metadata, though. (And ironically, the reason
>>>>>> why I declared remote write "somebody else's problem" is that I've
>>>>>> always disliked how it fundamentally ignores metadata.)
>>>>>>
>>>>>> Rob's document embraces the fact that metadata can change over time,
>>>>>> but it assumes that at any given time, there is only one set of
>>>>>> metadata per unique metric name. It takes into account that there
>>>>>> can be drift, but it considers that an irregularity that will only
>>>>>> happen occasionally and iron itself out over time.
>>>>>>
>>>>>> In practice, however, metadata can be legitimately and deliberately
>>>>>> different for different time series of the same name.
>>>>>> Instrumentation libraries and even the exposition format inherently
>>>>>> require one set of metadata per metric name, but this is all only
>>>>>> enforced (and meant to be enforced) _per target_. Once the samples
>>>>>> are ingested (or even sent onwards via remote write), they have no
>>>>>> notion of what target they came from. Furthermore, samples created
>>>>>> by rule evaluation don't have an originating target in the first
>>>>>> place. (Which raises the question of metadata for recording rules,
>>>>>> which is another can of worms I'd like to open eventually...)
>>>>>>
>>>>>> (There is also the technical difficulty that the WAL has no notion
>>>>>> of bundling or referencing all the series with the same metric name.
>>>>>> That was commented about in the doc but is not my focus here.)
>>>>>>
>>>>>> Rob's doc sees TYPE as special because it is so cheap to just add to
>>>>>> every data point. That's correct, but it's giving me an itch: Should
>>>>>> we really create different ways of handling metadata, depending on
>>>>>> its expected size?
>>>>>>
>>>>>> Compare this with labels. There is no upper limit to their number or
>>>>>> size. Still, we have no plan of treating "large" labels differently
>>>>>> from "short" labels.
>>>>>>
>>>>>> On top of that, we have by now gained the insight that metadata
>>>>>> changes over time and essentially has to be tracked per series.
>>>>>>
>>>>>> Or in other words: From a pure storage perspective, metadata behaves
>>>>>> exactly the same as labels! (There are certainly huge differences
>>>>>> semantically, but those only manifest themselves at the query level,
>>>>>> i.e. how you treat it in PromQL etc.)
>>>>>>
>>>>>> (This is not exactly a new insight. It is more or less what I said
>>>>>> during the 2016 dev summit, when we first discussed remote write.
>>>>>> But I don't want to dwell on "told you so" moments... :o)
>>>>>>
>>>>>> There is a good reason why we don't just add metadata as "pseudo
>>>>>> labels": As discussed a lot in the various design docs, including
>>>>>> Rob's, it would blow up the data size significantly, because HELP
>>>>>> strings tend to be relatively long.
>>>>>>
>>>>>> And that's the point where I would like to take a step back: We are
>>>>>> discussing essentially treating something that is structurally the
>>>>>> same thing in three different ways: Way 1 for labels as we know
>>>>>> them. Way 2 for "small" metadata. Way 3 for "big" metadata.
>>>>>>
>>>>>> However, while labels tend to be shorter than HELP strings, there is
>>>>>> the occasional use case with long or many labels. (Infamously, at
>>>>>> SoundCloud, a binary accidentally put a whole HTML page into a
>>>>>> label. That wasn't a use case, it was a bug, but the Prometheus
>>>>>> server ingesting it just chugged along as if nothing special had
>>>>>> happened. It looked weird in the expression browser, though...) I'm
>>>>>> sure any vendor offering Prometheus remote storage as a service will
>>>>>> have a customer or two that use excessively long label names. If we
>>>>>> have to deal with that, why not bite the bullet and treat metadata
>>>>>> the same way as labels in general? Or to phrase it another way: Any
>>>>>> solution for "big" metadata could be used for labels, too, to
>>>>>> alleviate the pain with excessively long label names.
>>>>>>
>>>>>> Or most succinctly: A robust and really good solution for "big"
>>>>>> metadata in remote write will make remote write much more efficient
>>>>>> if applied to labels, too.
>>>>>>
>>>>>> Imagine an NALSD tech interview question that boils down to "design
>>>>>> Prometheus remote write".
>>>>>> I bet that most of the better candidates will recognize that most
>>>>>> of the payload will consist of series identifiers (call them labels
>>>>>> or whatever), and they will suggest to first transmit some kind of
>>>>>> index and from then on only transmit short series IDs. The best
>>>>>> candidates will then find out about all the problems with that: How
>>>>>> to keep the protocol stateless, how to re-sync the index, how to
>>>>>> update it if new series arrive, etc. Those are certainly all good
>>>>>> reasons why remote write as we know it does not transfer an index of
>>>>>> series IDs.
>>>>>>
>>>>>> However, my point here is that we are now discussing exactly those
>>>>>> problems when we talk about metadata transmission. Let's solve those
>>>>>> problems and apply them to remote write in general!
>>>>>>
>>>>>> Some thoughts about that:
>>>>>>
>>>>>> Current remote write essentially transfers all labels for _every_
>>>>>> sample. This works reasonably well. Even if metadata blows up the
>>>>>> data size by 5x or 10x, transferring the whole index of metadata and
>>>>>> labels should remain feasible as long as we do it less frequently
>>>>>> than once every 10 samples. It's something that could be done each
>>>>>> time a remote-write receiver connects. From then on, we "only" have
>>>>>> to track when new series (or series with new metadata) show up and
>>>>>> transfer those. (I know it's not trivial, but we are already
>>>>>> discussing possible solutions in the various design docs.) Whenever
>>>>>> a remote-write receiver gets out of sync for some reason, it can
>>>>>> simply cut the connection and start with a complete re-sync again.
>>>>>> As long as that doesn't happen more often than once every 10
>>>>>> samples, we still have a net gain. Combining this with sharding is
>>>>>> another challenge, but it doesn't appear unsolvable.
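The sender side of the "index once, then short series IDs" scheme described
above can be sketched as below. All names are invented for illustration:
assign a small integer ID per label set, send the full label set the first
time an ID is used, and replay everything after a re-sync.

```go
package main

import "fmt"

// indexedSender sketches the index-based transmission scheme: a small
// integer ID per label set, with full labels sent only on first use.
// On loss of sync the receiver cuts the connection and resync() forces
// a full replay. Illustrative names only; not a real implementation.
type indexedSender struct {
	ids  map[string]uint64 // serialized label set -> ID
	next uint64
}

// ref returns the ID for a series and reports whether the full label
// set must still accompany this sample (the receiver hasn't seen it).
func (s *indexedSender) ref(labelSet string) (id uint64, mustSendLabels bool) {
	if id, ok := s.ids[labelSet]; ok {
		return id, false
	}
	id = s.next
	s.next++
	s.ids[labelSet] = id
	return id, true
}

// resync drops all state, forcing full label sets on subsequent sends,
// e.g. after the receiver dropped the connection.
func (s *indexedSender) resync() {
	s.ids = make(map[string]uint64)
	s.next = 0
}

func main() {
	s := &indexedSender{ids: make(map[string]uint64)}
	_, first := s.ref(`{__name__="up",job="api"}`)
	id, again := s.ref(`{__name__="up",job="api"}`)
	fmt.Println(first, again, id) // prints: true false 0
}
```

The hard parts Björn lists (statelessness, re-sync, sharding) live outside
this sketch: the mapping here is per-connection state, which is exactly
what makes the approach a departure from today's stateless protocol.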
>>>>>> --
>>>>>> Björn Rabenstein
>>>>>> [PGP-ID] 0x851C3DA17D748D03
>>>>>> [email] [email protected]
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Prometheus Developers" group.
>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>> send an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/prometheus-developers/CABakzZaQGfVK5OAfKRP2nxBnp168GML5r_ok_f%3DyVeUdC6e2EQ%40mail.gmail.com
>>>>
>>>> --
>>>> Brian Brazil
>>>> www.robustperception.io
>>
>> --
>> Brian Brazil
>> www.robustperception.io

--
You received this message because you are subscribed to the Google Groups
"Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-developers/CABakzZbH6Ghod3AWhmE4H_m%3D%2BAepifQstVgt1JbPZD67x4UCTA%40mail.gmail.com.