Re: [prometheus-developers] Re: Remote Write Metadata propagation

Julien Pivotto Tue, 11 Aug 2020 03:08:09 -0700

On 11 Aug 11:05, Brian Brazil wrote:
> On Tue, 11 Aug 2020 at 04:09, Callum Styan <[email protected]> wrote:
> 
> > I'm hesitant to add anything that significantly increases the network
> > bandwidth usage or remote write while at the same time not giving users a
> > way to tune the usage to their needs.
> >
> > I agree with Brian that we don't want the protocol itself to become
> > stateful by introducing something like negotiation. I'd also prefer not to
> > introduce multiple ways to do things, though I'm hoping we can find a way
> > to accommodate your use case while not ballooning average users network
> > egress bill.
> >
> > I am fine with forcing the consuming end to be somewhat stateful like in
> > the case of Josh's PR where all metadata is sent periodically and must be
> > stored by the remote storage system.
> >
> 
> 
> 
> > Overall I'd like to see some numbers regarding current network bandwidth
> > of remote write, remote write with metadata via Josh's PR, and remote write
> > with sending metadata for every series in a remote write payload.
> >
> 
> I agree, I noticed that in Rob's PR and had the same thought.


Remote bandwidth are likely to affect only people using remote write. 

Getting a view on the on-disk size of the WAL would be great too, as
that will affect everyone.

> 
> Brian
> 
> 
> >
> > Rob, I'll review your PR tomorrow but it looks like Julien and Brian may
> > already have that covered.
> >
> > On Sun, Aug 9, 2020 at 9:36 PM Rob Skillington <[email protected]>
> > wrote:
> >
> >> Update: The PR now sends the fields over remote write from the WAL and
> >> metadata
> >> is also updated in the WAL when any field changes.
> >>
> >> Now opened the PR against the primary repo:
> >> https://github.com/prometheus/prometheus/pull/7771
> >>
> >> I have tested this end-to-end with a modified M3 branch:
> >> https://github.com/m3db/m3/compare/r/test-prometheus-metadata
> >> > {... "msg":"received
> >> series","labels":"{__name__="prometheus_rule_group_...
> >> > iterations_total",instance="localhost:9090",job="prometheus01",role=...
> >> > "remote"}","type":"counter","unit":"","help":"The total number of
> >> scheduled...
> >> > rule group evaluations, whether executed or missed."}
> >>
> >> Tests still haven't been updated. Please any feedback on the approach /
> >> data structures would be greatly appreciated.
> >>
> >> Would be good to know what others thoughts are on next steps.
> >>
> >> On Sat, Aug 8, 2020 at 11:21 AM Rob Skillington <[email protected]>
> >> wrote:
> >>
> >>> Here's a draft PR that builds that propagates metadata to the WAL and
> >>> the WAL
> >>> reader can read it back:
> >>> https://github.com/robskillington/prometheus/pull/1/files
> >>>
> >>> Would like a little bit of feedback before on the datatypes and
> >>> structure going
> >>> further if folks are open to that.
> >>>
> >>> There's a few things not happening:
> >>> - Remote write queue manager does not use or send these extra fields yet.
> >>> - Head does not reset the "metadata" slice (not sure where "series"
> >>> slice is
> >>>   reset in the head for pending series writes to WAL, want to do in same
> >>> place).
> >>> - Metadata is not re-written on change yet.
> >>> - Tests.
> >>>
> >>>
> >>> On Sat, Aug 8, 2020 at 9:37 AM Rob Skillington <[email protected]>
> >>> wrote:
> >>>
> >>>> Sounds good, I've updated the proposal with details on places in which
> >>>> changes
> >>>> are required given the new approach:
> >>>>
> >>>> https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit#
> >>>>
> >>>>
> >>>> On Fri, Aug 7, 2020 at 2:09 PM Brian Brazil <
> >>>> [email protected]> wrote:
> >>>>
> >>>>> On Fri, 7 Aug 2020 at 15:48, Rob Skillington <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> True - I mean this could also be a blacklist by config perhaps, so if
> >>>>>> you
> >>>>>> really don't want to have increased egress you can optionally turn
> >>>>>> off sending
> >>>>>> the TYPE, HELP, UNIT or send them at different frequencies via
> >>>>>> config. We could
> >>>>>> package some sensible defaults so folks don't need to update their
> >>>>>> config.
> >>>>>>
> >>>>>> The main intention is to enable these added features and make it
> >>>>>> possible for
> >>>>>> various consumers to be able to adjust some of these parameters if
> >>>>>> required
> >>>>>> since backends can be so different in their implementation. For M3 I
> >>>>>> would be
> >>>>>> totally fine with the extra egress that should be mitigated fairly
> >>>>>> considerably
> >>>>>> by Snappy and the fact that HELP is common across certain metric
> >>>>>> families and
> >>>>>> receiving it every single Remote Write request.
> >>>>>>
> >>>>>
> >>>>> That's really a micro-optimisation. If you are that worried about
> >>>>> bandwidth you'd run a sidecar specific to your remote backend that was
> >>>>> stateful and far more efficient overall. Sending the full label names 
> >>>>> and
> >>>>> values on every request is going to be far more than the overhead of
> >>>>> metadata on top of that, so I don't see a need as it stands for any of 
> >>>>> this
> >>>>> to be configurable.
> >>>>>
> >>>>> Brian
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> On Fri, Aug 7, 2020 at 3:56 AM Brian Brazil <
> >>>>>> [email protected]> wrote:
> >>>>>>
> >>>>>>> On Thu, 6 Aug 2020 at 22:58, Rob Skillington <[email protected]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hey Björn,
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks for the detailed response. I've had a few back and forths on
> >>>>>>>> this with
> >>>>>>>> Brian and Chris over IRC and CNCF Slack now too.
> >>>>>>>>
> >>>>>>>> I agree that fundamentally it seems naive to idealistically model
> >>>>>>>> this around
> >>>>>>>> per metric name. It needs to be per series given what may happen
> >>>>>>>> w.r.t.
> >>>>>>>> collision across targets, etc.
> >>>>>>>>
> >>>>>>>> Perhaps we can separate these discussions apart into two
> >>>>>>>> considerations:
> >>>>>>>>
> >>>>>>>> 1) Modeling of the data such that it is kept around for
> >>>>>>>> transmission (primarily
> >>>>>>>> we're focused on WAL here).
> >>>>>>>>
> >>>>>>>> 2) Transmission (and of which you allude to has many areas for
> >>>>>>>> improvement).
> >>>>>>>>
> >>>>>>>> For (1) - it seems like this needs to be done per time series,
> >>>>>>>> thankfully we
> >>>>>>>> actually already have modeled this to be stored per series data
> >>>>>>>> just once in a
> >>>>>>>> single WAL file. I will write up my proposal here, but it will
> >>>>>>>> surmount to
> >>>>>>>> essentially encoding the HELP, UNIT and TYPE to the WAL per series
> >>>>>>>> similar to
> >>>>>>>> how labels for a series are encoded once per series in the WAL.
> >>>>>>>> Since this
> >>>>>>>> optimization is in place, there's already a huge dampening effect
> >>>>>>>> on how
> >>>>>>>> expensive it is to write out data about a series (e.g. labels). We
> >>>>>>>> can always
> >>>>>>>> go and collect a sample WAL file and measure how much extra size
> >>>>>>>> with/without
> >>>>>>>> HELP, UNIT and TYPE this would add, but it seems like it won't
> >>>>>>>> fundamentally
> >>>>>>>> change the order of magnitude in terms of "information about a
> >>>>>>>> timeseries
> >>>>>>>> storage size" vs "datapoints about a timeseries storage size". One
> >>>>>>>> extra change
> >>>>>>>> would be re-encoding the series into the WAL if the HELP changed
> >>>>>>>> for that
> >>>>>>>> series, just so that when HELP does change it can be up to date
> >>>>>>>> from the view
> >>>>>>>> of whoever is reading the WAL (i.e. the Remote Write loop). Since
> >>>>>>>> this entry
> >>>>>>>> needs to be loaded into memory for Remote Write today anyway, with
> >>>>>>>> string
> >>>>>>>> interning as suggested by Chris, it won't change the memory profile
> >>>>>>>> algorithmically of a Prometheus with Remote Write enabled. There
> >>>>>>>> will be some
> >>>>>>>> overhead that at most would likely be similar to the label data,
> >>>>>>>> but we aren't
> >>>>>>>> altering data structures (so won't change big-O magnitude of memory
> >>>>>>>> being used),
> >>>>>>>> we're adding fields to existing data structures that exist and
> >>>>>>>> string interning
> >>>>>>>> should actually make it much less onerous since there is a large
> >>>>>>>> duplicative
> >>>>>>>> effect with HELP among time series.
> >>>>>>>>
> >>>>>>>> For (2) - now we have basically TYPE, HELP and UNIT all available
> >>>>>>>> for
> >>>>>>>> transmission if we wanted to send it with every single datapoint.
> >>>>>>>> While I think
> >>>>>>>> we should definitely examine HPACK like compression features as you
> >>>>>>>> mentioned
> >>>>>>>> Björn, I think we should think more about separating that kind of
> >>>>>>>> work into a
> >>>>>>>> Milestone 2 where this is considered.
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> For the time being it's very plausible
> >>>>>>>> we could do some negotiation of the receiving Remote Write endpoint
> >>>>>>>> by sending
> >>>>>>>> a "GET" to the remote write endpoint and seeing if it responds with
> >>>>>>>> a
> >>>>>>>> "capabilities + preferences" response, and if the endpoint
> >>>>>>>> specifies that it
> >>>>>>>> would like to receive metadata all the time on every single request
> >>>>>>>> and let
> >>>>>>>> Snappy take care of keeping size not ballooning too much, or if it
> >>>>>>>> would like
> >>>>>>>> TYPE on every single datapoint, and HELP and UNIT every
> >>>>>>>> DESIRED_SECONDS or so.
> >>>>>>>> To enable a "send HELP every 10 minutes" feature we would have to
> >>>>>>>> add to the
> >>>>>>>> datastructure that holds the LABELS, TYPE, HELP and UNIT for each
> >>>>>>>> series a
> >>>>>>>> "last sent" timestamp to know when to resend to that backend, but
> >>>>>>>> that seems
> >>>>>>>> entirely plausible and would not use more than 4 extra bytes.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Negotiation is fundamentally stateful, as the process that receives
> >>>>>>> the first request may be a very different one from the one that 
> >>>>>>> receives
> >>>>>>> the second - such as if an upgrade is in progress. Remote write is 
> >>>>>>> intended
> >>>>>>> to be a very simple thing that's easy to implement on the receiver 
> >>>>>>> end and
> >>>>>>> is a send-only request-based protocol, so request-time negotiation is
> >>>>>>> basically out. Any negotiation needs to happen via the config file, 
> >>>>>>> and
> >>>>>>> even then it'd be better if nothing ever needed to be configured. 
> >>>>>>> Getting
> >>>>>>> all the users of a remote write to change their config file or 
> >>>>>>> restart all
> >>>>>>> their Prometheus servers is not an easy task after all.
> >>>>>>>
> >>>>>>> Brian
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> These thoughts are based on the discussion I've had and the
> >>>>>>>> thoughts on this
> >>>>>>>> thread. What's the feedback on this before I go ahead and
> >>>>>>>> re-iterate the design
> >>>>>>>> to more closely map to what I'm suggesting here?
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Rob
> >>>>>>>>
> >>>>>>>> On Thu, Aug 6, 2020 at 2:01 PM Bjoern Rabenstein <
> >>>>>>>> [email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> On 03.08.20 03:04, Rob Skillington wrote:
> >>>>>>>>> > Ok - I have a proposal which could be broken up into two pieces,
> >>>>>>>>> first
> >>>>>>>>> > delivering TYPE per datapoint, the second consistently and
> >>>>>>>>> reliably HELP and
> >>>>>>>>> > UNIT once per unique metric name:
> >>>>>>>>> >
> >>>>>>>>> https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo
> >>>>>>>>> > /edit#heading=h.bik9uwphqy3g
> >>>>>>>>>
> >>>>>>>>> Thanks for the doc. I have commented on it, but while doing so, I
> >>>>>>>>> felt
> >>>>>>>>> the urge to comment more generally, which would not fit well into
> >>>>>>>>> the
> >>>>>>>>> margin of a Google doc. My thoughts are also a bit out of scope of
> >>>>>>>>> Rob's design doc and more about the general topic of remote write
> >>>>>>>>> and
> >>>>>>>>> the equally general topic of metadata (about which we have an
> >>>>>>>>> ongoing
> >>>>>>>>> discussion among the Prometheus developers).
> >>>>>>>>>
> >>>>>>>>> Disclaimer: I don't know the remote-write protocol very well. My
> >>>>>>>>> hope
> >>>>>>>>> here is that my somewhat distant perspective is of some value as it
> >>>>>>>>> allows to take a step back. However, I might just miss crucial
> >>>>>>>>> details
> >>>>>>>>> that completely invalidate my thoughts. We'll see...
> >>>>>>>>>
> >>>>>>>>> I do care a lot about metadata, though. (And ironically, the reason
> >>>>>>>>> why I declared remote write "somebody else's problem" is that I've
> >>>>>>>>> always disliked how it fundamentally ignores metadata.)
> >>>>>>>>>
> >>>>>>>>> Rob's document embraces the fact that metadata can change over
> >>>>>>>>> time,
> >>>>>>>>> but it assumes that at any given time, there is only one set of
> >>>>>>>>> metadata per unique metric name. It takes into account that there
> >>>>>>>>> can
> >>>>>>>>> be drift, but it considers them an irregularity that will only
> >>>>>>>>> happen
> >>>>>>>>> occasionally and iron out over time.
> >>>>>>>>>
> >>>>>>>>> In practice, however, metadata can be legitimately and deliberately
> >>>>>>>>> different for different time series of the same name.
> >>>>>>>>> Instrumentation
> >>>>>>>>> libraries and even the exposition format inherently require one
> >>>>>>>>> set of
> >>>>>>>>> metadata per metric name, but this is all only enforced (and meant
> >>>>>>>>> to
> >>>>>>>>> be enforced) _per target_. Once the samples are ingested (or even
> >>>>>>>>> sent
> >>>>>>>>> onwards via remote write), they have no notion of what target they
> >>>>>>>>> came from. Furthermore, samples created by rule evaluation don't
> >>>>>>>>> have
> >>>>>>>>> an originating target in the first place. (Which raises the
> >>>>>>>>> question
> >>>>>>>>> of metadata for recording rules, which is another can of worms I'd
> >>>>>>>>> like to open eventually...)
> >>>>>>>>>
> >>>>>>>>> (There is also the technical difficulty that the WAL has no notion
> >>>>>>>>> of
> >>>>>>>>> bundling or referencing all the series with the same metric name.
> >>>>>>>>> That
> >>>>>>>>> was commented about in the doc but is not my focus here.)
> >>>>>>>>>
> >>>>>>>>> Rob's doc sees TYPE as special because it is so cheap to just add
> >>>>>>>>> to
> >>>>>>>>> every data point. That's correct, but it's giving me an itch:
> >>>>>>>>> Should
> >>>>>>>>> we really create different ways of handling metadata, depending on
> >>>>>>>>> its
> >>>>>>>>> expected size?
> >>>>>>>>>
> >>>>>>>>> Compare this with labels. There is no upper limit to their number
> >>>>>>>>> or
> >>>>>>>>> size. Still, we have no plan of treating "large" labels differently
> >>>>>>>>> from "short" labels.
> >>>>>>>>>
> >>>>>>>>> On top of that, we have by now gained the insight that metadata is
> >>>>>>>>> changing over time and essentially has to be tracked per series.
> >>>>>>>>>
> >>>>>>>>> Or in other words: From a pure storage perspective, metadata
> >>>>>>>>> behaves
> >>>>>>>>> exactly the same as labels! (There are certainly huge differences
> >>>>>>>>> semantically, but those only manifest themselves on the query
> >>>>>>>>> level,
> >>>>>>>>> i.e. how you treat it in PromQL etc.)
> >>>>>>>>>
> >>>>>>>>> (This is not exactly a new insight. This is more or less what I
> >>>>>>>>> said
> >>>>>>>>> during the 2016 dev summit, when we first discussed remote write.
> >>>>>>>>> But
> >>>>>>>>> I don't want to dwell on "told you so" moments... :o)
> >>>>>>>>>
> >>>>>>>>> There is a good reason why we don't just add metadata as "pseudo
> >>>>>>>>> labels": As discussed a lot in the various design docs including
> >>>>>>>>> Rob's
> >>>>>>>>> one, it would blow up the data size significantly because HELP
> >>>>>>>>> strings
> >>>>>>>>> tend to be relatively long.
> >>>>>>>>>
> >>>>>>>>> And that's the point where I would like to take a step back: We are
> >>>>>>>>> discussing to essentially treat something that is structurally the
> >>>>>>>>> same thing in three different ways: Way 1 for labels as we know
> >>>>>>>>> them. Way 2 for "small" metadata. Way 3 for "big" metadata.
> >>>>>>>>>
> >>>>>>>>> However, while labels tend to be shorter than HELP strings, there
> >>>>>>>>> is
> >>>>>>>>> the occasional use case with long or many labels. (Infamously, at
> >>>>>>>>> SoundCloud, a binary accidentally put a whole HTML page into a
> >>>>>>>>> label. That wasn't a use case, it was a bug, but the Prometheus
> >>>>>>>>> server
> >>>>>>>>> ingesting that was just chugging along as if nothing special had
> >>>>>>>>> happened. It looked weird in the expression browser, though...) I'm
> >>>>>>>>> sure any vendor offering Prometheus remote storage as a service
> >>>>>>>>> will
> >>>>>>>>> have a customer or two that use excessively long label names. If we
> >>>>>>>>> have to deal with that, why not bite the bullet and treat metadata
> >>>>>>>>> in
> >>>>>>>>> the same way as labels in general? Or to phrase it in another way:
> >>>>>>>>> Any
> >>>>>>>>> solution for "big" metadata could be used for labels, too, to
> >>>>>>>>> alleviate the pain with excessively long label names.
> >>>>>>>>>
> >>>>>>>>> Or most succintly: A robust and really good solution for
> >>>>>>>>> "big" metadata in remote write will make remote write much more
> >>>>>>>>> efficient if applied to labels, too.
> >>>>>>>>>
> >>>>>>>>> Imagine an NALSD tech interview question that boils down to "design
> >>>>>>>>> Prometheus remote write". I bet that most of the better candidates
> >>>>>>>>> will recognize that most of the payload will consist of series
> >>>>>>>>> indentifiers (call them labels or whatever) and they will suggest
> >>>>>>>>> to
> >>>>>>>>> first transmit some kind of index and from then on only transmit
> >>>>>>>>> short
> >>>>>>>>> series IDs. The best candidates will then find out about all the
> >>>>>>>>> problems with that: How to keep the protocol stateless, how to
> >>>>>>>>> re-sync
> >>>>>>>>> the index, how to update it if new series arrive etc. Those are
> >>>>>>>>> certainly all good reasons why remote write as we know it does not
> >>>>>>>>> transfer an index of series IDs.
> >>>>>>>>>
> >>>>>>>>> However, my point here is that we are now discussing exactly those
> >>>>>>>>> problems when we talk about metadata transmission. Let's solve
> >>>>>>>>> those
> >>>>>>>>> problems and apply them to remote write in general!
> >>>>>>>>>
> >>>>>>>>> Some thoughts about that:
> >>>>>>>>>
> >>>>>>>>> Current remote write essentially transfers all labels for _every_
> >>>>>>>>> sample. This works reasonably well. Even if metadata blows up the
> >>>>>>>>> data
> >>>>>>>>> size by 5x or 10x, transfering the whole index of metadata and
> >>>>>>>>> labels
> >>>>>>>>> should remain feasible as long as we do it less frequently than
> >>>>>>>>> once
> >>>>>>>>> every 10 samples. It's something that could be done each time a
> >>>>>>>>> remote-write receiver connects. From then on, we "only" have to
> >>>>>>>>> track
> >>>>>>>>> when new series (or series with new metadata) show up and transfer
> >>>>>>>>> those. (I know it's not trivial, but we are already discussing
> >>>>>>>>> possible solutions in the various design docs.) Whenever a
> >>>>>>>>> remote-write receiver gets out of sync for some reason, it can
> >>>>>>>>> simply
> >>>>>>>>> cut the connection and start with a complete re-sync again. As
> >>>>>>>>> long as
> >>>>>>>>> that doesn't happen more often than once every 10 samples, we still
> >>>>>>>>> have a net gain. Combining this with sharding is another challenge,
> >>>>>>>>> but it doesn't appear unsolveable.
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Björn Rabenstein
> >>>>>>>>> [PGP-ID] 0x851C3DA17D748D03
> >>>>>>>>> [email] [email protected]
> >>>>>>>>>
> >>>>>>>> --
> >>>>>>>> You received this message because you are subscribed to the Google
> >>>>>>>> Groups "Prometheus Developers" group.
> >>>>>>>> To unsubscribe from this group and stop receiving emails from it,
> >>>>>>>> send an email to [email protected]
> >>>>>>>> .
> >>>>>>>> To view this discussion on the web visit
> >>>>>>>> https://groups.google.com/d/msgid/prometheus-developers/CABakzZaQGfVK5OAfKRP2nxBnp168GML5r_ok_f%3DyVeUdC6e2EQ%40mail.gmail.com
> >>>>>>>> <https://groups.google.com/d/msgid/prometheus-developers/CABakzZaQGfVK5OAfKRP2nxBnp168GML5r_ok_f%3DyVeUdC6e2EQ%40mail.gmail.com?utm_medium=email&utm_source=footer>
> >>>>>>>> .
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Brian Brazil
> >>>>>>> www.robustperception.io
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Brian Brazil
> >>>>> www.robustperception.io
> >>>>>
> >>>> --
> >> You received this message because you are subscribed to the Google Groups
> >> "Prometheus Developers" group.
> >> To unsubscribe from this group and stop receiving emails from it, send an
> >> email to [email protected].
> >> To view this discussion on the web visit
> >> https://groups.google.com/d/msgid/prometheus-developers/CABakzZb%2BX-ErewAKEyg54_FVRmTSypbnNFmR-8ZayfU_WiTMFw%40mail.gmail.com
> >> <https://groups.google.com/d/msgid/prometheus-developers/CABakzZb%2BX-ErewAKEyg54_FVRmTSypbnNFmR-8ZayfU_WiTMFw%40mail.gmail.com?utm_medium=email&utm_source=footer>
> >> .
> >>
> >
> 
> -- 
> Brian Brazil
> www.robustperception.io
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "Prometheus Developers" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/prometheus-developers/CAHJKeLouK0PKQMpmuWibEs3%3DDyrEXfN%2BbiUygfak4S_h0k30pw%40mail.gmail.com.

-- 
Julien Pivotto
@roidelapluie

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-developers/20200811100748.GA255405%40oxygen.

Re: [prometheus-developers] Re: Remote Write Metadata propagation

Reply via email to