On Wed, 19 Aug 2020 at 09:47, Rob Skillington <[email protected]> wrote:

> To add a bit more detail to that example, I was actually using a fairly
> tuned remote write queue config that sent large batches, since the batch
> send deadline was set to a longer 1 minute and the max samples per send
> to 5,000. Here's that config:
> ```
> remote_write:
>   - url: http://localhost:3030/remote/write
>     remote_timeout: 30s
>     queue_config:
>       capacity: 10000
>       max_shards: 10
>       min_shards: 3
>       max_samples_per_send: 5000
>       batch_send_deadline: 1m
>       min_backoff: 50ms
>       max_backoff: 1s
> ```
>
> Using the default config we get worse bandwidth utilization for both the
> before and after numbers, but the delta is smaller:
> - steady state ~177kb/sec without this change
> - steady state ~210kb/sec with this change
> - roughly 20% increase
>

I think 20% is okay all things considered.

Brian


>
> Using config:
> ```
> remote_write:
>   - url: http://localhost:3030/remote/write
>     remote_timeout: 30s
> ```
>
> Implicitly, the values for this config are:
> - min shards 1
> - max shards 1000
> - max samples per send 100
> - capacity 500
> - batch send deadline 5s
> - min backoff 30ms
> - max backoff 100ms
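>
> Spelled out, the same implicit defaults as an explicit queue_config (just
> a restatement of the values above):
> ```
> remote_write:
>   - url: http://localhost:3030/remote/write
>     remote_timeout: 30s
>     queue_config:
>       capacity: 500
>       max_shards: 1000
>       min_shards: 1
>       max_samples_per_send: 100
>       batch_send_deadline: 5s
>       min_backoff: 30ms
>       max_backoff: 100ms
> ```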
>
> On Wed, Aug 19, 2020 at 4:26 AM Brian Brazil <
> [email protected]> wrote:
>
>> On Wed, 19 Aug 2020 at 09:20, Rob Skillington <[email protected]>
>> wrote:
>>
>>> Here's the results from testing:
>>> - node_exporter exporting 309 metrics each by turning on a lot of
>>> optional
>>>   collectors, all have help set, very few have unit set
>>> - running 8 on the host at 1s scrape interval, each with unique instance
>>> label
>>> - steady state ~137kb/sec without this change
>>> - steady state ~172kb/sec with this change
>>> - roughly 30% increase
>>>
>>> Graph here:
>>> https://github.com/prometheus/prometheus/pull/7771#issuecomment-675923976
>>>
>>> How do we want to proceed? This could be fairly close to the higher end
>>> of
>>> the spectrum in terms of expected increase given the node_exporter
>>> metrics
>>> density and fairly verbose metadata.
>>>
>>> Even having said that, however, 30% is a fairly big increase and a
>>> relatively large egress cost to have to swallow without any way to back
>>> out of this behavior.
>>>
>>> What do folks think of next steps?
>>>
>>
>> It is on the high end, however this is going to be among the worst cases
>> as there's not going to be a lot of per-metric cardinality from the node
>> exporter. I bet if you greatly increased the number of targets (and reduced
>> the scrape interval to compensate) it'd be more reasonable. I think this is
>> just about okay.
>>
>> Brian
>>
>>
>>>
>>>
>>> On Tue, Aug 11, 2020 at 11:55 AM Rob Skillington <[email protected]>
>>> wrote:
>>>
>>>> Agreed - I'll see what I can do about getting some numbers for a
>>>> workload collecting cAdvisor metrics, which seems to have a significant
>>>> amount of HELP set:
>>>>
>>>> https://github.com/google/cadvisor/blob/8450c56c21bc5406e2df79a2162806b9a23ebd34/metrics/testdata/prometheus_metrics
>>>>
>>>>
>>>> On Tue, Aug 11, 2020 at 6:15 AM Brian Brazil <
>>>> [email protected]> wrote:
>>>>
>>>>> On Tue, 11 Aug 2020 at 11:07, Julien Pivotto <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> On 11 Aug 11:05, Brian Brazil wrote:
>>>>>> > On Tue, 11 Aug 2020 at 04:09, Callum Styan <[email protected]>
>>>>>> wrote:
>>>>>> >
>>>>>> > > I'm hesitant to add anything that significantly increases the
>>>>>> network
>>>>>> > > bandwidth usage or remote write while at the same time not giving
>>>>>> users a
>>>>>> > > way to tune the usage to their needs.
>>>>>> > >
>>>>>> > > I agree with Brian that we don't want the protocol itself to
>>>>>> become
>>>>>> > > stateful by introducing something like negotiation. I'd also
>>>>>> prefer not to
>>>>>> > > introduce multiple ways to do things, though I'm hoping we can
>>>>>> find a way
>>>>>> > > to accommodate your use case while not ballooning the average
>>>>>> > > user's network egress bill.
>>>>>> > >
>>>>>> > > I am fine with forcing the consuming end to be somewhat stateful
>>>>>> like in
>>>>>> > > the case of Josh's PR where all metadata is sent periodically and
>>>>>> must be
>>>>>> > > stored by the remote storage system.
>>>>>> > >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > > Overall I'd like to see some numbers regarding current network
>>>>>> bandwidth
>>>>>> > > of remote write, remote write with metadata via Josh's PR, and
>>>>>> remote write
>>>>>> > > with sending metadata for every series in a remote write payload.
>>>>>> > >
>>>>>> >
>>>>>> > I agree, I noticed that in Rob's PR and had the same thought.
>>>>>>
>>>>>> Remote write bandwidth is likely to affect only people using remote
>>>>>> write.
>>>>>>
>>>>>> Getting a view on the on-disk size of the WAL would be great too, as
>>>>>> that will affect everyone.
>>>>>>
>>>>>
>>>>> I'm not worried about that, it's only really on series creation so
>>>>> won't be noticed unless you have really high levels of churn.
>>>>>
>>>>> Brian
>>>>>
>>>>>
>>>>>>
>>>>>> >
>>>>>> > Brian
>>>>>> >
>>>>>> >
>>>>>> > >
>>>>>> > > Rob, I'll review your PR tomorrow but it looks like Julien and
>>>>>> Brian may
>>>>>> > > already have that covered.
>>>>>> > >
>>>>>> > > On Sun, Aug 9, 2020 at 9:36 PM Rob Skillington <
>>>>>> [email protected]>
>>>>>> > > wrote:
>>>>>> > >
>>>>>> > >> Update: The PR now sends the fields over remote write from the
>>>>>> WAL and
>>>>>> > >> metadata
>>>>>> > >> is also updated in the WAL when any field changes.
>>>>>> > >>
>>>>>> > >> Now opened the PR against the primary repo:
>>>>>> > >> https://github.com/prometheus/prometheus/pull/7771
>>>>>> > >>
>>>>>> > >> I have tested this end-to-end with a modified M3 branch:
>>>>>> > >> https://github.com/m3db/m3/compare/r/test-prometheus-metadata
>>>>>> > >> > {... "msg":"received series","labels":"{__name__="prometheus_rule_group_...
>>>>>> > >> > iterations_total",instance="localhost:9090",job="prometheus01",role=...
>>>>>> > >> > "remote"}","type":"counter","unit":"","help":"The total number of
>>>>>> > >> > scheduled rule group evaluations, whether executed or missed."}
>>>>>> > >>
>>>>>> > >> Tests still haven't been updated. Please any feedback on the
>>>>>> approach /
>>>>>> > >> data structures would be greatly appreciated.
>>>>>> > >>
>>>>>> > >> Would be good to know what others thoughts are on next steps.
>>>>>> > >>
>>>>>> > >> On Sat, Aug 8, 2020 at 11:21 AM Rob Skillington <
>>>>>> [email protected]>
>>>>>> > >> wrote:
>>>>>> > >>
>>>>>> > >>> Here's a draft PR that propagates metadata to the WAL, and the
>>>>>> > >>> WAL reader can read it back:
>>>>>> > >>> https://github.com/robskillington/prometheus/pull/1/files
>>>>>> > >>>
>>>>>> > >>> Would like a little bit of feedback on the datatypes and
>>>>>> > >>> structure before going further, if folks are open to that.
>>>>>> > >>>
>>>>>> > >>> There are a few things not happening yet:
>>>>>> > >>> - Remote write queue manager does not use or send these extra
>>>>>> fields yet.
>>>>>> > >>> - Head does not reset the "metadata" slice (not sure where
>>>>>> "series"
>>>>>> > >>> slice is
>>>>>> > >>>   reset in the head for pending series writes to WAL, want to
>>>>>> do in same
>>>>>> > >>> place).
>>>>>> > >>> - Metadata is not re-written on change yet.
>>>>>> > >>> - Tests.
>>>>>> > >>>
>>>>>> > >>>
>>>>>> > >>> On Sat, Aug 8, 2020 at 9:37 AM Rob Skillington <
>>>>>> [email protected]>
>>>>>> > >>> wrote:
>>>>>> > >>>
>>>>>> > >>>> Sounds good, I've updated the proposal with details on places
>>>>>> in which
>>>>>> > >>>> changes
>>>>>> > >>>> are required given the new approach:
>>>>>> > >>>>
>>>>>> > >>>>
>>>>>> https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit#
>>>>>> > >>>>
>>>>>> > >>>>
>>>>>> > >>>> On Fri, Aug 7, 2020 at 2:09 PM Brian Brazil <
>>>>>> > >>>> [email protected]> wrote:
>>>>>> > >>>>
>>>>>> > >>>>> On Fri, 7 Aug 2020 at 15:48, Rob Skillington <
>>>>>> [email protected]>
>>>>>> > >>>>> wrote:
>>>>>> > >>>>>
>>>>>> > >>>>>> True - I mean this could also be a blacklist by config
>>>>>> perhaps, so if
>>>>>> > >>>>>> you
>>>>>> > >>>>>> really don't want to have increased egress you can
>>>>>> optionally turn
>>>>>> > >>>>>> off sending
>>>>>> > >>>>>> the TYPE, HELP, UNIT or send them at different frequencies
>>>>>> via
>>>>>> > >>>>>> config. We could
>>>>>> > >>>>>> package some sensible defaults so folks don't need to update
>>>>>> their
>>>>>> > >>>>>> config.
>>>>>> > >>>>>>
>>>>>> > >>>>>> The main intention is to enable these added features and
>>>>>> make it
>>>>>> > >>>>>> possible for
>>>>>> > >>>>>> various consumers to be able to adjust some of these
>>>>>> parameters if
>>>>>> > >>>>>> required
>>>>>> > >>>>>> since backends can be so different in their implementation.
>>>>>> For M3 I
>>>>>> > >>>>>> would be
>>>>>> > >>>>>> totally fine with the extra egress that should be mitigated
>>>>>> fairly
>>>>>> > >>>>>> considerably
>>>>>> > >>>>>> by Snappy and the fact that HELP is common across certain
>>>>>> > >>>>>> metric families, even when receiving it on every single
>>>>>> > >>>>>> Remote Write request.
>>>>>> > >>>>>>
>>>>>> > >>>>>
>>>>>> > >>>>> That's really a micro-optimisation. If you are that worried
>>>>>> about
>>>>>> > >>>>> bandwidth you'd run a sidecar specific to your remote backend
>>>>>> that was
>>>>>> > >>>>> stateful and far more efficient overall. Sending the full
>>>>>> label names and
>>>>>> > >>>>> values on every request is going to be far more than the
>>>>>> overhead of
>>>>>> > >>>>> metadata on top of that, so I don't see a need as it stands
>>>>>> for any of this
>>>>>> > >>>>> to be configurable.
>>>>>> > >>>>>
>>>>>> > >>>>> Brian
>>>>>> > >>>>>
>>>>>> > >>>>>
>>>>>> > >>>>>>
>>>>>> > >>>>>> On Fri, Aug 7, 2020 at 3:56 AM Brian Brazil <
>>>>>> > >>>>>> [email protected]> wrote:
>>>>>> > >>>>>>
>>>>>> > >>>>>>> On Thu, 6 Aug 2020 at 22:58, Rob Skillington <
>>>>>> [email protected]>
>>>>>> > >>>>>>> wrote:
>>>>>> > >>>>>>>
>>>>>> > >>>>>>>> Hey Björn,
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> Thanks for the detailed response. I've had a few back and
>>>>>> forths on
>>>>>> > >>>>>>>> this with
>>>>>> > >>>>>>>> Brian and Chris over IRC and CNCF Slack now too.
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> I agree that fundamentally it seems naive to
>>>>>> idealistically model
>>>>>> > >>>>>>>> this around
>>>>>> > >>>>>>>> per metric name. It needs to be per series given what may
>>>>>> happen
>>>>>> > >>>>>>>> w.r.t.
>>>>>> > >>>>>>>> collision across targets, etc.
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> Perhaps we can separate these discussions apart into two
>>>>>> > >>>>>>>> considerations:
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> 1) Modeling of the data such that it is kept around for
>>>>>> > >>>>>>>> transmission (primarily
>>>>>> > >>>>>>>> we're focused on WAL here).
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> 2) Transmission (and of which you allude to has many areas
>>>>>> for
>>>>>> > >>>>>>>> improvement).
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> For (1) - it seems like this needs to be done per time
>>>>>> series,
>>>>>> > >>>>>>>> thankfully we
>>>>>> > >>>>>>>> actually already have modeled this to be stored per series
>>>>>> data
>>>>>> > >>>>>>>> just once in a
>>>>>> > >>>>>>>> single WAL file. I will write up my proposal here, but it
>>>>>> > >>>>>>>> will amount to
>>>>>> > >>>>>>>> essentially encoding the HELP, UNIT and TYPE to the WAL
>>>>>> per series
>>>>>> > >>>>>>>> similar to
>>>>>> > >>>>>>>> how labels for a series are encoded once per series in the
>>>>>> WAL.
>>>>>> > >>>>>>>> Since this
>>>>>> > >>>>>>>> optimization is in place, there's already a huge dampening
>>>>>> effect
>>>>>> > >>>>>>>> on how
>>>>>> > >>>>>>>> expensive it is to write out data about a series (e.g.
>>>>>> labels). We
>>>>>> > >>>>>>>> can always
>>>>>> > >>>>>>>> go and collect a sample WAL file and measure how much
>>>>>> extra size
>>>>>> > >>>>>>>> with/without
>>>>>> > >>>>>>>> HELP, UNIT and TYPE this would add, but it seems like it
>>>>>> won't
>>>>>> > >>>>>>>> fundamentally
>>>>>> > >>>>>>>> change the order of magnitude in terms of "information
>>>>>> about a
>>>>>> > >>>>>>>> timeseries
>>>>>> > >>>>>>>> storage size" vs "datapoints about a timeseries storage
>>>>>> size". One
>>>>>> > >>>>>>>> extra change
>>>>>> > >>>>>>>> would be re-encoding the series into the WAL if the HELP
>>>>>> changed
>>>>>> > >>>>>>>> for that
>>>>>> > >>>>>>>> series, just so that when HELP does change it can be up to
>>>>>> date
>>>>>> > >>>>>>>> from the view
>>>>>> > >>>>>>>> of whoever is reading the WAL (i.e. the Remote Write
>>>>>> loop). Since
>>>>>> > >>>>>>>> this entry
>>>>>> > >>>>>>>> needs to be loaded into memory for Remote Write today
>>>>>> anyway, with
>>>>>> > >>>>>>>> string
>>>>>> > >>>>>>>> interning as suggested by Chris, it won't change the
>>>>>> memory profile
>>>>>> > >>>>>>>> algorithmically of a Prometheus with Remote Write enabled.
>>>>>> There
>>>>>> > >>>>>>>> will be some
>>>>>> > >>>>>>>> overhead that at most would likely be similar to the label
>>>>>> data,
>>>>>> > >>>>>>>> but we aren't
>>>>>> > >>>>>>>> altering data structures (so won't change big-O magnitude
>>>>>> of memory
>>>>>> > >>>>>>>> being used),
>>>>>> > >>>>>>>> we're adding fields to existing data structures that exist
>>>>>> and
>>>>>> > >>>>>>>> string interning
>>>>>> > >>>>>>>> should actually make it much less onerous since there is a
>>>>>> large
>>>>>> > >>>>>>>> duplicative
>>>>>> > >>>>>>>> effect with HELP among time series.
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> For (2) - now we have basically TYPE, HELP and UNIT all
>>>>>> available
>>>>>> > >>>>>>>> for
>>>>>> > >>>>>>>> transmission if we wanted to send it with every single
>>>>>> datapoint.
>>>>>> > >>>>>>>> While I think
>>>>>> > >>>>>>>> we should definitely examine HPACK like compression
>>>>>> features as you
>>>>>> > >>>>>>>> mentioned
>>>>>> > >>>>>>>> Björn, I think we should think more about separating that
>>>>>> kind of
>>>>>> > >>>>>>>> work into a
>>>>>> > >>>>>>>> Milestone 2 where this is considered.
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>
>>>>>> > >>>>>>>
>>>>>> > >>>>>>>
>>>>>> > >>>>>>>> For the time being it's very plausible
>>>>>> > >>>>>>>> we could do some negotiation of the receiving Remote Write
>>>>>> endpoint
>>>>>> > >>>>>>>> by sending
>>>>>> > >>>>>>>> a "GET" to the remote write endpoint and seeing if it
>>>>>> responds with
>>>>>> > >>>>>>>> a
>>>>>> > >>>>>>>> "capabilities + preferences" response, and if the endpoint
>>>>>> > >>>>>>>> specifies that it
>>>>>> > >>>>>>>> would like to receive metadata all the time on every
>>>>>> single request
>>>>>> > >>>>>>>> and let
>>>>>> > >>>>>>>> Snappy take care of keeping size not ballooning too much,
>>>>>> or if it
>>>>>> > >>>>>>>> would like
>>>>>> > >>>>>>>> TYPE on every single datapoint, and HELP and UNIT every
>>>>>> > >>>>>>>> DESIRED_SECONDS or so.
>>>>>> > >>>>>>>> To enable a "send HELP every 10 minutes" feature we would
>>>>>> have to
>>>>>> > >>>>>>>> add to the
>>>>>> > >>>>>>>> datastructure that holds the LABELS, TYPE, HELP and UNIT
>>>>>> for each
>>>>>> > >>>>>>>> series a
>>>>>> > >>>>>>>> "last sent" timestamp to know when to resend to that
>>>>>> backend, but
>>>>>> > >>>>>>>> that seems
>>>>>> > >>>>>>>> entirely plausible and would not use more than 4 extra
>>>>>> bytes.
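>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> As a rough sketch only (every field name below is
>>>>>> > >>>>>>>> hypothetical, purely for illustration), such a
>>>>>> > >>>>>>>> "capabilities + preferences" response could look like:
>>>>>> > >>>>>>>> ```
>>>>>> > >>>>>>>> {
>>>>>> > >>>>>>>>   "metadata_preferences": {
>>>>>> > >>>>>>>>     "send_type": "per_datapoint",
>>>>>> > >>>>>>>>     "send_help_interval_seconds": 600,
>>>>>> > >>>>>>>>     "send_unit_interval_seconds": 600
>>>>>> > >>>>>>>>   }
>>>>>> > >>>>>>>> }
>>>>>> > >>>>>>>> ```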
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>
>>>>>> > >>>>>>> Negotiation is fundamentally stateful, as the process that
>>>>>> receives
>>>>>> > >>>>>>> the first request may be a very different one from the one
>>>>>> that receives
>>>>>> > >>>>>>> the second - such as if an upgrade is in progress. Remote
>>>>>> write is intended
>>>>>> > >>>>>>> to be a very simple thing that's easy to implement on the
>>>>>> receiver end and
>>>>>> > >>>>>>> is a send-only request-based protocol, so request-time
>>>>>> negotiation is
>>>>>> > >>>>>>> basically out. Any negotiation needs to happen via the
>>>>>> config file, and
>>>>>> > >>>>>>> even then it'd be better if nothing ever needed to be
>>>>>> configured. Getting
>>>>>> > >>>>>>> all the users of a remote write to change their config file
>>>>>> or restart all
>>>>>> > >>>>>>> their Prometheus servers is not an easy task after all.
>>>>>> > >>>>>>>
>>>>>> > >>>>>>> Brian
>>>>>> > >>>>>>>
>>>>>> > >>>>>>>
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> These thoughts are based on the discussion I've had and the
>>>>>> > >>>>>>>> thoughts on this
>>>>>> > >>>>>>>> thread. What's the feedback on this before I go ahead and
>>>>>> > >>>>>>>> re-iterate the design
>>>>>> > >>>>>>>> to more closely map to what I'm suggesting here?
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> Best,
>>>>>> > >>>>>>>> Rob
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>> On Thu, Aug 6, 2020 at 2:01 PM Bjoern Rabenstein <
>>>>>> > >>>>>>>> [email protected]> wrote:
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>>> On 03.08.20 03:04, Rob Skillington wrote:
>>>>>> > >>>>>>>>> > Ok - I have a proposal which could be broken up into
>>>>>> two pieces,
>>>>>> > >>>>>>>>> first
>>>>>> > >>>>>>>>> > delivering TYPE per datapoint, the second consistently
>>>>>> and
>>>>>> > >>>>>>>>> reliably HELP and
>>>>>> > >>>>>>>>> > UNIT once per unique metric name:
>>>>>> > >>>>>>>>> >
>>>>>> > >>>>>>>>>
>>>>>> https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo
>>>>>> > >>>>>>>>> > /edit#heading=h.bik9uwphqy3g
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> Thanks for the doc. I have commented on it, but while
>>>>>> doing so, I
>>>>>> > >>>>>>>>> felt
>>>>>> > >>>>>>>>> the urge to comment more generally, which would not fit
>>>>>> well into
>>>>>> > >>>>>>>>> the
>>>>>> > >>>>>>>>> margin of a Google doc. My thoughts are also a bit out of
>>>>>> scope of
>>>>>> > >>>>>>>>> Rob's design doc and more about the general topic of
>>>>>> remote write
>>>>>> > >>>>>>>>> and
>>>>>> > >>>>>>>>> the equally general topic of metadata (about which we
>>>>>> have an
>>>>>> > >>>>>>>>> ongoing
>>>>>> > >>>>>>>>> discussion among the Prometheus developers).
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> Disclaimer: I don't know the remote-write protocol very
>>>>>> well. My
>>>>>> > >>>>>>>>> hope
>>>>>> > >>>>>>>>> here is that my somewhat distant perspective is of some
>>>>>> value as it
>>>>>> > >>>>>>>>> allows me to take a step back. However, I might just miss
>>>>>> crucial
>>>>>> > >>>>>>>>> details
>>>>>> > >>>>>>>>> that completely invalidate my thoughts. We'll see...
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> I do care a lot about metadata, though. (And ironically,
>>>>>> the reason
>>>>>> > >>>>>>>>> why I declared remote write "somebody else's problem" is
>>>>>> that I've
>>>>>> > >>>>>>>>> always disliked how it fundamentally ignores metadata.)
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> Rob's document embraces the fact that metadata can change
>>>>>> over
>>>>>> > >>>>>>>>> time,
>>>>>> > >>>>>>>>> but it assumes that at any given time, there is only one
>>>>>> set of
>>>>>> > >>>>>>>>> metadata per unique metric name. It takes into account
>>>>>> that there
>>>>>> > >>>>>>>>> can
>>>>>> > >>>>>>>>> be drift, but it considers them an irregularity that will
>>>>>> only
>>>>>> > >>>>>>>>> happen
>>>>>> > >>>>>>>>> occasionally and iron out over time.
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> In practice, however, metadata can be legitimately and
>>>>>> deliberately
>>>>>> > >>>>>>>>> different for different time series of the same name.
>>>>>> > >>>>>>>>> Instrumentation
>>>>>> > >>>>>>>>> libraries and even the exposition format inherently
>>>>>> require one
>>>>>> > >>>>>>>>> set of
>>>>>> > >>>>>>>>> metadata per metric name, but this is all only enforced
>>>>>> (and meant
>>>>>> > >>>>>>>>> to
>>>>>> > >>>>>>>>> be enforced) _per target_. Once the samples are ingested
>>>>>> (or even
>>>>>> > >>>>>>>>> sent
>>>>>> > >>>>>>>>> onwards via remote write), they have no notion of what
>>>>>> target they
>>>>>> > >>>>>>>>> came from. Furthermore, samples created by rule
>>>>>> evaluation don't
>>>>>> > >>>>>>>>> have
>>>>>> > >>>>>>>>> an originating target in the first place. (Which raises
>>>>>> the
>>>>>> > >>>>>>>>> question
>>>>>> > >>>>>>>>> of metadata for recording rules, which is another can of
>>>>>> worms I'd
>>>>>> > >>>>>>>>> like to open eventually...)
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> (There is also the technical difficulty that the WAL has
>>>>>> no notion
>>>>>> > >>>>>>>>> of
>>>>>> > >>>>>>>>> bundling or referencing all the series with the same
>>>>>> metric name.
>>>>>> > >>>>>>>>> That
>>>>>> > >>>>>>>>> was commented about in the doc but is not my focus here.)
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> Rob's doc sees TYPE as special because it is so cheap to
>>>>>> just add
>>>>>> > >>>>>>>>> to
>>>>>> > >>>>>>>>> every data point. That's correct, but it's giving me an
>>>>>> itch:
>>>>>> > >>>>>>>>> Should
>>>>>> > >>>>>>>>> we really create different ways of handling metadata,
>>>>>> depending on
>>>>>> > >>>>>>>>> its
>>>>>> > >>>>>>>>> expected size?
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> Compare this with labels. There is no upper limit to
>>>>>> their number
>>>>>> > >>>>>>>>> or
>>>>>> > >>>>>>>>> size. Still, we have no plan of treating "large" labels
>>>>>> differently
>>>>>> > >>>>>>>>> from "short" labels.
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> On top of that, we have by now gained the insight that
>>>>>> metadata is
>>>>>> > >>>>>>>>> changing over time and essentially has to be tracked per
>>>>>> series.
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> Or in other words: From a pure storage perspective,
>>>>>> metadata
>>>>>> > >>>>>>>>> behaves
>>>>>> > >>>>>>>>> exactly the same as labels! (There are certainly huge
>>>>>> differences
>>>>>> > >>>>>>>>> semantically, but those only manifest themselves on the
>>>>>> query
>>>>>> > >>>>>>>>> level,
>>>>>> > >>>>>>>>> i.e. how you treat it in PromQL etc.)
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> (This is not exactly a new insight. This is more or less
>>>>>> what I
>>>>>> > >>>>>>>>> said
>>>>>> > >>>>>>>>> during the 2016 dev summit, when we first discussed
>>>>>> remote write.
>>>>>> > >>>>>>>>> But
>>>>>> > >>>>>>>>> I don't want to dwell on "told you so" moments... :o)
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> There is a good reason why we don't just add metadata as
>>>>>> "pseudo
>>>>>> > >>>>>>>>> labels": As discussed a lot in the various design docs
>>>>>> including
>>>>>> > >>>>>>>>> Rob's
>>>>>> > >>>>>>>>> one, it would blow up the data size significantly because
>>>>>> HELP
>>>>>> > >>>>>>>>> strings
>>>>>> > >>>>>>>>> tend to be relatively long.
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> And that's the point where I would like to take a step
>>>>>> back: We are
>>>>>> > >>>>>>>>> discussing to essentially treat something that is
>>>>>> structurally the
>>>>>> > >>>>>>>>> same thing in three different ways: Way 1 for labels as
>>>>>> we know
>>>>>> > >>>>>>>>> them. Way 2 for "small" metadata. Way 3 for "big"
>>>>>> metadata.
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> However, while labels tend to be shorter than HELP
>>>>>> strings, there
>>>>>> > >>>>>>>>> is
>>>>>> > >>>>>>>>> the occasional use case with long or many labels.
>>>>>> (Infamously, at
>>>>>> > >>>>>>>>> SoundCloud, a binary accidentally put a whole HTML page
>>>>>> into a
>>>>>> > >>>>>>>>> label. That wasn't a use case, it was a bug, but the
>>>>>> Prometheus
>>>>>> > >>>>>>>>> server
>>>>>> > >>>>>>>>> ingesting that was just chugging along as if nothing
>>>>>> special had
>>>>>> > >>>>>>>>> happened. It looked weird in the expression browser,
>>>>>> though...) I'm
>>>>>> > >>>>>>>>> sure any vendor offering Prometheus remote storage as a
>>>>>> service
>>>>>> > >>>>>>>>> will
>>>>>> > >>>>>>>>> have a customer or two that use excessively long label
>>>>>> names. If we
>>>>>> > >>>>>>>>> have to deal with that, why not bite the bullet and treat
>>>>>> metadata
>>>>>> > >>>>>>>>> in
>>>>>> > >>>>>>>>> the same way as labels in general? Or to phrase it in
>>>>>> another way:
>>>>>> > >>>>>>>>> Any
>>>>>> > >>>>>>>>> solution for "big" metadata could be used for labels,
>>>>>> too, to
>>>>>> > >>>>>>>>> alleviate the pain with excessively long label names.
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> Or most succinctly: A robust and really good solution for
>>>>>> > >>>>>>>>> "big" metadata in remote write will make remote write
>>>>>> much more
>>>>>> > >>>>>>>>> efficient if applied to labels, too.
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> Imagine an NALSD tech interview question that boils down
>>>>>> to "design
>>>>>> > >>>>>>>>> Prometheus remote write". I bet that most of the better
>>>>>> candidates
>>>>>> > >>>>>>>>> will recognize that most of the payload will consist of
>>>>>> series
>>>>>> > >>>>>>>>> identifiers (call them labels or whatever) and they will
>>>>>> suggest
>>>>>> > >>>>>>>>> to
>>>>>> > >>>>>>>>> first transmit some kind of index and from then on only
>>>>>> transmit
>>>>>> > >>>>>>>>> short
>>>>>> > >>>>>>>>> series IDs. The best candidates will then find out about
>>>>>> all the
>>>>>> > >>>>>>>>> problems with that: How to keep the protocol stateless,
>>>>>> how to
>>>>>> > >>>>>>>>> re-sync
>>>>>> > >>>>>>>>> the index, how to update it if new series arrive etc.
>>>>>> Those are
>>>>>> > >>>>>>>>> certainly all good reasons why remote write as we know it
>>>>>> does not
>>>>>> > >>>>>>>>> transfer an index of series IDs.
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> However, my point here is that we are now discussing
>>>>>> exactly those
>>>>>> > >>>>>>>>> problems when we talk about metadata transmission. Let's
>>>>>> solve
>>>>>> > >>>>>>>>> those
>>>>>> > >>>>>>>>> problems and apply them to remote write in general!
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> Some thoughts about that:
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> Current remote write essentially transfers all labels for
>>>>>> _every_
>>>>>> > >>>>>>>>> sample. This works reasonably well. Even if metadata
>>>>>> blows up the
>>>>>> > >>>>>>>>> data
>>>>>> > >>>>>>>>> size by 5x or 10x, transfering the whole index of
>>>>>> metadata and
>>>>>> > >>>>>>>>> labels
>>>>>> > >>>>>>>>> should remain feasible as long as we do it less
>>>>>> frequently than
>>>>>> > >>>>>>>>> once
>>>>>> > >>>>>>>>> every 10 samples. It's something that could be done each
>>>>>> time a
>>>>>> > >>>>>>>>> remote-write receiver connects. From then on, we "only"
>>>>>> have to
>>>>>> > >>>>>>>>> track
>>>>>> > >>>>>>>>> when new series (or series with new metadata) show up and
>>>>>> transfer
>>>>>> > >>>>>>>>> those. (I know it's not trivial, but we are already
>>>>>> discussing
>>>>>> > >>>>>>>>> possible solutions in the various design docs.) Whenever a
>>>>>> > >>>>>>>>> remote-write receiver gets out of sync for some reason,
>>>>>> it can
>>>>>> > >>>>>>>>> simply
>>>>>> > >>>>>>>>> cut the connection and start with a complete re-sync
>>>>>> again. As
>>>>>> > >>>>>>>>> long as
>>>>>> > >>>>>>>>> that doesn't happen more often than once every 10
>>>>>> samples, we still
>>>>>> > >>>>>>>>> have a net gain. Combining this with sharding is another
>>>>>> challenge,
>>>>>> > >>>>>>>>> but it doesn't appear unsolvable.
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>>> --
>>>>>> > >>>>>>>>> Björn Rabenstein
>>>>>> > >>>>>>>>> [PGP-ID] 0x851C3DA17D748D03
>>>>>> > >>>>>>>>> [email] [email protected]
>>>>>> > >>>>>>>>>
>>>>>> > >>>>>>>> --
>>>>>> > >>>>>>>> You received this message because you are subscribed to
>>>>>> the Google
>>>>>> > >>>>>>>> Groups "Prometheus Developers" group.
>>>>>> > >>>>>>>> To unsubscribe from this group and stop receiving emails
>>>>>> from it,
>>>>>> > >>>>>>>> send an email to
>>>>>> [email protected]
>>>>>> > >>>>>>>> .
>>>>>> > >>>>>>>> To view this discussion on the web visit
>>>>>> > >>>>>>>>
>>>>>> https://groups.google.com/d/msgid/prometheus-developers/CABakzZaQGfVK5OAfKRP2nxBnp168GML5r_ok_f%3DyVeUdC6e2EQ%40mail.gmail.com
>>>>>> > >>>>>>>> <
>>>>>> https://groups.google.com/d/msgid/prometheus-developers/CABakzZaQGfVK5OAfKRP2nxBnp168GML5r_ok_f%3DyVeUdC6e2EQ%40mail.gmail.com?utm_medium=email&utm_source=footer
>>>>>> >
>>>>>> > >>>>>>>> .
>>>>>> > >>>>>>>>
>>>>>> > >>>>>>>
>>>>>> > >>>>>>>
>>>>>> > >>>>>>> --
>>>>>> > >>>>>>> Brian Brazil
>>>>>> > >>>>>>> www.robustperception.io
>>>>>> > >>>>>>>
>>>>>> > >>>>>>
>>>>>> > >>>>>
>>>>>> > >>>>> --
>>>>>> > >>>>> Brian Brazil
>>>>>> > >>>>> www.robustperception.io
>>>>>> > >>>>>
>>>>>> > >>>> --
>>>>>> > >
>>>>>> >
>>>>>> > --
>>>>>> > Brian Brazil
>>>>>> > www.robustperception.io
>>>>>> >
>>>>>>
>>>>>> --
>>>>>> Julien Pivotto
>>>>>> @roidelapluie
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Brian Brazil
>>>>> www.robustperception.io
>>>>>
>>>>
>>
>> --
>> Brian Brazil
>> www.robustperception.io
>>
>

-- 
Brian Brazil
www.robustperception.io

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-developers/CAHJKeLoWTwM%2B1a-M%2BxPEyihxtYSvyna9m5F%3DXW_Sihs2zoLpgg%40mail.gmail.com.
