On Tue, 11 Aug 2020 at 04:09, Callum Styan <[email protected]> wrote:

> I'm hesitant to add anything that significantly increases the network
> bandwidth usage of remote write while at the same time not giving users a
> way to tune the usage to their needs.
>
> I agree with Brian that we don't want the protocol itself to become
> stateful by introducing something like negotiation. I'd also prefer not to
> introduce multiple ways to do things, though I'm hoping we can find a way
> to accommodate your use case while not ballooning the average user's
> network egress bill.
>
> I am fine with forcing the consuming end to be somewhat stateful like in
> the case of Josh's PR where all metadata is sent periodically and must be
> stored by the remote storage system.
>



> Overall I'd like to see some numbers regarding current network bandwidth
> of remote write, remote write with metadata via Josh's PR, and remote write
> with sending metadata for every series in a remote write payload.
>

I agree, I noticed that in Rob's PR and had the same thought.

Brian


>
> Rob, I'll review your PR tomorrow but it looks like Julien and Brian may
> already have that covered.
>
> On Sun, Aug 9, 2020 at 9:36 PM Rob Skillington <[email protected]>
> wrote:
>
>> Update: The PR now sends the fields over remote write from the WAL and
>> metadata
>> is also updated in the WAL when any field changes.
>>
>> Now opened the PR against the primary repo:
>> https://github.com/prometheus/prometheus/pull/7771
>>
>> I have tested this end-to-end with a modified M3 branch:
>> https://github.com/m3db/m3/compare/r/test-prometheus-metadata
>> > {... "msg":"received series","labels":"{__name__="prometheus_rule_group_iterations_total",instance="localhost:9090",job="prometheus01",role="remote"}","type":"counter","unit":"","help":"The total number of scheduled rule group evaluations, whether executed or missed."}
>>
>> Tests still haven't been updated. Any feedback on the approach / data
>> structures would be greatly appreciated.
>>
>> Would be good to know what others' thoughts are on next steps.
>>
>> On Sat, Aug 8, 2020 at 11:21 AM Rob Skillington <[email protected]>
>> wrote:
>>
>>> Here's a draft PR that propagates metadata to the WAL and lets the WAL
>>> reader read it back:
>>> https://github.com/robskillington/prometheus/pull/1/files
>>>
>>> Would like a little bit of feedback on the datatypes and structure
>>> before going further, if folks are open to that.
>>>
>>> There's a few things not happening:
>>> - Remote write queue manager does not use or send these extra fields yet.
>>> - Head does not reset the "metadata" slice (not sure where "series"
>>> slice is
>>>   reset in the head for pending series writes to WAL, want to do in same
>>> place).
>>> - Metadata is not re-written on change yet.
>>> - Tests.
>>>
>>>
>>> On Sat, Aug 8, 2020 at 9:37 AM Rob Skillington <[email protected]>
>>> wrote:
>>>
>>>> Sounds good, I've updated the proposal with details on places in which
>>>> changes
>>>> are required given the new approach:
>>>>
>>>> https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit#
>>>>
>>>>
>>>> On Fri, Aug 7, 2020 at 2:09 PM Brian Brazil <
>>>> [email protected]> wrote:
>>>>
>>>>> On Fri, 7 Aug 2020 at 15:48, Rob Skillington <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> True - I mean this could also be a blacklist by config perhaps, so if
>>>>>> you
>>>>>> really don't want to have increased egress you can optionally turn
>>>>>> off sending
>>>>>> the TYPE, HELP, UNIT or send them at different frequencies via
>>>>>> config. We could
>>>>>> package some sensible defaults so folks don't need to update their
>>>>>> config.
>>>>>>
>>>>>> The main intention is to enable these added features and make it
>>>>>> possible for various consumers to adjust some of these parameters if
>>>>>> required, since backends can be so different in their implementation.
>>>>>> For M3 I would be totally fine with the extra egress, which should be
>>>>>> mitigated fairly considerably by Snappy and by the fact that HELP is
>>>>>> common across certain metric families, even when receiving it on every
>>>>>> single Remote Write request.
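As a sketch of what such per-field knobs might look like in remote_write configuration (every field name below is invented for illustration; nothing like this exists in Prometheus today):

```yaml
remote_write:
  - url: https://example.com/api/v1/write
    # Hypothetical metadata controls (not real Prometheus config fields):
    metadata:
      send_type: true          # TYPE is cheap; attach it to every sample
      send_help_interval: 10m  # resend HELP for a series at most this often
      send_unit_interval: 10m  # likewise for UNIT
```

Sensible defaults packaged with Prometheus would mean most users never touch these.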
>>>>>>
>>>>>
>>>>> That's really a micro-optimisation. If you are that worried about
>>>>> bandwidth you'd run a sidecar specific to your remote backend that was
>>>>> stateful and far more efficient overall. Sending the full label names and
>>>>> values on every request is going to be far more than the overhead of
>>>>> metadata on top of that, so I don't see a need as it stands for any of 
>>>>> this
>>>>> to be configurable.
>>>>>
>>>>> Brian
>>>>>
>>>>>
>>>>>>
>>>>>> On Fri, Aug 7, 2020 at 3:56 AM Brian Brazil <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> On Thu, 6 Aug 2020 at 22:58, Rob Skillington <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Björn,
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for the detailed response. I've had a few back and forths on
>>>>>>>> this with
>>>>>>>> Brian and Chris over IRC and CNCF Slack now too.
>>>>>>>>
>>>>>>>> I agree that fundamentally it seems naive to idealistically model
>>>>>>>> this around
>>>>>>>> per metric name. It needs to be per series given what may happen
>>>>>>>> w.r.t.
>>>>>>>> collision across targets, etc.
>>>>>>>>
>>>>>>>> Perhaps we can separate these discussions apart into two
>>>>>>>> considerations:
>>>>>>>>
>>>>>>>> 1) Modeling of the data such that it is kept around for
>>>>>>>> transmission (primarily
>>>>>>>> we're focused on WAL here).
>>>>>>>>
>>>>>>>> 2) Transmission (and of which you allude to has many areas for
>>>>>>>> improvement).
>>>>>>>>
>>>>>>>> For (1) - it seems like this needs to be done per time series,
>>>>>>>> thankfully we
>>>>>>>> actually already have modeled this to be stored per series data
>>>>>>>> just once in a
>>>>>>>> single WAL file. I will write up my proposal here, but it will
>>>>>>>> amount to
>>>>>>>> essentially encoding the HELP, UNIT and TYPE to the WAL per series
>>>>>>>> similar to
>>>>>>>> how labels for a series are encoded once per series in the WAL.
>>>>>>>> Since this
>>>>>>>> optimization is in place, there's already a huge dampening effect
>>>>>>>> on how
>>>>>>>> expensive it is to write out data about a series (e.g. labels). We
>>>>>>>> can always
>>>>>>>> go and collect a sample WAL file and measure how much extra size
>>>>>>>> with/without
>>>>>>>> HELP, UNIT and TYPE this would add, but it seems like it won't
>>>>>>>> fundamentally
>>>>>>>> change the order of magnitude in terms of "information about a
>>>>>>>> timeseries
>>>>>>>> storage size" vs "datapoints about a timeseries storage size". One
>>>>>>>> extra change
>>>>>>>> would be re-encoding the series into the WAL if the HELP changed
>>>>>>>> for that
>>>>>>>> series, just so that when HELP does change it can be up to date
>>>>>>>> from the view
>>>>>>>> of whoever is reading the WAL (i.e. the Remote Write loop). Since
>>>>>>>> this entry
>>>>>>>> needs to be loaded into memory for Remote Write today anyway, with
>>>>>>>> string
>>>>>>>> interning as suggested by Chris, it won't change the memory profile
>>>>>>>> algorithmically of a Prometheus with Remote Write enabled. There
>>>>>>>> will be some
>>>>>>>> overhead that at most would likely be similar to the label data,
>>>>>>>> but we aren't
>>>>>>>> altering data structures (so won't change big-O magnitude of memory
>>>>>>>> being used),
>>>>>>>> we're adding fields to existing data structures that exist and
>>>>>>>> string interning
>>>>>>>> should actually make it much less onerous since there is a large
>>>>>>>> duplicative
>>>>>>>> effect with HELP among time series.
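The string-interning idea mentioned above can be sketched roughly as follows (hypothetical types and names for illustration, not Prometheus's actual implementation): identical HELP strings across a metric family collapse to a single pooled copy, so per-series metadata adds little beyond a reference.

```go
package main

import "fmt"

// intern deduplicates strings: every equal string resolves to one pooled
// copy, so thousands of series sharing a HELP text store it only once.
type intern struct {
	pool map[string]string
}

func newIntern() *intern { return &intern{pool: map[string]string{}} }

func (i *intern) get(s string) string {
	if c, ok := i.pool[s]; ok {
		return c
	}
	i.pool[s] = s
	return s
}

// seriesMeta is a hypothetical per-series record: TYPE and UNIT are tiny,
// and HELP holds the pooled copy shared across the metric family.
type seriesMeta struct {
	Type, Unit, Help string
}

func main() {
	in := newIntern()
	help := "The total number of scheduled rule group evaluations."
	a := seriesMeta{Type: "counter", Help: in.get(help)}
	b := seriesMeta{Type: "counter", Help: in.get(help)}
	// Two series in memory, but only one pooled HELP string.
	fmt.Println(a.Help == b.Help, len(in.pool)) // true 1
}
```

This is why the memory profile stays in the same big-O class: the pool grows with the number of distinct HELP strings, not the number of series.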
>>>>>>>>
>>>>>>>> For (2) - now we have basically TYPE, HELP and UNIT all available
>>>>>>>> for
>>>>>>>> transmission if we wanted to send it with every single datapoint.
>>>>>>>> While I think
>>>>>>>> we should definitely examine HPACK like compression features as you
>>>>>>>> mentioned
>>>>>>>> Björn, I think we should think more about separating that kind of
>>>>>>>> work into a
>>>>>>>> Milestone 2 where this is considered.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> For the time being it's very plausible we could do some negotiation
>>>>>>>> with the receiving Remote Write endpoint by sending a "GET" to it
>>>>>>>> and seeing if it responds with a "capabilities + preferences"
>>>>>>>> response: the endpoint could specify that it would like to receive
>>>>>>>> metadata all the time on every single request, letting Snappy keep
>>>>>>>> the size from ballooning too much, or that it would like TYPE on
>>>>>>>> every single datapoint and HELP and UNIT every DESIRED_SECONDS or
>>>>>>>> so. To enable a "send HELP every 10 minutes" feature we would have
>>>>>>>> to add a "last sent" timestamp to the data structure that holds the
>>>>>>>> LABELS, TYPE, HELP and UNIT for each series, to know when to resend
>>>>>>>> to that backend, but that seems entirely plausible and would not use
>>>>>>>> more than 4 extra bytes.
>>>>>>>>
>>>>>>>
>>>>>>> Negotiation is fundamentally stateful, as the process that receives
>>>>>>> the first request may be a very different one from the one that receives
>>>>>>> the second - such as if an upgrade is in progress. Remote write is 
>>>>>>> intended
>>>>>>> to be a very simple thing that's easy to implement on the receiver end 
>>>>>>> and
>>>>>>> is a send-only request-based protocol, so request-time negotiation is
>>>>>>> basically out. Any negotiation needs to happen via the config file, and
>>>>>>> even then it'd be better if nothing ever needed to be configured. 
>>>>>>> Getting
>>>>>>> all the users of a remote write to change their config file or restart 
>>>>>>> all
>>>>>>> their Prometheus servers is not an easy task after all.
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> These thoughts are based on the discussion I've had and the
>>>>>>>> thoughts on this
>>>>>>>> thread. What's the feedback on this before I go ahead and
>>>>>>>> re-iterate the design
>>>>>>>> to more closely map to what I'm suggesting here?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Rob
>>>>>>>>
>>>>>>>> On Thu, Aug 6, 2020 at 2:01 PM Bjoern Rabenstein <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> On 03.08.20 03:04, Rob Skillington wrote:
>>>>>>>>> > Ok - I have a proposal which could be broken up into two pieces,
>>>>>>>>> first
>>>>>>>>> > delivering TYPE per datapoint, the second consistently and
>>>>>>>>> reliably HELP and
>>>>>>>>> > UNIT once per unique metric name:
>>>>>>>>> >
>>>>>>>>> > https://docs.google.com/document/d/1LY8Im8UyIBn8e3LJ2jB-MoajXkfAqW2eKzY735aYxqo/edit#heading=h.bik9uwphqy3g
>>>>>>>>>
>>>>>>>>> Thanks for the doc. I have commented on it, but while doing so, I
>>>>>>>>> felt
>>>>>>>>> the urge to comment more generally, which would not fit well into
>>>>>>>>> the
>>>>>>>>> margin of a Google doc. My thoughts are also a bit out of scope of
>>>>>>>>> Rob's design doc and more about the general topic of remote write
>>>>>>>>> and
>>>>>>>>> the equally general topic of metadata (about which we have an
>>>>>>>>> ongoing
>>>>>>>>> discussion among the Prometheus developers).
>>>>>>>>>
>>>>>>>>> Disclaimer: I don't know the remote-write protocol very well. My
>>>>>>>>> hope
>>>>>>>>> here is that my somewhat distant perspective is of some value as it
>>>>>>>>> allows me to take a step back. However, I might just miss crucial
>>>>>>>>> details
>>>>>>>>> that completely invalidate my thoughts. We'll see...
>>>>>>>>>
>>>>>>>>> I do care a lot about metadata, though. (And ironically, the reason
>>>>>>>>> why I declared remote write "somebody else's problem" is that I've
>>>>>>>>> always disliked how it fundamentally ignores metadata.)
>>>>>>>>>
>>>>>>>>> Rob's document embraces the fact that metadata can change over
>>>>>>>>> time,
>>>>>>>>> but it assumes that at any given time, there is only one set of
>>>>>>>>> metadata per unique metric name. It takes into account that there
>>>>>>>>> can
>>>>>>>>> be drift, but it considers it an irregularity that will only
>>>>>>>>> happen occasionally and iron itself out over time.
>>>>>>>>>
>>>>>>>>> In practice, however, metadata can be legitimately and deliberately
>>>>>>>>> different for different time series of the same name.
>>>>>>>>> Instrumentation
>>>>>>>>> libraries and even the exposition format inherently require one
>>>>>>>>> set of
>>>>>>>>> metadata per metric name, but this is all only enforced (and meant
>>>>>>>>> to
>>>>>>>>> be enforced) _per target_. Once the samples are ingested (or even
>>>>>>>>> sent
>>>>>>>>> onwards via remote write), they have no notion of what target they
>>>>>>>>> came from. Furthermore, samples created by rule evaluation don't
>>>>>>>>> have
>>>>>>>>> an originating target in the first place. (Which raises the
>>>>>>>>> question
>>>>>>>>> of metadata for recording rules, which is another can of worms I'd
>>>>>>>>> like to open eventually...)
>>>>>>>>>
>>>>>>>>> (There is also the technical difficulty that the WAL has no notion
>>>>>>>>> of
>>>>>>>>> bundling or referencing all the series with the same metric name.
>>>>>>>>> That
>>>>>>>>> was commented about in the doc but is not my focus here.)
>>>>>>>>>
>>>>>>>>> Rob's doc sees TYPE as special because it is so cheap to just add
>>>>>>>>> to
>>>>>>>>> every data point. That's correct, but it's giving me an itch:
>>>>>>>>> Should
>>>>>>>>> we really create different ways of handling metadata, depending on
>>>>>>>>> its
>>>>>>>>> expected size?
>>>>>>>>>
>>>>>>>>> Compare this with labels. There is no upper limit to their number
>>>>>>>>> or
>>>>>>>>> size. Still, we have no plan of treating "large" labels differently
>>>>>>>>> from "short" labels.
>>>>>>>>>
>>>>>>>>> On top of that, we have by now gained the insight that metadata is
>>>>>>>>> changing over time and essentially has to be tracked per series.
>>>>>>>>>
>>>>>>>>> Or in other words: From a pure storage perspective, metadata
>>>>>>>>> behaves
>>>>>>>>> exactly the same as labels! (There are certainly huge differences
>>>>>>>>> semantically, but those only manifest themselves on the query
>>>>>>>>> level,
>>>>>>>>> i.e. how you treat it in PromQL etc.)
>>>>>>>>>
>>>>>>>>> (This is not exactly a new insight. This is more or less what I
>>>>>>>>> said
>>>>>>>>> during the 2016 dev summit, when we first discussed remote write.
>>>>>>>>> But
>>>>>>>>> I don't want to dwell on "told you so" moments... :o)
>>>>>>>>>
>>>>>>>>> There is a good reason why we don't just add metadata as "pseudo
>>>>>>>>> labels": As discussed a lot in the various design docs including
>>>>>>>>> Rob's
>>>>>>>>> one, it would blow up the data size significantly because HELP
>>>>>>>>> strings
>>>>>>>>> tend to be relatively long.
>>>>>>>>>
>>>>>>>>> And that's the point where I would like to take a step back: We are
>>>>>>>>> discussing to essentially treat something that is structurally the
>>>>>>>>> same thing in three different ways: Way 1 for labels as we know
>>>>>>>>> them. Way 2 for "small" metadata. Way 3 for "big" metadata.
>>>>>>>>>
>>>>>>>>> However, while labels tend to be shorter than HELP strings, there
>>>>>>>>> is
>>>>>>>>> the occasional use case with long or many labels. (Infamously, at
>>>>>>>>> SoundCloud, a binary accidentally put a whole HTML page into a
>>>>>>>>> label. That wasn't a use case, it was a bug, but the Prometheus
>>>>>>>>> server
>>>>>>>>> ingesting that was just chugging along as if nothing special had
>>>>>>>>> happened. It looked weird in the expression browser, though...) I'm
>>>>>>>>> sure any vendor offering Prometheus remote storage as a service
>>>>>>>>> will
>>>>>>>>> have a customer or two that use excessively long label names. If we
>>>>>>>>> have to deal with that, why not bite the bullet and treat metadata
>>>>>>>>> in
>>>>>>>>> the same way as labels in general? Or to phrase it in another way:
>>>>>>>>> Any
>>>>>>>>> solution for "big" metadata could be used for labels, too, to
>>>>>>>>> alleviate the pain with excessively long label names.
>>>>>>>>>
>>>>>>>>> Or most succinctly: A robust and really good solution for
>>>>>>>>> "big" metadata in remote write will make remote write much more
>>>>>>>>> efficient if applied to labels, too.
>>>>>>>>>
>>>>>>>>> Imagine an NALSD tech interview question that boils down to "design
>>>>>>>>> Prometheus remote write". I bet that most of the better candidates
>>>>>>>>> will recognize that most of the payload will consist of series
>>>>>>>>> identifiers (call them labels or whatever) and they will suggest
>>>>>>>>> to
>>>>>>>>> first transmit some kind of index and from then on only transmit
>>>>>>>>> short
>>>>>>>>> series IDs. The best candidates will then find out about all the
>>>>>>>>> problems with that: How to keep the protocol stateless, how to
>>>>>>>>> re-sync
>>>>>>>>> the index, how to update it if new series arrive etc. Those are
>>>>>>>>> certainly all good reasons why remote write as we know it does not
>>>>>>>>> transfer an index of series IDs.
>>>>>>>>>
>>>>>>>>> However, my point here is that we are now discussing exactly those
>>>>>>>>> problems when we talk about metadata transmission. Let's solve
>>>>>>>>> those
>>>>>>>>> problems and apply them to remote write in general!
>>>>>>>>>
>>>>>>>>> Some thoughts about that:
>>>>>>>>>
>>>>>>>>> Current remote write essentially transfers all labels for _every_
>>>>>>>>> sample. This works reasonably well. Even if metadata blows up the
>>>>>>>>> data
>>>>>>>>> size by 5x or 10x, transferring the whole index of metadata and
>>>>>>>>> labels
>>>>>>>>> should remain feasible as long as we do it less frequently than
>>>>>>>>> once
>>>>>>>>> every 10 samples. It's something that could be done each time a
>>>>>>>>> remote-write receiver connects. From then on, we "only" have to
>>>>>>>>> track
>>>>>>>>> when new series (or series with new metadata) show up and transfer
>>>>>>>>> those. (I know it's not trivial, but we are already discussing
>>>>>>>>> possible solutions in the various design docs.) Whenever a
>>>>>>>>> remote-write receiver gets out of sync for some reason, it can
>>>>>>>>> simply
>>>>>>>>> cut the connection and start with a complete re-sync again. As
>>>>>>>>> long as
>>>>>>>>> that doesn't happen more often than once every 10 samples, we still
>>>>>>>>> have a net gain. Combining this with sharding is another challenge,
>>>>>>>>> but it doesn't appear unsolvable.
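The "transmit an index once, then only short series IDs" scheme Björn describes can be sketched on the sender side like this (a toy illustration under invented names; the hard parts he lists, such as re-sync and sharding, are deliberately out of scope here):

```go
package main

import "fmt"

// idIndex assigns a small integer ID to each label set the first time it
// is seen; subsequent requests can carry just the ID instead of the full
// labels, which dominate the payload today.
type idIndex struct {
	ids  map[string]uint64
	next uint64
}

func newIDIndex() *idIndex { return &idIndex{ids: map[string]uint64{}} }

// ref returns the series ID and whether the full label set must still be
// included in this request (true only on first sight; after a receiver
// re-sync the sender would simply start from an empty index again).
func (x *idIndex) ref(labels string) (uint64, bool) {
	if id, ok := x.ids[labels]; ok {
		return id, false
	}
	x.next++
	x.ids[labels] = x.next
	return x.next, true
}

func main() {
	idx := newIDIndex()
	id1, full1 := idx.ref(`{__name__="up",job="prometheus"}`)
	id2, full2 := idx.ref(`{__name__="up",job="prometheus"}`)
	fmt.Println(id1 == id2, full1, full2) // true true false: labels sent only once
}
```

The net gain Björn estimates follows directly: a full re-sync costs one index transfer, so it pays off as long as re-syncs happen less often than once every handful of samples per series.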
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Björn Rabenstein
>>>>>>>>> [PGP-ID] 0x851C3DA17D748D03
>>>>>>>>> [email] [email protected]
>>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "Prometheus Developers" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to [email protected]
>>>>>>>> .
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/prometheus-developers/CABakzZaQGfVK5OAfKRP2nxBnp168GML5r_ok_f%3DyVeUdC6e2EQ%40mail.gmail.com
>>>>>>>> <https://groups.google.com/d/msgid/prometheus-developers/CABakzZaQGfVK5OAfKRP2nxBnp168GML5r_ok_f%3DyVeUdC6e2EQ%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Brian Brazil
>>>>>>> www.robustperception.io
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Brian Brazil
>>>>> www.robustperception.io
>>>>>
>>
>

-- 
Brian Brazil
www.robustperception.io
