Re: [prometheus-users] Is it okay to post a Prometheus user survey here?

'Tom Lee' via Prometheus Users Fri, 22 May 2020 07:44:30 -0700

Awesome, appreciate it. Thank you all so much for your help getting this
out!


On Fri, May 22, 2020 at 6:47 AM Julius Volz <[email protected]> wrote:

> Hi Tom,
>
> Just posted the final survey here:
>
> https://groups.google.com/forum/#!topic/prometheus-users/XU7tbVn23co
> https://groups.google.com/forum/#!topic/prometheus-developers/ToCQNP2mODQ
>
> Let's see what results look like, hope it's helpful although not all
> questions made it this time :)
>
> Regards,
> Julius
>
> On Fri, May 22, 2020 at 10:49 AM Julius Volz <[email protected]>
> wrote:
>
>> Yeah, I think as interesting as this could be, the survey is growing
>> quite large already, and this would be one of the more complicated
>> questions in terms of explaining it clearly enough and then getting users
>> to compile the results. So I'm tending towards leaving it out this time
>> around.
>>
>> But from experience you can safely assume that most large Prometheus
>> deployments have a few metric names that are huge in their number of series
>> (like a couple of 10k), and that would blow up any graph or other UI
>> display without aggregation / filtering.
>>
>> On Wed, May 20, 2020 at 7:00 PM Tom Lee <[email protected]> wrote:
>>
>>> Yeah, agree. I really like the "largest N metric names" idea. I think
>>> both total series and "top N metrics" are interesting for different
>>> reasons, but also agree getting "real" numbers is a challenge whatever we
>>> decide to do here. :)
>>>
>>> On Wed, May 20, 2020 at 6:38 AM Julius Volz <[email protected]>
>>> wrote:
>>>
>>>> On Sun, May 17, 2020 at 7:57 PM Tom Lee <[email protected]> wrote:
>>>>
>>>>>> Yes, I'm interested in what Tom's intent is behind the question. From
>>>>>> a Prometheus perspective, the total time-series load is most important.
>>>>>> But it might be different for his use case.
>>>>>>
>>>>>
>>>>> Ah yep, really great question. I'm going to absolutely butcher the
>>>>> terminology here, but the idea is we're sort of trying to differentiate
>>>>> between "number of unique metric names" and "label/dimensional cardinality
>>>>> within those metrics". The reason for us differentiating is something of
>>>>> an implementation detail with respect to our own systems, but I think it also
>>>>> applies somewhat to Prometheus and/or Grafana too: when you run a
>>>>> non-aggregating query for a metric *x*, you might expect to see one
>>>>> timeseries charted -- or you might see hundreds or even thousands. In our
>>>>> own test setup we have JMX metrics for 15 Kafka servers reporting in.
>>>>> Executing a "query" like *kafka_cluster_Partition_Value* (a metric
>>>>> reported by the JMX exporter on behalf of Kafka) yields something like
>>>>> 20,000-30,000 distinct timeseries charted by Prometheus. It takes a
>>>>> surprising amount of time to execute that simple little query as a result.
>>>>> This sort of cardinality "explosion" has big implications for system
>>>>> architecture and scalability in our own systems, too.
>>>>>
>>>>
>>>> Sorry for the delay! Yeah, makes sense, metric names that have many
>>>> series can be problematic in UIs when doing queries without filters or
>>>> aggregations. On the other hand, we know that having at least *some* of
>>>> those is very common (almost every user has a couple huge ones), so we
>>>> probably don't need a survey to tell us that :) More importantly maybe, to
>>>> see how many metrics are too "overloaded", just having the total number of
>>>> metric names vs. the total number of series doesn't answer the question
>>>> fully: you don't know whether the series are evenly split up across your
>>>> metric names, or whether they're all clustered in a few names. It's also a
>>>> bit challenging to get users to compile a list of distinct metric names
>>>> across Prometheus servers, without some command-line foo or similar. We
>>>> could ask something along the lines of "How many series do your largest N
>>>> metric names contain?", and then give them a query like 'topk(3, count
>>>> by(__name__) ({__name__!=""}))' to determine that per server. It would
>>>> still require some manual work to combine results between servers though,
>>>> hmmm...
>>>>
>>>
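
The manual step mentioned above, combining per-server results, could be
sketched roughly as follows. This is only an illustration under stated
assumptions: the function name and the sample numbers are hypothetical, each
server is assumed to scrape a disjoint set of targets (so counts can simply be
added), and the per-server input is assumed to come from the untruncated
'count by(__name__) ({__name__!=""})' rather than the topk(3, ...) form, since
a name that ranks fourth on every server could still be among the largest
overall.

```python
from collections import Counter


def combine_top_metrics(per_server_counts, n=3):
    """Sum per-metric-name series counts across servers, return the n largest.

    per_server_counts: iterable of dicts mapping metric name -> series count,
    one dict per Prometheus server (hypothetical input shape).
    """
    totals = Counter()
    for counts in per_server_counts:
        # Counter.update adds to existing counts instead of overwriting them.
        totals.update(counts)
    return totals.most_common(n)


# Hypothetical per-server results, for illustration only.
server_a = {"kafka_cluster_Partition_Value": 12000, "http_requests_total": 800, "up": 15}
server_b = {"kafka_cluster_Partition_Value": 9000, "node_cpu_seconds_total": 1200, "up": 20}

print(combine_top_metrics([server_a, server_b], n=2))
```

If the servers are redundant replicas scraping the same targets, adding the
counts would double-count series, so the maximum per name may be a better
choice in that setup.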

