Re: [prometheus-users] Calculating Availability SLA over multiple VMs

Roland V Mon, 16 Mar 2020 14:02:23 -0700

Hi Debashish,

The way we did SLA reporting on our side was:


   - export an '*_up' metric for the VMs giving a value of 1 or 0
   - create silences via Alertmanager for maintenance periods, and ensure 
   they contain matchers that help identify the VMs (we used matchers like 
   'resource_group' & 'resource_name' as the machines run in Azure)
   - export silences, just like machine state, via 
   https://github.com/FXinnovation/alertmanager-silences-exporter
   (the exporter gives a value of 1 while a silence is active and 0 for 
   all other states)
   - create a recording rule that determines whether a VM is in an 'up', 
   'down' or 'under maintenance' state. We then query the metric created 
   by this rule over the time range for which we want to calculate the SLA.
   - share the results with our clients via Grafana
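
To make the recording-rule step more concrete, here is a rough sketch of
what such a rule file could look like. It is not our exact configuration.
The names vm_up, vm_maintenance, vm:state and vm:availability_percent_30d
are placeholders, and it assumes the silence exporter's output has already
been relabelled into a vm_maintenance series that carries the same
'resource_group' and 'resource_name' labels as the up metric (that
relabelling is not shown).

```yaml
groups:
  - name: vm_sla                      # placeholder group name
    rules:
      # State per VM: 2 = under maintenance, 1 = up, 0 = down.
      # The left side only exists while a silence is active; "or" falls
      # back to the plain up/down value for every other VM.
      - record: vm:state
        expr: |
          (max by (resource_group, resource_name) (vm_maintenance == 1) * 2)
          or
          max by (resource_group, resource_name) (vm_up)
      # Percentage of the last 30 days in which the VM was up or under
      # maintenance, sampled once per minute via a subquery.
      - record: vm:availability_percent_30d
        expr: avg_over_time((vm:state >= bool 1)[30d:1m]) * 100
```

This variant counts maintenance windows as available time. Whether that is
right, or whether maintenance should instead be excluded from the period,
depends on how your SLA is worded.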

Hope this helps!

Thanks,
Roland

On Monday, March 16, 2020 at 4:44:01 PM UTC-4, Christian Hoffmann wrote:
>
> Hi, 
>
> On 3/16/20 9:21 PM, Debashish Ghosh wrote: 
> > I am currently using Spring's Actuator/Micrometer to expose metrics 
> > that are scraped by Prometheus. 
> > The framework generates a metric called process_uptime_seconds, which 
> > is the number of seconds my app has been running on a VM. I have 2 VMs 
> > where my app is running, to provide high availability of 99.95 %. 
> > 
> > I am using the formula 
> > 
> > 100-(((30*24*60*60) - increase(process_uptime_seconds{job="Interop-InboundApi"}[30d]))/(30*24*60*60))*100 
> > 
> > to calculate the SLA. 
> > 
> > 30*24*60*60 represents the number of seconds in 30 days, and the 
> > difference from process_uptime_seconds gives the number of 
> > seconds the app was down on a VM. 
> > 
> > But the problem with this approach is that we periodically have to 
> > restart the service to apply patches, and we do that one VM at a 
> > time so that there is no downtime. 
> > 
> > But since the above formula creates one time series for each VM 
> > instance, the SLA goes down, because both servers are restarted one 
> > after the other. 
> > 
> > Is there a way to take this into account and calculate the SLA based 
> > on the time when both servers were down together? 
> Hrm, can't you just use the up metric to detect whether your application 
> was available? 
>
> That way, you could calculate availability of your service via 
> max(up{instance=~"server1|server2"}) == 1. I think that would make the 
> whole thing much easier, wouldn't it? 
>
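
A rough sketch of how that suggestion could be turned into a 30-day
availability figure, assuming both instances are scraped by Prometheus and
reusing the instance names from the example above. The group and rule
names are placeholders.

```yaml
groups:
  - name: service_availability        # placeholder group name
    rules:
      # max(up{...}) is 1 as long as at least one of the two instances
      # was scraped successfully, so a rolling restart of one server at
      # a time does not count as downtime. The subquery samples that
      # value once per minute over 30 days; the average is the fraction
      # of samples in which the service was reachable.
      - record: service:availability_percent_30d
        expr: avg_over_time((max(up{instance=~"server1|server2"}))[30d:1m]) * 100
```

For scale, 99.95 % over 30 days leaves about 21.6 minutes of allowed
downtime (43200 minutes * 0.0005), so a one-minute sampling step is about
as coarse as this calculation can tolerate.
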
> I fail to come up with an idea based on your process_uptime_seconds 
> approach. It may be possible (maybe using a recording rule which decides 
> for each evaluation interval whether your servers count as available or 
> not...?), but it sounds like it would get complicated quickly. 
>
>
> Kind regards, 
> Christian 
>

