Hi Debashish,

The way we did SLA reporting on our side was:
- export an '*_up' metric for the VMs giving a value of 1 or 0
- create silences via Alertmanager for maintenance periods, and ensure they contain matchers that help identify the VMs (we used matchers like 'resource_group' & 'resource_name' as the machines run in Azure)
- export silences just like machine state via: https://github.com/FXinnovation/alertmanager-silences-exporter
  The exporter gives you a value of 1 while the silence is active, and 0 for all other states.
- create a recording rule to check if a VM is in an 'up', 'down' or 'under maintenance' state. We then evaluate the metric created here over the time range for which we want to calculate the SLA.
- share results via Grafana with our clients

Hope this helps!

Thanks,
Roland

On Monday, March 16, 2020 at 4:44:01 PM UTC-4, Christian Hoffmann wrote:
>
> Hi,
>
> On 3/16/20 9:21 PM, Debashish Ghosh wrote:
> > I am currently using Spring's actuator/micrometer to spit out metrics
> > that are scraped by Prometheus.
> > The framework generates a metric called *process_uptime_seconds* which
> > is the number of seconds my app has been running in a VM. I have *2 VMs*
> > where my app is running to provide high availability of 99.95%.
> >
> > I am using the formula
> >
> > 100-(((30*24*60*60) - increase(process_uptime_seconds{job="Interop-InboundApi"}[30d]))/(30*24*60*60))*100
> >
> > to calculate the SLA.
> >
> > 30*24*60*60 represents the number of seconds in 30 days and the
> > difference with the process_uptime_seconds will give the number of
> > seconds the app was down in a VM.
> >
> > But the problem with this approach is that periodically we have to
> > *restart* the service to apply patches, and while doing so we do it one by
> > one so that there is no downtime.
> >
> > But since the above formula creates one timeseries for each VM instance
> > the SLA goes down since both the servers are restarted one after the
> > other.
> >
> > Is there a way to take this into consideration to calculate the SLA based on
> > the time *when both the servers were down together*?
> Hrm, can't you just use the up metric to detect whether your application
> was available?
>
> That way, you could calculate availability of your service via
> max(up{instance=~"server1|server2"}) == 1. I think that would make the
> whole thing much easier, wouldn't it?
>
> I fail to come up with an idea based on your process_uptime_seconds
> approach. It may be possible (maybe using a recording rule which decides
> for each evaluation interval whether your servers count as available or
> not...?), but it sounds like it would get complicated quickly.
>
>
> Kind regards,
> Christian
>
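
A rough sketch of how the two ideas in this thread might look in PromQL. The first query follows the max(up...) suggestion, reusing the job label from the original formula. The second folds in maintenance silences along the lines of the list at the top. The metric names vm_up and vm_maintenance_active are hypothetical placeholders (the real names depend on your exporters), and the sketch assumes both are 0/1 series carrying resource_group and resource_name labels. The 1m subquery resolution is also just an assumption; pick something close to your scrape interval.

```promql
# Service-level availability over the last 30 days, in percent.
# At each 1m step the service counts as up if at least one instance is up,
# so restarting the two VMs one after the other does not lower the result.
avg_over_time(
  (max(up{job="Interop-InboundApi"}))[30d:1m]
) * 100

# Maintenance-aware, per-VM variant (hypothetical metric names).
# A VM counts as available at a step if it is up OR covered by an active
# silence. label_replace adds a "source" label so that "or" keeps both
# sides even when their label sets would otherwise be identical.
avg_over_time(
  (
    max by (resource_group, resource_name) (
        label_replace(vm_up, "source", "up", "", "")
      or
        label_replace(vm_maintenance_active, "source", "maintenance", "", "")
    )
  )[30d:1m]
) * 100
```

A 30-day subquery at 1m resolution can be expensive to evaluate, so it is probably worth putting the inner expression into a recording rule and running avg_over_time over the recorded series instead.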

