Hi, We have a dedicated job that collects disk metrics:
- job_name: node_disks
  params:
    collect[]:
      - diskstats
      - filefd
      - filesystem
      - mdadm
      - mountstats
      - nfs
      - nfsd
- job_name: node
  params:
    collect[]:
      - arp
      - bonding
      - conntrack
      - cpu
      - entropy
      - hwmon
      - infiniband
      - loadavg
      - meminfo
      - netclass
      - netdev
      - netstat
      - ntp
      - processes
      - sockstat
      - stat
      - textfile
      - time
      - timex
      - uname
      - vmstat
      - xfs
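
For context, a minimal sketch of how those two jobs might look as complete scrape_configs entries, both hitting the same node_exporter but with different collect[] filters. The target address and scrape_timeout below are placeholders, not our real values:

scrape_configs:
  - job_name: node_disks
    # Filesystem-related collectors are isolated here so a hung NFS mount
    # only stalls this scrape, not the general node metrics.
    scrape_timeout: 30s                         # placeholder
    metrics_path: /metrics
    params:
      collect[]: [diskstats, filefd, filesystem, mdadm, mountstats, nfs, nfsd]
    static_configs:
      - targets: ['node01.example.com:9100']    # placeholder target
  - job_name: node
    metrics_path: /metrics
    params:
      collect[]: [cpu, meminfo, netdev]         # plus the other collectors listed above
    static_configs:
      - targets: ['node01.example.com:9100']    # placeholder target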
Stale NFS will usually be noticed by:

up{job="node_disks"} == 0
  and
label_replace(up{job="node"} == 1, "job", "node_disks", "", "")

and by a second rule:

node_filesystem_avail_bytes offset 8h
  unless node_filesystem_avail_bytes
  and on(job, instance) up == 1

Those two expressions have worked fine for us in the past.
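
For completeness, here is a sketch of how those expressions could be wrapped into alerting rules; the group name, alert names, and "for" durations are placeholders, not taken from our actual rules:

groups:
  - name: nfs-staleness                   # placeholder name
    rules:
      - alert: NodeDisksScrapeStuck       # placeholder name
        # The disks job is down while the node job on the same instance is
        # still up; label_replace rewrites the job label on the right-hand
        # side so the two up series have matching labels for the 'and'.
        expr: |
          up{job="node_disks"} == 0
            and
          label_replace(up{job="node"} == 1, "job", "node_disks", "", "")
        for: 10m                          # placeholder duration
      - alert: FilesystemMetricsStale     # placeholder name
        # A filesystem series existed 8h ago, is missing now, and the target
        # is still up: the collector has silently stopped reporting it.
        expr: |
          node_filesystem_avail_bytes offset 8h
            unless node_filesystem_avail_bytes
            and on(job, instance) up == 1
        for: 15m                          # placeholder duration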
On 03 Mar 18:11, Ben Kochie wrote:
> We added some mitigation for filesystem hangs. The node_exporter will
> notice a stuck filesystem and stop attempting to gather metrics from it
> until it gets un-stuck. Although, I don't think we have any metrics for
> when that happens, only log errors.
>
> On Tue, Mar 3, 2020 at 6:03 PM Serkan Çoban <[email protected]> wrote:
>
> > If I remember correctly, node_exporter will hang too when an NFS share
> > hangs. Maybe you can test it...
> >
> > On Tue, Mar 3, 2020 at 6:26 PM Yagyansh S. Kumar
> > <[email protected]> wrote:
> > >
> > > I also thought about doing the same, but I am keeping that as a last
> > resort because that would require me to push the script to all my 2500+
> > servers.
> > >
> > > On Tuesday, March 3, 2020 at 8:46:27 PM UTC+5:30, Murali Krishna
> > Kanagala wrote:
> > >>
> > >> I would write a small shell script that tries to write to the NFS
> > mount path and writes the status to a file that can be read by the
> > textfile collector, and schedule that shell script via cron. I think this
> > is the easiest solution.
> > >>
> > >> On Tue, Mar 3, 2020, 9:12 AM Yagyansh S. Kumar <[email protected]>
> > wrote:
> > >>>
> > >>> Already enabled the nfs and nfsd collectors. So far I haven't found
> > anything that can accurately give me information about an NFS hang.
> > >>> Correct me if I am wrong, but I don't think disk IO is a good indicator
> > of an NFS hang, as there may be times when no activity is happening on the
> > NFS, but that does not mean the NFS is hung. (e.g. I have 25 NFS mounts on
> > one of my servers; some of them are used rarely, so we won't find any
> > substantial IO on those mounts, but I still need to know whether they are
> > accessible or not.) Still, thanks for the suggestion, will try it out.
> > >>>
> > >>>
> > >>> On Tuesday, March 3, 2020 at 8:35:03 PM UTC+5:30, Murali Krishna
> > Kanagala wrote:
> > >>>>
> > >>>> Try enabling the nfs options in the node_exporter config. It will
> > spit out some metrics about the NFS status.
> > >>>>
> > >>>> Also look at the disk IO metrics from node_exporter; no activity
> > there indicates the NFS is not doing anything.
> > >>>>
> > >>>> On Tue, Mar 3, 2020, 7:10 AM Yagyansh S. Kumar <[email protected]>
> > wrote:
> > >>>>>
> > >>>>> I want to check if the NFS is hung (i.e. whether it is accessible
> > from the server or not, and if yes, what response time it is getting).
> > I know that using the mountstats and nfs collectors we get a lot of
> > metrics for NFS, but I haven't found any that can reliably tell me when
> > the NFS hangs.
> > >>>>> Thanks in advance.
--
(o- Julien Pivotto
//\ Open-Source Consultant
V_/_ Inuits - https://www.inuits.eu

