Hi all,

I have to come back to this case. On Monday I had to restart an MDS to get rid
of a stuck client caps recall. Right after that failover, the MONs went into a
voting frenzy again. I already restarted all of them like last time, but this
time it doesn't help. I might be dealing with a different problem here.

In an effort to collect debug info, I set debug_mon on the leader to 10/10, and
it's producing voluminous output. Unfortunately, while debug_mon=10/10, the
voting frenzy largely subsides. It seems I'm in exactly the situation described
by the tip "When debug output slows down your system, the latency can hide race
conditions." at
https://docs.ceph.com/en/octopus/rados/troubleshooting/log-and-debug/.

The election frequency is significantly lower when debug_mon=10/10. I managed
to catch one, though, and pasted the 20 seconds of log preceding that election
here: https://pastebin.com/hGPvVkuR . I hope it contains a clue; I can't see
anything that stands out.
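To put a number on "significantly lower", one could compare the spacing of elections in a log window with debug_mon=10/10 against one with default logging. Below is a minimal Python sketch, assuming cluster-log lines in the format of the excerpt quoted further down (`M/D/YY H:MM:SS AM|PM[INF]mon.<name> calling monitor election`). The 10-second window that merges one election's "calling monitor election" messages into a single episode is an arbitrary assumption, not a Ceph setting. The sample lines are the election lines from the original post.

```python
import re
from datetime import datetime, timedelta

# Election lines copied from the cluster log quoted in this thread (newest first).
LOG = """\
2/9/23 4:42:26 PM[INF]mon.ceph-01 calling monitor election
2/9/23 4:42:26 PM[INF]mon.ceph-26 calling monitor election
2/9/23 4:42:26 PM[INF]mon.ceph-25 calling monitor election
2/9/23 4:42:26 PM[INF]mon.ceph-02 calling monitor election
2/9/23 4:40:00 PM[INF]overall HEALTH_OK
2/9/23 4:24:29 PM[INF]mon.ceph-01 calling monitor election
2/9/23 4:24:29 PM[INF]mon.ceph-02 calling monitor election
2/9/23 4:24:29 PM[INF]mon.ceph-03 calling monitor election
2/9/23 4:24:29 PM[INF]mon.ceph-01 calling monitor election
2/9/23 4:24:29 PM[INF]mon.ceph-26 calling monitor election
2/9/23 4:24:29 PM[INF]mon.ceph-25 calling monitor election
2/9/23 4:24:29 PM[INF]mon.ceph-02 calling monitor election
2/9/23 4:23:59 PM[INF]mon.ceph-01 calling monitor election
2/9/23 4:23:59 PM[INF]mon.ceph-02 calling monitor election
2/9/23 3:43:08 PM[INF]mon.ceph-01 calling monitor election
2/9/23 3:43:08 PM[INF]mon.ceph-26 calling monitor election
2/9/23 3:43:08 PM[INF]mon.ceph-25 calling monitor election
"""

# Assumed line format: "<M/D/YY H:MM:SS AM|PM>[INF]mon.<name> calling monitor election"
ELECTION_RE = re.compile(
    r"^(\d{1,2}/\d{1,2}/\d{2} \d{1,2}:\d{2}:\d{2} [AP]M)\[INF\]mon\.(\S+) calling monitor election$"
)
TS_FORMAT = "%m/%d/%y %I:%M:%S %p"


def election_times(text):
    """Return the sorted timestamps of all 'calling monitor election' lines."""
    times = []
    for line in text.splitlines():
        m = ELECTION_RE.match(line.strip())
        if m:
            times.append(datetime.strptime(m.group(1), TS_FORMAT))
    return sorted(times)


def episode_starts(times, window=timedelta(seconds=10)):
    """Merge calls at most `window` apart into one episode; return episode start times."""
    starts = []
    previous = None
    for t in times:
        if previous is None or t - previous > window:
            starts.append(t)
        previous = t
    return starts


starts = episode_starts(election_times(LOG))
gaps = [int((b - a).total_seconds()) for a, b in zip(starts, starts[1:])]
print(len(starts), "elections; gaps in seconds:", gaps)
# prints: 4 elections; gaps in seconds: [2451, 30, 1077]
```

Running the same script on a log window captured with debug_mon=10/10 and on one with default logging gives two gap lists that can be compared directly.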

Is there anything else I can look for?

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <fr...@dtu.dk>
Sent: Thursday, February 9, 2023 5:29 PM
To: Gregory Farnum; Dan van der Ster
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Frequent calling monitor election

Hi Dan and Gregory,

thanks! These are good pointers. I will look into them tomorrow.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Gregory Farnum <gfar...@redhat.com>
Sent: 09 February 2023 17:12:23
To: Dan van der Ster
Cc: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Frequent calling monitor election

Also, the fact that the current leader (ceph-01) is one of the monitors
calling an election each time suggests the problem is with getting
commit acks back from one of its followers.
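That pattern can be checked mechanically against the cluster log. Here is a minimal Python sketch, assuming the log-line format from Frank's excerpt and an arbitrary 10-second window for grouping one election's messages. It groups the "calling monitor election" lines into episodes and reports, for each one, which monitors called and whether the leader named above (ceph-01) was among them. The input is a subset of the lines from the original post.

```python
import re
from datetime import datetime, timedelta

LEADER = "ceph-01"  # current leader according to this thread

# Subset of the election lines from the cluster log quoted in this thread.
LOG = """\
2/9/23 4:42:26 PM[INF]mon.ceph-01 calling monitor election
2/9/23 4:42:26 PM[INF]mon.ceph-26 calling monitor election
2/9/23 4:42:26 PM[INF]mon.ceph-25 calling monitor election
2/9/23 4:42:26 PM[INF]mon.ceph-02 calling monitor election
2/9/23 4:23:59 PM[INF]mon.ceph-01 calling monitor election
2/9/23 4:23:59 PM[INF]mon.ceph-02 calling monitor election
2/9/23 3:43:08 PM[INF]mon.ceph-01 calling monitor election
2/9/23 3:43:08 PM[INF]mon.ceph-26 calling monitor election
2/9/23 3:43:08 PM[INF]mon.ceph-25 calling monitor election
"""

# Assumed line format: "<timestamp AM|PM>[INF]mon.<name> calling monitor election"
ELECTION_RE = re.compile(r"^(.+? [AP]M)\[INF\]mon\.(\S+) calling monitor election$")
TS_FORMAT = "%m/%d/%y %I:%M:%S %p"


def callers_per_episode(text, window=timedelta(seconds=10)):
    """Group election calls at most `window` apart; return [(start, sorted callers)]."""
    events = []
    for line in text.splitlines():
        m = ELECTION_RE.match(line.strip())
        if m:
            events.append((datetime.strptime(m.group(1), TS_FORMAT), m.group(2)))
    events.sort()

    episodes = []  # list of (start time, set of calling mons)
    previous = None
    for ts, mon in events:
        if previous is None or ts - previous > window:
            episodes.append((ts, set()))
        episodes[-1][1].add(mon)
        previous = ts
    return [(start, sorted(mons)) for start, mons in episodes]


episodes = callers_per_episode(LOG)
for start, mons in episodes:
    print(start.strftime("%H:%M:%S"), "leader called:", LEADER in mons, mons)
```

This only confirms the pattern from the cluster log; finding which follower is slow to ack still needs the mon debug logs on the leader.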

On Thu, Feb 9, 2023 at 8:09 AM Dan van der Ster <dvand...@gmail.com> wrote:
>
> Hi Frank,
>
> Check the mon logs with some increased debug levels to find out what
> the leader is busy with.
> We have a similar issue (though only daily), and it turned out to be
> related to the mon leader timing out while doing a SMART check.
> See https://tracker.ceph.com/issues/54313 for how I debugged that.
>
> Cheers, Dan
>
> On Thu, Feb 9, 2023 at 7:56 AM Frank Schilder <fr...@dtu.dk> wrote:
> >
> > Hi all,
> >
> > our monitors have enjoyed democracy since the beginning. However, I don't
> > share their sudden excitement about voting:
> >
> > 2/9/23 4:42:30 PM[INF]overall HEALTH_OK
> > 2/9/23 4:42:30 PM[INF]mon.ceph-01 is new leader, mons ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 in quorum (ranks 0,1,2,3,4)
> > 2/9/23 4:42:26 PM[INF]mon.ceph-01 calling monitor election
> > 2/9/23 4:42:26 PM[INF]mon.ceph-26 calling monitor election
> > 2/9/23 4:42:26 PM[INF]mon.ceph-25 calling monitor election
> > 2/9/23 4:42:26 PM[INF]mon.ceph-02 calling monitor election
> > 2/9/23 4:40:00 PM[INF]overall HEALTH_OK
> > 2/9/23 4:30:00 PM[INF]overall HEALTH_OK
> > 2/9/23 4:24:34 PM[INF]overall HEALTH_OK
> > 2/9/23 4:24:34 PM[INF]mon.ceph-01 is new leader, mons ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 in quorum (ranks 0,1,2,3,4)
> > 2/9/23 4:24:29 PM[INF]mon.ceph-01 calling monitor election
> > 2/9/23 4:24:29 PM[INF]mon.ceph-02 calling monitor election
> > 2/9/23 4:24:29 PM[INF]mon.ceph-03 calling monitor election
> > 2/9/23 4:24:29 PM[INF]mon.ceph-01 calling monitor election
> > 2/9/23 4:24:29 PM[INF]mon.ceph-26 calling monitor election
> > 2/9/23 4:24:29 PM[INF]mon.ceph-25 calling monitor election
> > 2/9/23 4:24:29 PM[INF]mon.ceph-02 calling monitor election
> > 2/9/23 4:24:04 PM[INF]overall HEALTH_OK
> > 2/9/23 4:24:03 PM[INF]mon.ceph-01 is new leader, mons ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 in quorum (ranks 0,1,2,3,4)
> > 2/9/23 4:23:59 PM[INF]mon.ceph-01 calling monitor election
> > 2/9/23 4:23:59 PM[INF]mon.ceph-02 calling monitor election
> > 2/9/23 4:20:00 PM[INF]overall HEALTH_OK
> > 2/9/23 4:10:00 PM[INF]overall HEALTH_OK
> > 2/9/23 4:00:00 PM[INF]overall HEALTH_OK
> > 2/9/23 3:50:00 PM[INF]overall HEALTH_OK
> > 2/9/23 3:43:13 PM[INF]overall HEALTH_OK
> > 2/9/23 3:43:13 PM[INF]mon.ceph-01 is new leader, mons ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 in quorum (ranks 0,1,2,3,4)
> > 2/9/23 3:43:08 PM[INF]mon.ceph-01 calling monitor election
> > 2/9/23 3:43:08 PM[INF]mon.ceph-26 calling monitor election
> > 2/9/23 3:43:08 PM[INF]mon.ceph-25 calling monitor election
> >
> > We moved a switch from one rack to another, and after the switch came back
> > up, the monitors frequently bitch about who is the alpha. How do I get them
> > to focus on their daily duties again?
> >
> > Thanks for any help!
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io