Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem

Jan Pekař - Imatic Sat, 27 Apr 2019 01:57:05 -0700


On 26/04/2019 21.50, Gregory Farnum wrote:

On Fri, Apr 26, 2019 at 10:55 AM Jan Pekař - Imatic <jan.pe...@imatic.cz> wrote:

Hi,


yesterday my cluster reported slow request for minutes and after restarting 
OSDs (reporting slow requests) it stuck with peering PGs. Whole
cluster was not responding and IO stopped.

I also notice, that problem was with cephx - all OSDs were reporting the same 
(even the same number of secret_id)

cephx: verify_authorizer could not get service secret for service osd 
secret_id=14086
...... conn(0x559e15a50000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 
cs=0 l=1).handle_connect_msg: got bad authorizer
auth: could not find secret_id=14086

My questions are:

Why happened that?
Can I prevent cluster from stopping to work (with cephx enabled)?
How quickly are keys rotating/expiring and can I check problems with that 
anyhow?

I'm running NTP on nodes (and also ceph monitors), so time should not be the 
issue. I noticed, that some monitor nodes has no timezone set,
but I hope MONs are using UTC to distribute keys to clients. Or different 
timezone between MON and OSD can cause the problem?

Hmm yeah, it's probably not using UTC. (Despite it being good
practice, it's actually not an easy default to adhere to.) cephx
requires synchronized clocks and probably the same timezone (though I
can't swear to that.)

I "fixed" the problem by restarting monitors.

It happened for the second time during last 3 months, so I'm reporting it as 
issue, that can happen.

I also noticed in all OSDs logs

2019-04-25 10:06:55.652239 7faf00096700 -1 monclient: _check_auth_rotating 
possible clock skew, rotating keys expired way too early (before
2019-04-25 09:06:55.652222)

approximately 7 hours before problem occurred. I can see, that it related to 
the issue. But why 7 hours? Is there some timeout or grace
period of old keys usage before they are invalidated?

7 hours shouldn't be directly related. IIRC by default a new rotating
key is issued every hour, it gives out the current and next key on
request, and daemons accept keys within a half-hour offset of what
they believe the current time to be. Something like that.
-Greg

If it is like you wrote, UTC is not problem. I'm running cluster with this configuration over 1 year and there were only 2-3 incidents ofthis kind.

I changed timezone, restarted services and I will wait...
I forget to mention, that I'm running luminous.

JP

Thank you

With regards

Jan Pekar

--
============
Ing. Jan Pekař
jan.pe...@imatic.cz
----
Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz
============
--

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] PG stuck peering - OSD cephx: verify_authorizer key problem

Reply via email to