Hi all,
I'm hoping some of you have experience dealing with this, as unfortunately
this is the first time we have encountered this issue.
We currently have placement groups that are stuck unclean with
'active+remapped' as their last state.
The rundown of what happened:
Yesterday morning, one of our network engineers was working on some LACP bonds
on the same switch stack that also houses this cluster's internal and public
Ceph networks.
Unfortunately the engineer also accidentally touched the LACP bonds of all 3
monitor servers, and issues started to appear.
In rapid succession we started losing OSDs, one by one, and rebalance/recovery
started kicking in.
As connectivity between the monitor servers appeared OK (ping connectivity was
somehow still there, a quorum was still visible and ceph commands worked
on all three), we didn't suspect the monitor servers at first.
When investigating the OSDs that were marked down, the logs of those OSDs
were full of the error messages below:
- monclient: _check_auth_rotating possible clock skew,
rotating keys expired way too early
- auth: could not find secret_id
- cephx: verify_authorizer could not get service secret for
service osd secret_id
- x.x.x.x:6801/1258067 >> x.x.x.x:0/1115346558
pipe(0x560136fd8800 sd=706 :6801 s=0 pgs=0 cs=0 l=1 c=0x560122778e80).accept:
got bad authorizer
We suspected a time sync issue, but everything turned out to be OK.
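For what it's worth, the check was along these lines on every mon and OSD node
(plain NTP tooling, nothing Ceph-specific):

  ntpq -p    # peers, offsets and jitter looked fine everywhere
  date -u    # wall clocks matched across all nodes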
As more and more OSDs started failing, we changed the crush map to add 2
additional OSD nodes (which were not housing any data at that moment) to the
affected pools, but the same messages kept appearing on these OSDs as well.
In the meantime enough OSDs were down that everything ground to a halt.
After we found out about the LACP bond changes, they were reverted and all OSDs
came up again.
Unfortunately, after some time the rebalance/recovery stopped, and the status
now gives the following information:
health HEALTH_WARN
1088 pgs stuck unclean
recovery 92/1073206 objects degraded (0.009%)
recovery 53092/1073206 objects misplaced (4.947%)
nodeep-scrub,sortbitwise,require_jewel_osds flag(s) set
monmap e1: 3 mons at
{srv-ams3-cmon-01=192.168.152.3:6789/0,srv-ams3-cmon-02=192.168.152.4:6789/0,srv-ams3-cmon-03=192.168.152.5:6789/0}
election epoch 5152, quorum 0,1,2
srv-ams3-cmon-01,srv-ams3-cmon-02,srv-ams3-cmon-03
osdmap e30517: 39 osds: 39 up, 39 in; 1088 remapped pgs
flags nodeep-scrub,sortbitwise,require_jewel_osds
pgmap v20285289: 2340 pgs, 22 pools, 2056 GB data, 524 kobjects
4057 GB used, 12409 GB / 16466 GB avail
92/1073206 objects degraded (0.009%)
53092/1073206 objects misplaced (4.947%)
1252 active+clean
1088 active+remapped
There does not seem to be anything that prevents continued service to the
connected clients, but when querying such a placement group (example query
below the list), it shows that:
- 2 OSDs are acting (the pools have a replication size of 2 at the moment)
- 1 OSD is primary
- Both OSDs are listed as the value of 'actingbackfill'
- up_primary has the value '-1'
- No OSD is up
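For reference, this is the kind of output I mean; the pg id below is just an
example:

  ceph pg 3.1a7 query    # check the 'up', 'acting', 'up_primary' and 'actingbackfill' fields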
We already tried reweighting the affected primary OSDs, but the affected
placement groups are not touched by the rebalance.
Restarting the OSDs also did not have any effect.
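Concretely, those attempts were along these lines (the osd id and weight are
just examples; the nodes run systemd):

  ceph osd reweight 12 0.8         # reweight the primary OSD of a stuck pg
  systemctl restart ceph-osd@12    # restart that OSD's daemon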
We even tried 'ceph osd crush tunables optimal', but as we already suspected,
it did not have any effect.
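I can paste the tunables as currently applied if it helps; they can be dumped
with:

  ceph osd crush show-tunables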
Sorry for the long read, but does someone have an idea what we could try?
I did read about setting 'osd_find_best_info_ignore_history_les' to true, but
I'm not sure what the implications would be when using this setting.
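As far as I understand, it would be injected on the primary OSD of a stuck pg
with something like the below (osd id is just an example), but this is exactly
the part I'm hesitant about:

  ceph tell osd.12 injectargs '--osd_find_best_info_ignore_history_les=true'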
Additionally, we did set the nodeep-scrub flag during the recovery; could this
be something a deep scrub would fix?
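If so, I assume the way to go would be roughly the following (pg id again just
an example):

  ceph osd unset nodeep-scrub    # re-enable deep scrubbing cluster-wide
  ceph pg deep-scrub 3.1a7       # trigger a deep scrub on one of the stuck pgs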
Thanks in advance!
Roel