Did you finish the OSD upgrade? Are any OSDs flapping (ceph -w)? Is there
anything unusual in the OSDs' log files?
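
If you haven't already, something like this should show flapping and any
version mismatch (osd.0 below is just an example ID, and ceph versions
needs Luminous mons, which you already have):

    # watch cluster events live; flapping shows up as repeated up/down transitions
    ceph -w
    # report which release every running daemon actually has
    ceph versions
    # check a single OSD's log for peering errors or crashes (default log path)
    less /var/log/ceph/ceph-osd.0.log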


Paul

2018-07-11 20:30 GMT+02:00 Magnus Grönlund <[email protected]>:

> Hi,
>
> We started to upgrade a Ceph cluster from Jewel (10.2.10) to Luminous (12.2.6).
>
> After upgrading and restarting the mons everything looked OK: the mons had
> quorum, all OSDs were up and in, and all the PGs were active+clean.
> But before I had time to start upgrading the OSDs it became obvious that
> something had gone terribly wrong.
> All of a sudden 1600 out of 4100 PGs were inactive and 40% of the data
> was misplaced!
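>
> (For reference, the mons were upgraded roughly like this, one node at a
> time, waiting for quorum to re-form before moving on:)
>
>     # on each mon host in turn, after installing the Luminous packages
>     systemctl restart ceph-mon@eselde02u32
>     # confirm the mon has rejoined quorum before touching the next one
>     ceph quorum_status | grep quorum_names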
>
> The mons appear OK and all OSDs are still up and in, but a few hours
> later there were still 1483 PGs stuck inactive, essentially all of them
> peering!
> Investigating one of the stuck PGs, it appears to be looping between
> “inactive”, “remapped+peering” and “peering”, and the epoch number is rising
> fast; see the attached pg query outputs.
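>
> (The attached outputs were captured with something like the following; the
> PG ID 1.2f3 is a made-up example, the real ones came from the stuck list:)
>
>     # list the PGs that are stuck inactive
>     ceph pg dump_stuck inactive
>     # dump the full peering state of one stuck PG
>     ceph pg 1.2f3 query
>     # the osdmap epoch reported here keeps climbing between runs
>     ceph osd stat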
>
> We really can’t afford to lose the cluster or the data, so any help or
> suggestions on how to debug or fix this issue would be very, very much
> appreciated!
>
>
>     health: HEALTH_ERR
>             1483 pgs are stuck inactive for more than 60 seconds
>             542 pgs backfill_wait
>             14 pgs backfilling
>             11 pgs degraded
>             1402 pgs peering
>             3 pgs recovery_wait
>             11 pgs stuck degraded
>             1483 pgs stuck inactive
>             2042 pgs stuck unclean
>             7 pgs stuck undersized
>             7 pgs undersized
>             111 requests are blocked > 32 sec
>             10586 requests are blocked > 4096 sec
>             recovery 9472/11120724 objects degraded (0.085%)
>             recovery 1181567/11120724 objects misplaced (10.625%)
>             noout flag(s) set
>             mon.eselde02u32 low disk space
>
>   services:
>     mon: 3 daemons, quorum eselde02u32,eselde02u33,eselde02u34
>     mgr: eselde02u32(active), standbys: eselde02u33, eselde02u34
>     osd: 111 osds: 111 up, 111 in; 800 remapped pgs
>          flags noout
>
>   data:
>     pools:   18 pools, 4104 pgs
>     objects: 3620k objects, 13875 GB
>     usage:   42254 GB used, 160 TB / 201 TB avail
>     pgs:     1.876% pgs unknown
>              34.259% pgs not active
>              9472/11120724 objects degraded (0.085%)
>              1181567/11120724 objects misplaced (10.625%)
>              2062 active+clean
>              1221 peering
>              535  active+remapped+backfill_wait
>              181  remapped+peering
>              77   unknown
>              13   active+remapped+backfilling
>              7    active+undersized+degraded+remapped+backfill_wait
>              4    remapped
>              3    active+recovery_wait+degraded+remapped
>              1    active+degraded+remapped+backfilling
>
>   io:
>     recovery: 298 MB/s, 77 objects/s
>
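> (In case it is useful: the blocked-request counts above can be broken down
> further like this; osd.0 is just an example ID:)
>
>     # health detail names the OSDs the blocked/slow requests sit on
>     ceph health detail | grep -i blocked
>     # on that OSD's host, dump the ops it is currently stuck on
>     ceph daemon osd.0 dump_ops_in_flight
>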


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
