Hey folks,
I'm staring at a problem I haven't found a solution for, and it's causing us
major issues.
We've had a PG go down: the first 3 OSDs in its set all crashed and came back
up, only to crash again with the following error in their logs:
-1> 2017-08-22 17:27:50.961633 7f4af4057700 -1 osd.1290 pg_epoch: 72946 pg[1.138s0( v 72946'430011 (62760'421568,72946'430011] local-les=72945 n=22918 ec=764 les/c/f 72945/72881/0 72942/72944/72944) [1290,927,672,456,177,1094,194,1513,236,302,1326]/[1290,927,672,456,177,1094,194,2147483647,236,302,1326] r=0 lpr=72944 pi=72880-72943/24 bft=1513(7) crt=72946'430011 lcod 72889'430010 mlcod 72889'430010 active+undersized+degraded+remapped+backfilling] recover_replicas: object added to missing set for backfill, but is not in recovering, error!
0> 2017-08-22 17:27:50.965861 7f4af4057700 -1 *** Caught signal (Aborted) **
in thread 7f4af4057700 thread_name:tp_osd_tp
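Note the 2147483647 (CRUSH_ITEM_NONE) at shard 7 of the acting set, with 1513
listed as the backfill target for that shard. In case it's useful, this is
roughly what I've been running to inspect the PG in the short windows when the
OSDs are up (PG id taken from the log above; just a sketch):

  # dump the full peering/recovery state of the affected PG
  ceph pg 1.138 query > pg-1.138-query.json
  # list objects the primary thinks are missing or unfound
  ceph pg 1.138 list_missing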
This has been going on since the weekend; before upgrading from 11.2.0 to
11.2.1 we were seeing a different error message.
The pool is running EC 8+3.
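If the exact profile matters, this is how I'd dump it (the pool and profile
names below are placeholders):

  # show which EC profile the pool uses
  ceph osd pool get <poolname> erasure_code_profile
  # dump that profile (k=8, m=3 in our case)
  ceph osd erasure-code-profile get <profilename>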
The OSDs crash with that error, get restarted by systemd, and fail again in
exactly the same way. Eventually systemd gives up, the
mon_osd_down_out_interval expires, and the PG just stays down+remapped while
the others recover and go active+clean.
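If anyone wants me to try specific commands, here is a minimal sketch of what
I can run to hold things still while debugging (these are the standard cluster
flags; I'm not claiming any of them fix the underlying crash):

  # stop scheduling more backfill/recovery onto the crashing shards
  ceph osd set nobackfill
  ceph osd set norecover
  # keep the mons from marking the flapping OSDs out in the meantime
  ceph osd set noout
  # check the current down/out timer via a mon admin socket
  # (run on the mon host; <id> is that mon's id)
  ceph daemon mon.<id> config get mon_osd_down_out_interval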
Can anybody help with this type of problem?
Best regards,
George Vasilakakos