No, nothing like that.
The cluster is in the process of having more OSDs added and, while that was
ongoing, one was removed because the underlying disk was throwing up a bunch of
read errors.
Shortly after, the first three OSDs in this PG started crashing with error
messages about corrupted EC shards. We seemed to be running into
http://tracker.ceph.com/issues/18624, so we moved on to 11.2.1, which essentially
means they now fail with a different error message. Our problem looks a bit
like this: http://tracker.ceph.com/issues/18162
For a bit more context, here are two more events, going backwards in the dump:
-3> 2017-08-22 17:42:09.443216 7fa2e283d700 0 osd.1290 pg_epoch: 73324 pg[1.138s0( v 73085'430014 (62760'421568,73085'430014] local-les=73323 n=22919 ec=764 les/c/f 73323/72881/0 73321/73322/73322) [1290,927,672,456,177,1094,194,1513,236,302,1326]/[1290,927,672,456,177,1094,194,2147483647,236,302,1326] r=0 lpr=73322 pi=72880-73321/179 rops=1 bft=1513(7) crt=73085'430014 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] failed_push 1:1c959fdd:::datadisk%2frucio%2fmc16_13TeV%2f41%2f30%2fAOD.11927271._003020.pool.root.1.0000000000000000:head from shard 177(4), reps on unfound? 0
-2> 2017-08-22 17:42:09.443299 7fa2e283d700 5 -- op tracker -- seq: 490, time: 2017-08-22 17:42:09.443297, event: done, op: MOSDECSubOpReadReply(1.138s0 73324 ECSubReadReply(tid=5, attrs_read=0))
No amount of taking OSDs out or restarting them fixes it. At this point the
first 3 have been marked out by Ceph: they flapped enough that systemd gave up
trying to restart them, they stayed down long enough, and
mon_osd_down_out_interval expired. Now the pg map looks like this:
# ceph pg map 1.138
osdmap e73599 pg 1.138 (1.138) -> up
[111,1325,437,456,177,1094,194,1513,236,302,1326] acting
[2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,1326]
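For anyone who wants to dig further, the per-shard peering and backfill state can be pulled out of a PG query. Just a sketch, using the PG id from above:

# ceph pg 1.138 query > pg_1.138_query.json   # per-shard peering/backfill/missing info, dumped for offline inspection
# ceph health detail | grep 1.138             # one-line summary of why the PG is reported down/degraded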
Looking at #18162, it looks a lot like what we're seeing in our production
system (which is experiencing a service outage because of this), but the fact
that the issue is marked as minor severity and hasn't had any updates in two
months is disconcerting.
As for deep scrubbing, it sounds like it could possibly work in a general
corruption situation, but not with a PG stuck in down+remapped and its first 3
OSDs crashing out after 5 minutes of operation.
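For what it's worth, the procedure Paweł suggests (quoted below) would look roughly like this; a sketch only, assuming systemd-managed OSDs and using osd.1290, the shard 0 primary from the logs, as the example:

# journalctl -k | grep -i 'out of memory'   # check whether the OOM killer has been hitting the ceph-osd processes
# ceph pg deep-scrub 1.138                  # queue a manual deep scrub of the affected PG
# systemctl restart ceph-osd@1290           # then restart the PG's primary OSD

That said, as above, it's not obvious this applies while the PG is stuck down+remapped and its primaries keep aborting.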
Thanks,
George
From: Paweł Woszuk [[email protected]]
Sent: 22 August 2017 19:19
To: [email protected]; Vasilakakos, George (STFC,RAL,SC)
Subject: Re: [ceph-users] OSDs in EC pool flapping
Have you experienced huge memory consumption by the flapping OSD daemons? The
restarts could be triggered by running out of memory (OOM killer).
If yes, this could be connected with an OSD device error (bad blocks?), but we
experienced something similar on the Jewel release, not Kraken. The solution was
to find the PG that causes the error, set it to deep scrub manually and restart
the PG's primary OSD.
Hope that helps, or at least leads to some solution.
On 22 August 2017 18:39:47 CEST, [email protected] wrote:
Hey folks,
I'm staring at a problem that I have found no solution for and which is causing
major issues.
We've had a PG go down with the first 3 OSDs all crashing and coming back only
to crash again with the following error in their logs:
-1> 2017-08-22 17:27:50.961633 7f4af4057700 -1 osd.1290 pg_epoch: 72946 pg[1.138s0( v 72946'430011 (62760'421568,72946'430011] local-les=72945 n=22918 ec=764 les/c/f 72945/72881/0 72942/72944/72944) [1290,927,672,456,177,1094,194,1513,236,302,1326]/[1290,927,672,456,177,1094,194,2147483647,236,302,1326] r=0 lpr=72944 pi=72880-72943/24 bft=1513(7) crt=72946'430011 lcod 72889'430010 mlcod 72889'430010 active+undersized+degraded+remapped+backfilling] recover_replicas: object added to missing set for backfill, but is not in recovering, error!
0> 2017-08-22 17:27:50.965861 7f4af4057700 -1 *** Caught signal (Aborted) **
in thread 7f4af4057700 thread_name:tp_osd_tp
This has been going on since the weekend; we were seeing a different error
message before upgrading from 11.2.0 to 11.2.1.
The pool is running EC 8+3.
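In case it's useful, the exact profile can be read back from the cluster; the profile name below is a placeholder for whatever ceph osd pool ls detail reports for pool 1:

# ceph osd pool ls detail                        # shows which erasure_code_profile is attached to pool 1
# ceph osd erasure-code-profile get <profile>    # should report k=8 m=3 plus the plugin in use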
The OSDs crash with that error only to be restarted by systemd and fail again
the exact same way. Eventually systemd gives up, the mon_osd_down_out_interval
expires and the PG just stays down+remapped while others recover and go
active+clean.
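In the meantime, something one could try to stop the cluster marking more of these OSDs out while debugging (a sketch; noout is a cluster-wide flag, reset-failed just clears systemd's start-limit counter for the unit, and osd.1290 is again used as the example):

# ceph osd set noout                     # stop the mons marking further down OSDs out
# systemctl reset-failed ceph-osd@1290   # clear the start-limit so systemd will allow another restart
# systemctl start ceph-osd@1290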
Can anybody help with this type of problem?
Best regards,
George Vasilakakos
Paweł Woszuk
PCSS, Poznańskie Centrum Superkomputerowo-Sieciowe
ul. Jana Pawła II nr 10, 61-139 Poznań
Polska
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com