No, nothing like that.

The cluster is in the process of having more OSDs added and, while that was 
ongoing, one was removed because the underlying disk was throwing up a bunch of 
read errors.
Shortly after, the first three OSDs in this PG started crashing with error 
messages about corrupted EC shards. We seemed to be running into 
http://tracker.ceph.com/issues/18624, so we upgraded to 11.2.1, which in 
practice only changed the error message they fail with. Our problem now looks 
a lot like this: http://tracker.ceph.com/issues/18162

For a bit more context, here are two more events going backwards in the dump:


    -3> 2017-08-22 17:42:09.443216 7fa2e283d700  0 osd.1290 pg_epoch: 73324 pg[1.138s0( v 73085'430014 (62760'421568,73085'430014] local-les=73323 n=22919 ec=764 les/c/f 73323/72881/0 73321/73322/73322) [1290,927,672,456,177,1094,194,1513,236,302,1326]/[1290,927,672,456,177,1094,194,2147483647,236,302,1326] r=0 lpr=73322 pi=72880-73321/179 rops=1 bft=1513(7) crt=73085'430014 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] failed_push 1:1c959fdd:::datadisk%2frucio%2fmc16_13TeV%2f41%2f30%2fAOD.11927271._003020.pool.root.1.0000000000000000:head from shard 177(4), reps on  unfound? 0
    -2> 2017-08-22 17:42:09.443299 7fa2e283d700  5 -- op tracker -- seq: 490, time: 2017-08-22 17:42:09.443297, event: done, op: MOSDECSubOpReadReply(1.138s0 73324 ECSubReadReply(tid=5, attrs_read=0))
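
If it would help, we can also try dumping what shard 4 (osd.177 in this case) 
thinks it has for that PG with ceph-objectstore-tool, along these lines (the 
data path below is a guess at the default filestore layout, and the OSD has to 
be stopped while the tool runs):

# systemctl stop ceph-osd@177
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-177 --pgid 1.138s4 --op list
# systemctl start ceph-osd@177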

No amount of taking OSDs out or restarting them fixes it. At this point the 
first three OSDs have been marked out by Ceph: they flapped enough that systemd 
gave up trying to restart them, they stayed down long enough, and 
mon_osd_down_out_interval expired. Now the pg map looks like this:

# ceph pg map 1.138
osdmap e73599 pg 1.138 (1.138) -> up 
[111,1325,437,456,177,1094,194,1513,236,302,1326] acting 
[2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,1326]
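
In case the full peering and recovery state is useful to anyone, we can also 
pull it with the usual query commands (output omitted here because it's very 
long, and assuming the query doesn't simply hang with the acting set in that 
state):

# ceph pg 1.138 query
# ceph health detail | grep 1.138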

Looking at #18162, it matches what we're seeing on our production system 
(which is experiencing a service outage because of this), but the fact that 
the issue is marked as minor severity and hasn't had any updates in two months 
is disconcerting.

As for deep scrubbing, it sounds like it could work in a general corruption 
situation, but not with a PG stuck in down+remapped and its first three OSDs 
crashing within about five minutes of starting.
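
For completeness, my reading of that suggestion in command form would be 
roughly the following (with osd.1290 as the nominal primary in our case); the 
problem is the OSDs don't stay up long enough for a deep scrub to run:

# ceph pg deep-scrub 1.138
# systemctl restart ceph-osd@1290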


Thanks, 

George



From: Paweł Woszuk [[email protected]]
Sent: 22 August 2017 19:19
To: [email protected]; Vasilakakos, George (STFC,RAL,SC)
Subject: Re: [ceph-users] OSDs in EC pool flapping

Have you experienced huge memory consumption by the flapping OSD daemons? The 
restarts could be triggered by running out of memory (OOM killer).

If yes, this could be connected with an OSD device error (bad blocks?). We've 
experienced something similar, though on the Jewel release rather than Kraken. 
The solution was to find the PG causing the error, set it to deep scrub 
manually and restart the PG's primary OSD.

Hope that helps, or at least leads to some solution.



On 22 August 2017 18:39:47 CEST, [email protected] wrote:

Hey folks,


I'm staring at a problem I have found no solution for, and it is causing major 
issues. We've had a PG go down, with its first three OSDs all crashing and 
coming back only to crash again with the following error in their logs:

    -1> 2017-08-22 17:27:50.961633 7f4af4057700 -1 osd.1290 pg_epoch: 72946 pg[1.138s0( v 72946'430011 (62760'421568,72946'430011] local-les=72945 n=22918 ec=764 les/c/f 72945/72881/0 72942/72944/72944) [1290,927,672,456,177,1094,194,1513,236,302,1326]/[1290,927,672,456,177,1094,194,2147483647,236,302,1326] r=0 lpr=72944 pi=72880-72943/24 bft=1513(7) crt=72946'430011 lcod 72889'430010 mlcod 72889'430010 active+undersized+degraded+remapped+backfilling] recover_replicas: object added to missing set for backfill, but is not in recovering, error!
     0> 2017-08-22 17:27:50.965861 7f4af4057700 -1 *** Caught signal (Aborted) **
 in thread 7f4af4057700 thread_name:tp_osd_tp

This has been going on since the weekend, when we were seeing a different 
error message before upgrading from 11.2.0 to 11.2.1.
The pool is running EC 8+3.
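
(For reference, the profile can be double-checked with something like the 
following; the pool and profile names here are placeholders:)

# ceph osd pool get <pool-name> erasure_code_profile
# ceph osd erasure-code-profile get <profile-name>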

The OSDs crash with that error only to be restarted by systemd and fail again 
in exactly the same way. Eventually systemd gives up, the 
mon_osd_down_out_interval expires and the PG just stays down+remapped while 
the others recover and go active+clean.
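
The only mitigation I can think of so far is to stop the flapping and the 
automatic out-marking while we debug, roughly along these lines (it obviously 
doesn't bring the PG back):

# ceph osd set noout
# systemctl stop ceph-osd@1290 ceph-osd@927 ceph-osd@672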

Can anybody help with this type of problem?


Best regards,

George Vasilakakos



Paweł Woszuk

PCSS, Poznańskie Centrum Superkomputerowo-Sieciowe

ul. Jana Pawła II nr 10, 61-139 Poznań

Polska


_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
