Hi all,
yesterday we ran into a problem in our Ceph cluster. After a long day we were
able to fix it, but we are very unsatisfied with the fix and assume it is only
temporary. Any hint or help is much appreciated.
We run a production Ceph cluster (ceph version 10.2.5 on Ubuntu 16.04,
deployed with ceph-ansible) serving our private OpenStack cloud with object
and block storage. We went live a few days ago with a few friendly users /
projects on it. Yesterday we woke up to alarms about failed VM creations and
several down OSDs. The troubleshooting took us the whole day, and this is what
we found:
Starting situation:
Some OSDs were marked down, and we had 100 blocked requests on one OSD (plus a
few on other OSDs); Ceph was recovering and backfilling. We also saw OSDs
rejoining the cluster from time to time, while others left it as 'down',
causing recovery traffic.
Analysis:
- The OSD process with the 100 blocked requests was using significantly more
CPU than the other OSDs on the node. Let's call this OSD osd.113 (although
this changed later on). Most of the time it was above 120%; it looked like a
single thread was pinning one core at 100%.
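The per-thread CPU observation can be reproduced with standard Linux tools; a
minimal sketch, assuming the OSD id from above and one ceph-osd process per OSD:

```shell
# Hypothetical OSD id (osd.113 from above); adjust as needed.
OSD_ID=113
# Find the ceph-osd process for that OSD.
PID=$(pgrep -f "ceph-osd.*--id $OSD_ID" | head -n 1)
# List its threads sorted by CPU usage; a single row near 100%
# supports the "one stuck thread" theory.
ps -L -p "$PID" -o tid,pcpu,comm --sort=-pcpu | head -n 10
```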
- In the logs of the OSDs that recently left the cluster as down we found
several of the following entries:
2017-03-01 10:22:29.258982 7f840d258700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f842a0bf700' had timed out after 15
2017-03-01 10:22:29.258988 7f840d157700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f842a0bf700' had timed out after 15
2017-03-01 10:22:31.080447 7f8447ea6700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f842a0bf700' had timed out after 15
2017-03-01 10:22:32.759999 7f840d258700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f842a0bf700' had timed out after 15
2017-03-01 10:22:32.760010 7f840d157700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f842a0bf700' had timed out after 15
2017-03-01 10:22:36.080545 7f8447ea6700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f842a0bf700' had timed out after 15
2017-03-01 10:22:36.080568 7f8447ea6700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f842a0bf700' had suicide timed out after 150
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x80) [0x55c490e24520]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char
const*, long)+0x259) [0x55c490d61209]
3: (ceph::HeartbeatMap::is_healthy()+0xe6) [0x55c490d61b36]
4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x55c490d6238c]
5: (CephContextServiceThread::entry()+0x167) [0x55c490e3cda7]
6: (()+0x76ba) [0x7f4f028096ba]
7: (clone()+0x6d) [0x7f4f0088282d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
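To check whether a single op thread is the one timing out, the thread ids can
be tallied straight from the log; a sketch, assuming the default log location:

```shell
# Assumed default log path for osd.113 on a ceph-ansible deployment.
LOG=/var/log/ceph/ceph-osd.113.log
# Extract the thread id from each heartbeat timeout line and count
# occurrences; one dominant id points at a single stuck op thread.
grep "had timed out" "$LOG" \
  | grep -o "thread 0x[0-9a-f]*" \
  | sort | uniq -c | sort -rn | head
```

In our case all timeout lines named the same thread (0x7f842a0bf700), which
matched the one-hot-thread CPU picture above.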
On the monitor we also saw several of those entries for different OSDs:
2017-03-01 08:34:23.321262 osd.77 10.97.162.16:6843/61369 62043 : cluster
[WRN] map e357 wrongly marked me down
Something was preventing OSDs from properly sending or answering heartbeats...
- We then tried to increase the osd_heartbeat_grace parameter with 'ceph tell
osd.* injectargs' to break the chain of leaving and rejoining OSDs. We were
able to set it on all OSDs except osd.113 (the one using >120% CPU and having
the 100 blocked requests), because it did not answer. This bought us a bit
more stability, and the cluster recovered to an almost stable state:
# ceph pg dump_stuck | column -t
ok
pg_stat  state                                   up             up_primary  acting        acting_primary
4.33     active+recovering+degraded+remapped     [113,104,115]  113         [113,77,87]   113
8.7      active+recovery_wait+degraded           [113,75,94]    113         [113,75,94]   113
3.40     active+recovery_wait+degraded+remapped  [113,92,78]    113         [113,116,84]  113
Although it was saying '4.33 active+recovering', no actual recovery traffic
was visible in 'ceph -s' over quite some time. We tried 'ceph pg 4.33 query',
but the command did not return.
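The two steps above can be sketched as follows; the grace value of 60 and the
30-second deadline are our guesses, not tuned recommendations:

```shell
# Raise the heartbeat grace on all OSDs at runtime; injected args are
# not persisted, so also set them in ceph.conf if they should survive
# a restart.
ceph tell osd.* injectargs '--osd_heartbeat_grace 60'

# Guard 'ceph pg query' with a deadline so a hung command does not
# block the shell; GNU timeout exits with status 124 on expiry.
timeout 30 ceph pg 4.33 query || echo "pg 4.33 query hung or failed"
```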
- As we now had a clear indication that something was wrong with osd.113, we
decided to take it down and let Ceph recover. As soon as we took osd.113 down,
its PGs got a new primary OSD and ... the >120% CPU usage and the 100 blocked
requests disappeared from osd.113, only to reappear on the new primary OSD for
PG 4.33. We now also had a lot of additional recovery traffic because osd.113
had been taken down, and the new primary OSD eventually went down with a
suicide timeout as well (as seen above), switching the primary to yet another
OSD, *leaving us exactly with our starting situation*, just with a different
primary for the 'defect PG' 4.33. In this vicious circle the new primary OSDs
kept dying from heartbeat_map timeouts, causing PG 4.33 to 'wander' through
our OSDs, leaving behind a trail of down, rejoining and recovering OSDs/PGs.
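For the record, this is roughly how we took the OSD down and followed the
wandering primary; the systemd unit name is our assumption for a standard
Ubuntu 16.04 ceph-ansible deployment:

```shell
# Stop the suspect OSD daemon on its host (systemd unit on Ubuntu 16.04):
systemctl stop ceph-osd@113

# Or, from a node with an admin keyring, mark it down/out without stopping it:
ceph osd down 113
ceph osd out 113

# Record the up/acting sets of the problem PG every 10 seconds, to
# document the primary wandering from OSD to OSD:
while sleep 10; do
  printf '%s %s\n' "$(date -Is)" "$(ceph pg map 4.33)" >> /tmp/pg433-map.log
done
```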
(Temporary) Solution:
The pool with index number 4, which PG 4.33 belonged to, was an
openstack-gnocchi pool containing only telemetry data. As this data was not
essential for our operations, we decided to delete the entire pool to get rid
of PG 4.33 and recreate it. This is clearly 'cracking a nut with a
sledgehammer'. However, after PG 4.33 was gone the cluster was able to fully
recover and has remained stable since. Had the pool contained volume or object
storage data, this would not have been an option...
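In case the problem reappears, the sledgehammer itself is just two commands;
the pool name ('gnocchi') and the PG count (64) here are placeholders, ours
differed:

```shell
# Destructive: deletes every object in the pool. Jewel requires the
# pool name twice plus the safety flag.
ceph osd pool delete gnocchi gnocchi --yes-i-really-really-mean-it
# Recreate the pool with an appropriate placement-group count.
ceph osd pool create gnocchi 64
```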
Any hint for a better solution (in case the problem reappears) is highly
appreciated.
Daniel Marks | Cloud Engineer
codecentric AG | Hochstraße 11 | 42697 Solingen | Deutschland
www.codecentric.de | blog.codecentric.de | www.meettheexperts.de
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com