Hi all,
yesterday we ran into a problem in our Ceph cluster. After a long day we were
able to fix it, but we are very unsatisfied with the fix and assume it is only
temporary. Any hint or help is much appreciated.
We run a production Ceph cluster (ceph version 10.2.5 on Ubuntu 16.04,
deployed with ceph-ansible) serving our private OpenStack cloud with object
and block storage. We went live a few days ago with a few friendly users /
projects on it. Yesterday we woke up to alarms about failed VM creations and
several down OSDs. The troubleshooting took us the whole day, and this is what
we found:
Starting situation:
Some OSDs were marked down, and we had 100 blocked requests on one OSD (plus a
few on other OSDs); Ceph was recovering and backfilling. We also saw OSDs
rejoining the cluster from time to time, while others left it as 'down',
causing recovery traffic.
Analysis:
- The OSD process with the 100 blocked requests was using significantly more
CPU than the other OSDs on the node. Let's call this OSD osd.113 (although
this changed later on). Most of the time it was above 120%; it looked like a
single thread was pinning one core at 100%.
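The per-thread CPU observation can be reproduced with standard Linux tools; a
minimal sketch, assuming the OSD id from above and one ceph-osd process per OSD:

```shell
# Hypothetical OSD id (osd.113 from above); adjust as needed.
OSD_ID=113
# Find the ceph-osd process for that OSD.
PID=$(pgrep -f "ceph-osd.*--id $OSD_ID" | head -n 1)
# List its threads sorted by CPU usage; a single row near 100%
# supports the "one stuck thread" theory.
ps -L -p "$PID" -o tid,pcpu,comm --sort=-pcpu | head -n 10
```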
- In the logs of the OSDs that recently left the cluster as down we found
several of the following entries:
2017-03-01 10:22:29.258982 7f840d258700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f842a0bf700' had timed out after 15
2017-03-01 10:22:29.258988 7f840d157700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f842a0bf700' had timed out after 15
2017-03-01 10:22:31.080447 7f8447ea6700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f842a0bf700' had timed out after 15
2017-03-01 10:22:32.759999 7f840d258700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f842a0bf700' had timed out after 15
2017-03-01 10:22:32.760010 7f840d157700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f842a0bf700' had timed out after 15
2017-03-01 10:22:36.080545 7f8447ea6700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f842a0bf700' had timed out after 15
2017-03-01 10:22:36.080568 7f8447ea6700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f842a0bf700' had suicide timed out after 150
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x80) [0x55c490e24520]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char
const*, long)+0x259) [0x55c490d61209]
3: (ceph::HeartbeatMap::is_healthy()+0xe6) [0x55c490d61b36]
4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x55c490d6238c]
5: (CephContextServiceThread::entry()+0x167) [0x55c490e3cda7]
6: (()+0x76ba) [0x7f4f028096ba]
7: (clone()+0x6d) [0x7f4f0088282d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
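To check whether a single op thread is the one timing out, the thread ids can
be tallied straight from the log; a sketch, assuming the default log location:

```shell
# Assumed default log path for osd.113 on a ceph-ansible deployment.
LOG=/var/log/ceph/ceph-osd.113.log
# Extract the thread id from each heartbeat timeout line and count
# occurrences; one dominant id points at a single stuck op thread.
grep "had timed out" "$LOG" \
  | grep -o "thread 0x[0-9a-f]*" \
  | sort | uniq -c | sort -rn | head
```

In our case all timeout lines named the same thread (0x7f842a0bf700), which
matched the one-hot-thread CPU picture above.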
On the monitor we also saw several of those entries for different OSDs:
2017-03-01 08:34:23.321262 osd.77 10.97.162.16:6843/61369 62043 : cluster
[WRN] map e357 wrongly marked me down
Something was preventing OSDs from properly sending or answering heartbeats...
- We then tried to increase the osd_heartbeat_grace parameter with 'ceph tell
osd.* injectargs' to break the chain of leaving and rejoining OSDs. We were
able to set it on all OSDs except osd.113 (the one using >120% CPU and having
the 100 blocked requests), because it did not answer. This bought us a bit
more stability, and the cluster recovered to an almost stable state:
# ceph pg dump_stuck | column -t
ok
pg_stat  state                                   up             up_primary  acting        acting_primary
4.33     active+recovering+degraded+remapped     [113,104,115]  113         [113,77,87]   113
8.7      active+recovery_wait+degraded           [113,75,94]    113         [113,75,94]   113
3.40     active+recovery_wait+degraded+remapped  [113,92,78]    113         [113,116,84]  113
Although it was saying '4.33 active+recovering', no actual recovery traffic
was visible in 'ceph -s' over quite some time. We tried 'ceph pg 4.33 query',
but the command did not return.
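The two steps above can be sketched as follows; the grace value of 60 and the
30-second deadline are our guesses, not tuned recommendations:

```shell
# Raise the heartbeat grace on all OSDs at runtime; injected args are
# not persisted, so also set them in ceph.conf if they should survive
# a restart.
ceph tell osd.* injectargs '--osd_heartbeat_grace 60'

# Guard 'ceph pg query' with a deadline so a hung command does not
# block the shell; GNU timeout exits with status 124 on expiry.
timeout 30 ceph pg 4.33 query || echo "pg 4.33 query hung or failed"
```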
- As we now had a clear indication that something was wrong with osd.113, we
decided to take it down and let Ceph recover. As soon as we took osd.113 down,
its PGs got a new primary OSD and ... the >120% CPU usage and the 100 blocked
requests disappeared from osd.113, only to reappear on the new primary OSD for
PG 4.33. We now also had a lot of additional recovery traffic because osd.113
had been taken down, and the new primary OSD eventually went down with a
suicide timeout as well (as seen above), switching the primary to yet another
OSD, *leaving us exactly with our starting situation*, just with a different
primary for the 'defect PG' 4.33. In this vicious circle the new primary OSDs
kept dying from heartbeat_map timeouts, causing PG 4.33 to 'wander' through
our OSDs, leaving behind a trail of down, rejoining and recovering OSDs/PGs.
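For the record, this is roughly how we took the OSD down and followed the
wandering primary; the systemd unit name is our assumption for a standard
Ubuntu 16.04 ceph-ansible deployment:

```shell
# Stop the suspect OSD daemon on its host (systemd unit on Ubuntu 16.04):
systemctl stop ceph-osd@113

# Or, from a node with an admin keyring, mark it down/out without stopping it:
ceph osd down 113
ceph osd out 113

# Record the up/acting sets of the problem PG every 10 seconds, to
# document the primary wandering from OSD to OSD:
while sleep 10; do
  printf '%s %s\n' "$(date -Is)" "$(ceph pg map 4.33)" >> /tmp/pg433-map.log
done
```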
(Temporary) Solution:
The pool with index number 4, which PG 4.33 belonged to, was an
openstack-gnocchi pool containing only telemetry data. As this data was not
essential for our operations, we decided to delete the entire pool to get rid
of PG 4.33 and recreate it. This is clearly 'cracking a nut with a
sledgehammer'. However, after PG 4.33 was gone the cluster was able to fully
recover and has remained stable since. Had the pool contained volume or object
storage data, this would not have been an option...
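In case the problem reappears, the sledgehammer itself is just two commands;
the pool name ('gnocchi') and the PG count (64) here are placeholders, ours
differed:

```shell
# Destructive: deletes every object in the pool. Jewel requires the
# pool name twice plus the safety flag.
ceph osd pool delete gnocchi gnocchi --yes-i-really-really-mean-it
# Recreate the pool with an appropriate placement-group count.
ceph osd pool create gnocchi 64
```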
Any hint for a better solution (in case the problem reappears) is highly
appreciated.
Daniel Marks | Cloud Engineer
codecentric AG | Hochstraße 11 | 42697 Solingen | Deutschland
www.codecentric.de | blog.codecentric.de | www.meettheexperts.de
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com