Hi guys,

I had a functioning Ceph system that reported HEALTH_OK. It was running with 3 osds on 3 servers.

Then I added an extra osd on 1 of the servers using the commands from the documentation here:

http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
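For reference, the sequence I followed was roughly the one below - paraphrased from memory, so the osd id, host name, and paths are placeholders from my setup, not necessarily exactly what I typed:

```shell
# Roughly the manual add-osd steps from that doc page (paraphrased;
# osd id 3 and host "server1" are placeholders)
ceph osd create                        # allocates the next free osd id (3 in my case)
mkdir /var/lib/ceph/osd/ceph-3
ceph-osd -i 3 --mkfs --mkkey           # initialize the data dir and generate a key
ceph auth add osd.3 osd 'allow *' mon 'allow rwx' \
    -i /var/lib/ceph/osd/ceph-3/keyring
ceph osd crush set 3 osd.3 1.0 pool=default host=server1   # add it to the CRUSH map
service ceph start osd.3
```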

Shortly after I did that, 2 of the existing osds crashed.

I restarted them, and after some hours they were up and running again, but soon one of them crashed again - and a third existing osd crashed as well. I restarted those two and waited some hours for them to come up. A short while later one of them crashed again.

I have then restarted that last one and watched the logs closely. The same pattern seems to repeat itself every time: it starts up doing its normal maintenance before going "up" (which takes a long while). Then it seems to be running, but logs the following every 5 seconds:

heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had timed out after 30

After some time it logs:

===================================================
heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had suicide timed out after 300

2013-01-17 15:24:35.051524 7f053f149700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f053f149700 time 2013-01-17 15:24:33.849654
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0x8462bb]
 2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x846a9e]
 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x846cc8]
 4: (CephContextServiceThread::entry()+0x55) [0x8e01c5]
 5: /lib64/libpthread.so.0() [0x360de07d14]
 6: (clone()+0x6d) [0x360d6f167d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2013-01-17 15:24:35.301183 7f053f149700 -1 *** Caught signal (Aborted) **
 in thread 7f053f149700

 ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
 1: /usr/bin/ceph-osd() [0x82ea90]
 2: /lib64/libpthread.so.0() [0x360de0efe0]
 3: (gsignal()+0x35) [0x360d635925]
 4: (abort()+0x148) [0x360d6370d8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x3611660dad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
===================================================
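In case it helps: my reading is that the "30" and "300" in those messages correspond to the osd op thread timeouts, i.e. something like the following in ceph.conf (this is my assumption from the docs - I have not overridden these, so they should be at their defaults):

```
[osd]
    ; defaults, as I understand them - NOT something I have changed
    osd op thread timeout = 30            ; matches "had timed out after 30"
    osd op thread suicide timeout = 300   ; matches "had suicide timed out after 300"
```

Would raising those be a reasonable workaround, or would that just hide whatever is keeping the op_tp thread busy?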

How can I avoid this? Is it a bug, or have I done something wrong?

I'm running Ceph 0.56.1 from the official RPMs on Fedora 17.
The underlying disks and network connectivity have been tested, and nothing seems to be wrong there.

Thanks in advance for your assistance!
--
Jens Kristian Søgaard, Mermaid Consulting ApS,
j...@mermaidconsulting.dk,
http://www.mermaidconsulting.com/