Hi Cephers,

At the moment we are trying to recover our Ceph cluster (0.87), which is behaving very oddly.

What has been done so far:
1. An OSD drive failure occurred - Ceph marked the OSD down and out.
2. The physical HDD was replaced and NOT added to Ceph - here we had a strange
kernel crash just after the HDD was connected to the controller.
3. The physical host was rebooted.
4. Ceph started recovery and began marking OSDs down one by one (I can actually
see the osd process crashing in the logs; see below).
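To show how I am spotting this: I grep the OSD logs for the abort and keep an eye on the cluster state while recovery runs, roughly along these lines (log path as in our ceph.conf; osd.27 is just one of the affected OSDs):

  # find the crash plus a few lines of context in an OSD log
  grep -B 2 -A 5 "Caught signal (Aborted)" /var/log/ceph/ceph-osd.27.log

  # watch overall cluster / recovery state
  ceph -s
  ceph health detail
  ceph osd tree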
Our ceph.conf is attached; its contents are also pasted at the end of this message.

OSD failure log:
-4> 2016-02-26 23:20:47.906443 7f942b4b6700 5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906404, event: header_read, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
-3> 2016-02-26 23:20:47.906451 7f942b4b6700 5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906406, event: throttled, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
-2> 2016-02-26 23:20:47.906456 7f942b4b6700 5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906421, event: all_read, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
-1> 2016-02-26 23:20:47.906462 7f942b4b6700 5 -- op tracker -- seq: 471061, time: 0.000000, event: dispatched, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
0> 2016-02-26 23:20:47.931236 7f9434e0f700 -1 *** Caught signal (Aborted) **
in thread 7f9434e0f700
ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
1: /usr/bin/ceph-osd() [0x9e2015]
2: (()+0xfcb0) [0x7f945459fcb0]
3: (gsignal()+0x35) [0x7f94533d30d5]
4: (abort()+0x17b) [0x7f94533d683b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9453d2569d]
6: (()+0xb5846) [0x7f9453d23846]
7: (()+0xb5873) [0x7f9453d23873]
8: (()+0xb596e) [0x7f9453d2396e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x259) [0xacb979]
10: (SnapSet::get_clone_bytes(snapid_t) const+0x15f) [0x732c0f]
11: (ReplicatedPG::_scrub(ScrubMap&)+0x10c4) [0x7f5e54]
12: (PG::scrub_compare_maps()+0xcb6) [0x7876e6]
13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1c3) [0x7880b3]
14: (PG::scrub(ThreadPool::TPHandle&)+0x33d) [0x789abd]
15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x67ccf3]
16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xabb3ce]
17: (ThreadPool::WorkThread::entry()+0x10) [0xabe160]
18: (()+0x7e9a) [0x7f9454597e9a]
19: (clone()+0x6d) [0x7f94534912ed]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
-1/-1 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.27.log
Current OSD tree (ceph osd tree output):
# id    weight  type name                        up/down  reweight
-10     2       root ssdtree
-8      1           host ibstorage01-ssd1
9       1               osd.9                    up       1
-9      1           host ibstorage02-ssd1
10      1               osd.10                   up       1
-1      22.99   root default
-7      22.99       room cdsqv1
-3      22.99           rack gopc-rack01
-2      8                   host ibstorage01-sas1
0       1                       osd.0            down     0
1       1                       osd.1            up       1
2       1                       osd.2            up       1
3       1                       osd.3            down     0
7       1                       osd.7            up       1
4       1                       osd.4            up       1
5       1                       osd.5            up       1
6       1                       osd.6            up       1
-4      6.99                host ibstorage02-sas1
20      1                       osd.20           down     0
21      1.03                    osd.21           up       1
22      0.96                    osd.22           down     0
25      1                       osd.25           down     0
26      1                       osd.26           up       1
27      1                       osd.27           down     0
8       1                       osd.8            up       1
-11     8                   host ibstorage03-sas1
11      1                       osd.11           up       1
12      1                       osd.12           up       1
13      1                       osd.13           up       1
14      1                       osd.14           up       1
15      1                       osd.15           up       1
16      1                       osd.16           up       1
17      1                       osd.17           down     0
18      1                       osd.18           up       1
The OSD that originally failed was osd.23 on the host "ibstorage02-sas1" -- it has been deleted now.
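By "deleted" I mean the usual OSD removal sequence, roughly along these lines (from memory, so the exact invocation may have differed slightly):

  ceph osd out 23
  ceph osd crush remove osd.23
  ceph auth del osd.23
  ceph osd rm 23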
Any thoughts / things to check additionally?

Thanks!
[global]
# For version 0.55 and beyond, you must explicitly enable
# or disable authentication with "auth" entries in [global].
auth cluster required = none
auth service required = none
auth client required = none
osd pool default size = 2
osd pool default min size = 1
osd recovery max active = 1
osd deep scrub interval = 1814400
journal queue max ops = 1000
journal queue max bytes = 104857600
public network = 10.10.0.0/24
cluster network = 10.11.0.0/24
err to syslog = true
[mon]
mon cluster log to syslog = true
[osd]
osd journal size = 1700
osd mkfs type = "ext4"
osd mkfs options ext4 = user_xattr,rw,noatime
osd data = /srv/ceph/osd$id
osd journal = /srv/journal/osd$id/journal
osd crush update on start = false
[mon.1]
host = ibstorage01
mon addr = 10.10.0.48:6789
mon data = /srv/mondata
[mon.2]
host = ibstorage02
mon addr = 10.10.0.49:6789
mon data = /srv/mondata
[mon.3]
host = ibstorage03
mon addr = 10.10.0.50:6789
mon data = /srv/mondata
[osd.0]
host = ibstorage01
public addr = 10.10.0.48
cluster addr = 10.11.0.48
osd crush location = host=ibstorage01-sas1
[osd.1]
host = ibstorage01
public addr = 10.10.0.48
cluster addr = 10.11.0.48
osd crush location = host=ibstorage01-sas1
[osd.2]
host = ibstorage01
public addr = 10.10.0.48
cluster addr = 10.11.0.48
osd crush location = host=ibstorage01-sas1
[osd.3]
host = ibstorage01
public addr = 10.10.0.48
cluster addr = 10.11.0.48
osd crush location = host=ibstorage01-sas1
[osd.4]
host = ibstorage01
public addr = 10.10.0.48
cluster addr = 10.11.0.48
osd crush location = host=ibstorage01-sas2
[osd.5]
host = ibstorage01
public addr = 10.10.0.48
cluster addr = 10.11.0.48
osd crush location = host=ibstorage01-sas2
[osd.6]
host = ibstorage01
public addr = 10.10.0.48
cluster addr = 10.11.0.48
osd crush location = host=ibstorage01-sas2
[osd.7]
host = ibstorage01
public addr = 10.10.0.48
cluster addr = 10.11.0.48
osd crush location = host=ibstorage01-sas2
[osd.9]
host = ibstorage01
public addr = 10.10.0.48
cluster addr = 10.11.0.48
osd crush location = host=ibstorage01-ssd1
osd journal size = 10000
osd data = /srv/ceph/osd9
osd journal = /srv/ceph/osd9/journal
[osd.10]
host = ibstorage02
public addr = 10.10.0.49
cluster addr = 10.11.0.49
osd crush location = host=ibstorage02-ssd1
osd journal size = 10000
osd data = /srv/ceph/osd10
osd journal = /srv/ceph/osd10/journal
[osd.11]
host = ibstorage03
public addr = 10.10.0.50
cluster addr = 10.11.0.50
osd crush location = host=ibstorage03-sas1
[osd.12]
host = ibstorage03
public addr = 10.10.0.50
cluster addr = 10.11.0.50
osd crush location = host=ibstorage03-sas1
[osd.20]
host = ibstorage02
public addr = 10.10.0.49
cluster addr = 10.11.0.49
osd crush location = host=ibstorage02-sas1
[osd.21]
host = ibstorage02
public addr = 10.10.0.49
cluster addr = 10.11.0.49
osd crush location = host=ibstorage02-sas1
[osd.22]
host = ibstorage02
public addr = 10.10.0.49
cluster addr = 10.11.0.49
osd crush location = host=ibstorage02-sas1
[osd.23]
host = ibstorage02
public addr = 10.10.0.49
cluster addr = 10.11.0.49
osd crush location = host=ibstorage02-sas1
[osd.8]
host = ibstorage02
public addr = 10.10.0.49
cluster addr = 10.11.0.49
osd crush location = host=ibstorage02-sas1
[osd.25]
host = ibstorage02
public addr = 10.10.0.49
cluster addr = 10.11.0.49
osd crush location = host=ibstorage02-sas1
[osd.26]
host = ibstorage02
public addr = 10.10.0.49
cluster addr = 10.11.0.49
osd crush location = host=ibstorage02-sas1
[osd.27]
host = ibstorage02
public addr = 10.10.0.49
cluster addr = 10.11.0.49
osd crush location = host=ibstorage02-sas1
