Hi Cephers,

At the moment we are trying to recover our Ceph cluster (0.87), which is behaving very oddly.

What has been done so far:
1. An OSD drive failure occurred - Ceph marked the OSD down and out.
2. The physical HDD was replaced and NOT added to Ceph - here we had a strange
kernel crash just after the HDD was connected to the controller.
3. The physical host was rebooted.
4. Ceph started recovery and began marking OSDs down one by one (I can actually
see the osd process crashing in the logs; see below).
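To show how I am spotting this: I grep the OSD logs for the abort and keep an eye on the cluster state while recovery runs, roughly along these lines (log path as in our ceph.conf; osd.27 is just one of the affected OSDs):

  # find the crash plus a few lines of context in an OSD log
  grep -B 2 -A 5 "Caught signal (Aborted)" /var/log/ceph/ceph-osd.27.log

  # watch overall cluster / recovery state
  ceph -s
  ceph health detail
  ceph osd tree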
Our ceph.conf is attached; its contents are also pasted at the end of this message.

OSD failure log:
-4> 2016-02-26 23:20:47.906443 7f942b4b6700 5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906404, event: header_read, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
-3> 2016-02-26 23:20:47.906451 7f942b4b6700 5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906406, event: throttled, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
-2> 2016-02-26 23:20:47.906456 7f942b4b6700 5 -- op tracker -- seq: 471061, time: 2016-02-26 23:20:47.906421, event: all_read, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
-1> 2016-02-26 23:20:47.906462 7f942b4b6700 5 -- op tracker -- seq: 471061, time: 0.000000, event: dispatched, op: pg_backfill(progress 13.77 e 183964/183964 lb 45e69877/rb.0.25e43.6b8b4567.000000002c3b/head//13)
0> 2016-02-26 23:20:47.931236 7f9434e0f700 -1 *** Caught signal (Aborted) **
in thread 7f9434e0f700
ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
1: /usr/bin/ceph-osd() [0x9e2015]
2: (()+0xfcb0) [0x7f945459fcb0]
3: (gsignal()+0x35) [0x7f94533d30d5]
4: (abort()+0x17b) [0x7f94533d683b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9453d2569d]
6: (()+0xb5846) [0x7f9453d23846]
7: (()+0xb5873) [0x7f9453d23873]
8: (()+0xb596e) [0x7f9453d2396e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x259) [0xacb979]
10: (SnapSet::get_clone_bytes(snapid_t) const+0x15f) [0x732c0f]
11: (ReplicatedPG::_scrub(ScrubMap&)+0x10c4) [0x7f5e54]
12: (PG::scrub_compare_maps()+0xcb6) [0x7876e6]
13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1c3) [0x7880b3]
14: (PG::scrub(ThreadPool::TPHandle&)+0x33d) [0x789abd]
15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x67ccf3]
16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xabb3ce]
17: (ThreadPool::WorkThread::entry()+0x10) [0xabe160]
18: (()+0x7e9a) [0x7f9454597e9a]
19: (clone()+0x6d) [0x7f94534912ed]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
-1/-1 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.27.log
Current OSD tree (ceph osd tree output):
# id    weight  type name                        up/down  reweight
-10     2       root ssdtree
-8      1           host ibstorage01-ssd1
9       1               osd.9                    up       1
-9      1           host ibstorage02-ssd1
10      1               osd.10                   up       1
-1      22.99   root default
-7      22.99       room cdsqv1
-3      22.99           rack gopc-rack01
-2      8                   host ibstorage01-sas1
0       1                       osd.0            down     0
1       1                       osd.1            up       1
2       1                       osd.2            up       1
3       1                       osd.3            down     0
7       1                       osd.7            up       1
4       1                       osd.4            up       1
5       1                       osd.5            up       1
6       1                       osd.6            up       1
-4      6.99                host ibstorage02-sas1
20      1                       osd.20           down     0
21      1.03                    osd.21           up       1
22      0.96                    osd.22           down     0
25      1                       osd.25           down     0
26      1                       osd.26           up       1
27      1                       osd.27           down     0
8       1                       osd.8            up       1
-11     8                   host ibstorage03-sas1
11      1                       osd.11           up       1
12      1                       osd.12           up       1
13      1                       osd.13           up       1
14      1                       osd.14           up       1
15      1                       osd.15           up       1
16      1                       osd.16           up       1
17      1                       osd.17           down     0
18      1                       osd.18           up       1
The OSD that originally failed was osd.23 on the host "ibstorage02-sas1" -- it has been deleted now.
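By "deleted" I mean the usual OSD removal sequence, roughly along these lines (from memory, so the exact invocation may have differed slightly):

  ceph osd out 23
  ceph osd crush remove osd.23
  ceph auth del osd.23
  ceph osd rm 23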
Any thoughts / things to check additionally?

Thanks!
[global]
# For version 0.55 and beyond, you must explicitly enable
# or disable authentication with "auth" entries in [global].
auth cluster required = none
auth service required = none
auth client required = none
osd pool default size = 2
osd pool default min size = 1
osd recovery max active = 1
osd deep scrub interval = 1814400
journal queue max ops = 1000
journal queue max bytes = 104857600
public network = 10.10.0.0/24
cluster network = 10.11.0.0/24
err to syslog = true
[mon]
mon cluster log to syslog = true
[osd]
osd journal size = 1700
osd mkfs type = "ext4"
osd mkfs options ext4 = user_xattr,rw,noatime
osd data = /srv/ceph/osd$id
osd journal = /srv/journal/osd$id/journal
osd crush update on start = false
[mon.1]
host = ibstorage01
mon addr = 10.10.0.48:6789
mon data = /srv/mondata
[mon.2]
host = ibstorage02
mon addr = 10.10.0.49:6789
mon data = /srv/mondata
[mon.3]
host = ibstorage03
mon addr = 10.10.0.50:6789
mon data = /srv/mondata
[osd.0]
host = ibstorage01
public addr = 10.10.0.48
cluster addr = 10.11.0.48
osd crush location = host=ibstorage01-sas1
[osd.1]
host = ibstorage01
public addr = 10.10.0.48
cluster addr = 10.11.0.48
osd crush location = host=ibstorage01-sas1
[osd.2]
host = ibstorage01
public addr = 10.10.0.48
cluster addr = 10.11.0.48
osd crush location = host=ibstorage01-sas1
[osd.3]
host = ibstorage01
public addr = 10.10.0.48
cluster addr = 10.11.0.48
osd crush location = host=ibstorage01-sas1
[osd.4]
host = ibstorage01
public addr = 10.10.0.48
cluster addr = 10.11.0.48
osd crush location = host=ibstorage01-sas2
[osd.5]
host = ibstorage01
public addr = 10.10.0.48
cluster addr = 10.11.0.48
osd crush location = host=ibstorage01-sas2
[osd.6]
host = ibstorage01
public addr = 10.10.0.48
cluster addr = 10.11.0.48
osd crush location = host=ibstorage01-sas2
[osd.7]
host = ibstorage01
public addr = 10.10.0.48
cluster addr = 10.11.0.48
osd crush location = host=ibstorage01-sas2
[osd.9]
host = ibstorage01
public addr = 10.10.0.48
cluster addr = 10.11.0.48
osd crush location = host=ibstorage01-ssd1
osd journal size = 10000
osd data = /srv/ceph/osd9
osd journal = /srv/ceph/osd9/journal
[osd.10]
host = ibstorage02
public addr = 10.10.0.49
cluster addr = 10.11.0.49
osd crush location = host=ibstorage02-ssd1
osd journal size = 10000
osd data = /srv/ceph/osd10
osd journal = /srv/ceph/osd10/journal
[osd.11]
host = ibstorage03
public addr = 10.10.0.50
cluster addr = 10.11.0.50
osd crush location = host=ibstorage03-sas1
[osd.12]
host = ibstorage03
public addr = 10.10.0.50
cluster addr = 10.11.0.50
osd crush location = host=ibstorage03-sas1
[osd.20]
host = ibstorage02
public addr = 10.10.0.49
cluster addr = 10.11.0.49
osd crush location = host=ibstorage02-sas1
[osd.21]
host = ibstorage02
public addr = 10.10.0.49
cluster addr = 10.11.0.49
osd crush location = host=ibstorage02-sas1
[osd.22]
host = ibstorage02
public addr = 10.10.0.49
cluster addr = 10.11.0.49
osd crush location = host=ibstorage02-sas1
[osd.23]
host = ibstorage02
public addr = 10.10.0.49
cluster addr = 10.11.0.49
osd crush location = host=ibstorage02-sas1
[osd.8]
host = ibstorage02
public addr = 10.10.0.49
cluster addr = 10.11.0.49
osd crush location = host=ibstorage02-sas1
[osd.25]
host = ibstorage02
public addr = 10.10.0.49
cluster addr = 10.11.0.49
osd crush location = host=ibstorage02-sas1
[osd.26]
host = ibstorage02
public addr = 10.10.0.49
cluster addr = 10.11.0.49
osd crush location = host=ibstorage02-sas1
[osd.27]
host = ibstorage02
public addr = 10.10.0.49
cluster addr = 10.11.0.49
osd crush location = host=ibstorage02-sas1
