Hi Dilip,

Looking at the output of ceph -s it's still recovering (there are still pgs in recovery_wait, backfill_wait and recovering states), so you'll have to be patient and let Ceph recover.
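As an aside, the degraded/misplaced/unfound percentages in that output are just the raw counts divided by the totals shown on the same lines, so you can track progress by watching those numbers fall. A quick sanity check in plain Python (the counts are copied verbatim from your ceph -s output; nothing else is assumed):

```python
# Counts copied from the `ceph -s` output quoted below.
degraded = 95173        # degraded object copies
misplaced = 103542      # misplaced object copies
total_copies = 453987   # total object copies (replicas included)
unfound = 1
total_objects = 149832  # unique objects

print(f"degraded:  {100 * degraded / total_copies:.3f}%")   # -> 20.964%
print(f"misplaced: {100 * misplaced / total_copies:.3f}%")  # -> 22.807%
print(f"unfound:   {100 * unfound / total_objects:.3f}%")   # -> 0.001%
```

These match the percentages Ceph reports, and they will shrink steadily as backfill and recovery proceed.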
The output of ceph osd dump doesn't mention osd.7 (it's referring to pool 7).

Kind regards,
Caspar Smit

2018-04-18 11:10 GMT+02:00 Dilip Renkila <[email protected]>:
> Hi all,
>
> We recently had an OSD breakdown. After that I manually added OSDs,
> thinking that Ceph would repair itself.
>
> I am running Ceph version 11:
>
> root@node16:~# ceph -v
> ceph version 11.2.1 (e0354f9d3b1eea1d75a7dd487ba8098311be38a7)
>
> root@node16:~# ceph -s
>     cluster 7c75f6e9-b858-4ac4-aa26-48ae1f33eda2
>      health HEALTH_WARN
>             371 pgs backfill_wait
>             372 pgs degraded
>             1 pgs recovering
>             3 pgs recovery_wait
>             372 pgs stuck degraded
>             375 pgs stuck unclean
>             372 pgs stuck undersized
>             372 pgs undersized
>             2 requests are blocked > 32 sec
>             recovery 95173/453987 objects degraded (20.964%)
>             recovery 103542/453987 objects misplaced (22.807%)
>             recovery 1/149832 unfound (0.001%)
>             pool cinder-volumes pg_num 300 > pgp_num 128
>             pool ephemeral-vms pg_num 300 > pgp_num 128
>             1 mons down, quorum 0,1 node15,node16
>      monmap e2: 3 mons at {node15=10.0.5.15:6789/0,node16=10.0.5.16:6789/0,node17=10.0.5.17:6789/0}
>             election epoch 1226, quorum 0,1 node15,node16
>         mgr active: node16
>      osdmap e7858: 6 osds: 6 up, 6 in; 375 remapped pgs
>             flags sortbitwise,require_jewel_osds,require_kraken_osds
>       pgmap v16570651: 600 pgs, 2 pools, 571 GB data, 146 kobjects
>             1363 GB used, 4202 GB / 5566 GB avail
>             95173/453987 objects degraded (20.964%)
>             103542/453987 objects misplaced (22.807%)
>             1/149832 unfound (0.001%)
>                  368 active+undersized+degraded+remapped+backfill_wait
>                  225 active+clean
>                    3 active+remapped+backfill_wait
>                    3 active+recovery_wait+undersized+degraded+remapped
>                    1 active+recovering+undersized+degraded+remapped
>   client io 17441 B/s rd, 271 kB/s wr, 42 op/s rd, 26 op/s wr
>
> Many pgs are stuck degraded, remapped, etc.
>
> root@node16:~# ceph osd tree
> ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 5.81839 root default
> -2 1.81839     host node9
> 11 0.90919         osd.11       up  1.00000          1.00000
>  1 0.90919         osd.1        up  1.00000          1.00000
> -3 2.00000     host node10
>  0 1.00000         osd.0        up  1.00000          1.00000
>  2 1.00000         osd.2        up  1.00000          1.00000
> -4 2.00000     host node8
>  3 1.00000         osd.3        up  1.00000          1.00000
>  6 1.00000         osd.6        up  1.00000          1.00000
>
> I have attached the output of ceph osd dump
> <https://pastebin.com/TznUZVFz>. Interestingly, you can see pg_temp.
> What does that mean, and why is osd.7 involved there?
>
> Here is the crush map:
>
> root@node16:~# cat /tmp/crush.txt
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
> tunable straw_calc_version 1
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 device4
> device 5 device5
> device 6 osd.6
> device 7 device7
> device 8 device8
> device 9 device9
> device 10 device10
> device 11 osd.11
>
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
>
> # buckets
> host node9 {
>         id -2           # do not change unnecessarily
>         # weight 1.818
>         alg straw
>         hash 0  # rjenkins1
>         item osd.11 weight 0.909
>         item osd.1 weight 0.909
> }
> host node10 {
>         id -3           # do not change unnecessarily
>         # weight 2.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.0 weight 1.000
>         item osd.2 weight 1.000
> }
> host node8 {
>         id -4           # do not change unnecessarily
>         # weight 2.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.3 weight 1.000
>         item osd.6 weight 1.000
> }
> root default {
>         id -1           # do not change unnecessarily
>         # weight 5.818
>         alg straw
>         hash 0  # rjenkins1
>         item node9 weight 1.818
>         item node10 weight 2.000
>         item node8 weight 2.000
> }
>
> # rules
> rule replicated_ruleset {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type host
>         step emit
> }
>
> # end crush map
>
> But the interesting thing is that I am seeing the following line in all OSD logs:
>
> 2018-04-18 10:57:23.437006 7f883a14b700  0 -- 10.0.5.10:6802/25296 >> -
> conn(0x55f90cf8f000 :6802 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half accept state just closed
> 2018-04-18 10:57:26.715861 7f883a14b700  0 -- 10.0.5.10:6802/25296 >> -
> conn(0x55f90cf8f000 :6802 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half accept state just closed
> 2018-04-18 10:57:38.435193 7f883a14b700  0 -- 10.0.5.10:6802/25296 >> -
> conn(0x55f90d3d4800 :6802 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half accept state just closed
> 2018-04-18 10:57:41.717710 7f883a14b700  0 -- 10.0.5.10:6802/25296 >> -
> conn(0x55f8e2944800 :6802 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half accept state just closed
>
> What does this mean?
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
