Hi Dilip,

Looking at the output of ceph -s, the cluster is still recovering (there are
still pgs in recovery_wait, backfill_wait and recovering states), so you will
have to be patient and let Ceph finish its recovery.
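
In the meantime you can keep an eye on progress with the usual status
commands, for example:

    watch -n 10 ceph -s          # the degraded/misplaced counters should keep dropping
    ceph health detail           # lists which pgs are stuck and why
    ceph pg dump_stuck unclean   # the pgs that are not yet active+clean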

The output of ceph osd dump doesn't mention osd.7 at all. The "7" in the
pg_temp entries is a pool id, not an osd id: pg ids are written as
<pool_id>.<hex>, so pg_temp 7.xx is a temporary mapping for a pg in pool 7.
pg_temp entries are normal while a cluster is recovering; they pin a pg to
its current acting set until backfill to the new osds completes, and they
disappear on their own afterwards.
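
If you want to double-check, something along these lines should confirm it
(the id-to-name mapping shown is just whatever your cluster reports):

    ceph osd lspools                 # prints "<id> <name>" pairs; pool 7 will be one of them
    ceph osd dump | grep '^pg_temp'  # pg ids are <pool_id>.<hex>, so pg_temp 7.xx = pool 7

And since your osd tree only contains osds 0, 1, 2, 3, 6 and 11, there is no
osd.7 that could be involved.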

Kind regards,
Caspar Smit

2018-04-18 11:10 GMT+02:00 Dilip Renkila <[email protected]>:

> Hi all,
>
> We recently had an OSD breakdown. After that I manually added OSDs,
> thinking that Ceph would repair itself.
>
> I am running Ceph version 11:
>
> root@node16:~# ceph -v
> ceph version 11.2.1 (e0354f9d3b1eea1d75a7dd487ba8098311be38a7)
>
> root@node16:~# ceph -s
>     cluster 7c75f6e9-b858-4ac4-aa26-48ae1f33eda2
>      health HEALTH_WARN
>             371 pgs backfill_wait
>             372 pgs degraded
>             1 pgs recovering
>             3 pgs recovery_wait
>             372 pgs stuck degraded
>             375 pgs stuck unclean
>             372 pgs stuck undersized
>             372 pgs undersized
>             2 requests are blocked > 32 sec
>             recovery 95173/453987 objects degraded (20.964%)
>             recovery 103542/453987 objects misplaced (22.807%)
>             recovery 1/149832 unfound (0.001%)
>             pool cinder-volumes pg_num 300 > pgp_num 128
>             pool ephemeral-vms pg_num 300 > pgp_num 128
>             1 mons down, quorum 0,1 node15,node16
>      monmap e2: 3 mons at {node15=10.0.5.15:6789/0,
> node16=10.0.5.16:6789/0,node17=10.0.5.17:6789/0}
>             election epoch 1226, quorum 0,1 node15,node16
>         mgr active: node16
>      osdmap e7858: 6 osds: 6 up, 6 in; 375 remapped pgs
>             flags sortbitwise,require_jewel_osds,require_kraken_osds
>       pgmap v16570651: 600 pgs, 2 pools, 571 GB data, 146 kobjects
>             1363 GB used, 4202 GB / 5566 GB avail
>             95173/453987 objects degraded (20.964%)
>             103542/453987 objects misplaced (22.807%)
>             1/149832 unfound (0.001%)
>                  368 active+undersized+degraded+remapped+backfill_wait
>                  225 active+clean
>                    3 active+remapped+backfill_wait
>                    3 active+recovery_wait+undersized+degraded+remapped
>                    1 active+recovering+undersized+degraded+remapped
>   client io 17441 B/s rd, 271 kB/s wr, 42 op/s rd, 26 op/s wr
>
> Many pgs are stuck degraded, remapped, etc.
>
>
> root@node16:~# ceph osd tree
> ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 5.81839 root default
> -2 1.81839     host node9
> 11 0.90919         osd.11       up  1.00000          1.00000
>  1 0.90919         osd.1        up  1.00000          1.00000
> -3 2.00000     host node10
>  0 1.00000         osd.0        up  1.00000          1.00000
>  2 1.00000         osd.2        up  1.00000          1.00000
> -4 2.00000     host node8
>  3 1.00000         osd.3        up  1.00000          1.00000
>  6 1.00000         osd.6        up  1.00000          1.00000
>
>
> I have attached the output of ceph osd dump
> <https://pastebin.com/TznUZVFz>. Interestingly, you can see pg_temp
> entries there. What does that mean, and why is osd 7 involved?
>
> Here is the crush map:
>
> root@node16:~# cat /tmp/crush.txt
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
> tunable straw_calc_version 1
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 device4
> device 5 device5
> device 6 osd.6
> device 7 device7
> device 8 device8
> device 9 device9
> device 10 device10
> device 11 osd.11
>
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
>
> # buckets
> host node9 {
> 	id -2		# do not change unnecessarily
> 	# weight 1.818
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.11 weight 0.909
> 	item osd.1 weight 0.909
> }
> host node10 {
> 	id -3		# do not change unnecessarily
> 	# weight 2.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.0 weight 1.000
> 	item osd.2 weight 1.000
> }
> host node8 {
> 	id -4		# do not change unnecessarily
> 	# weight 2.000
> 	alg straw
> 	hash 0	# rjenkins1
> 	item osd.3 weight 1.000
> 	item osd.6 weight 1.000
> }
> root default {
> 	id -1		# do not change unnecessarily
> 	# weight 5.818
> 	alg straw
> 	hash 0	# rjenkins1
> 	item node9 weight 1.818
> 	item node10 weight 2.000
> 	item node8 weight 2.000
> }
>
> # rules
> rule replicated_ruleset {
> 	ruleset 0
> 	type replicated
> 	min_size 1
> 	max_size 10
> 	step take default
> 	step chooseleaf firstn 0 type host
> 	step emit
> }
>
> # end crush map
>
>
>
> But the interesting thing is that I am seeing the following lines in all
> OSD logs:
>
>
> 2018-04-18 10:57:23.437006 7f883a14b700  0 -- 10.0.5.10:6802/25296 >> -
> conn(0x55f90cf8f000 :6802 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2018-04-18 10:57:26.715861 7f883a14b700  0 -- 10.0.5.10:6802/25296 >> -
> conn(0x55f90cf8f000 :6802 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2018-04-18 10:57:38.435193 7f883a14b700  0 -- 10.0.5.10:6802/25296 >> -
> conn(0x55f90d3d4800 :6802 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
> 2018-04-18 10:57:41.717710 7f883a14b700  0 -- 10.0.5.10:6802/25296 >> -
> conn(0x55f8e2944800 :6802 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0).fault with nothing to send and in the half  accept state just closed
>
>
> What does this mean?
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
