All OSDs and Monitors are up, as far as I can see.
I read through the PG troubleshooting section in the Ceph documentation and 
came to the conclusion that nothing there would help me, so I didn't try 
anything - except restarting / rebooting the OSDs and Monitors.

How do I recover from this? It looks to me as if the data itself should be safe 
for now, but why is it not recovering?
My guess is that the problem is the crushmap.
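
(For reference, this is roughly how I dumped and decompiled the map that is 
actually installed, shown further down; getcrushmap and crushtool are the 
standard tools, the file names are just what I picked:)

# ceph osd getcrushmap -o crushmap.bin
# crushtool -d crushmap.bin -o crushmap-actual.txt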

Here are some outputs:

#ceph health detail

HEALTH_WARN 475 pgs degraded; 640 pgs stale; 475 pgs stuck degraded; 640 pgs 
stuck stale; 640 pgs stuck unclean; 475 pgs stuck undersized; 475 pgs 
undersized; recovery 104812/279550 objects degraded (37.493%); recovery 
69926/279550 objects misplaced (25.014%)
pg 3.ec is stuck unclean for 3326815.935321, current state 
stale+active+remapped, last acting [7,6]
pg 3.ed is stuck unclean for 3288818.682456, current state 
stale+active+remapped, last acting [6,7]
pg 3.ee is stuck unclean for 409973.052061, current state 
stale+active+undersized+degraded, last acting [7]
pg 3.ef is stuck unclean for 3357894.554762, current state 
stale+active+undersized+degraded, last acting [7]
pg 3.e8 is stuck unclean for 384815.518837, current state 
stale+active+undersized+degraded, last acting [6]
pg 3.e9 is stuck unclean for 3274554.591000, current state 
stale+active+remapped, last acting [6,7]
......

################################################################################

This is the crushmap I created, intended to use, and thought I had been using 
for the past 2 months (a crushtool sanity check sketch follows after the map):
- pvestorage1-ssd and pvestorage1-platter are the same physical host; it seems 
this is not possible, but I never noticed
- likewise with pvestorage2

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host pvestorage1-ssd {
 id -2 # do not change unnecessarily
 # weight 1.740
 alg straw
 hash 0 # rjenkins1
 item osd.0 weight 0.870
 item osd.1 weight 0.870
}
host pvestorage2-ssd {
 id -3 # do not change unnecessarily
 # weight 1.740
 alg straw
 hash 0 # rjenkins1
 item osd.2 weight 0.870
 item osd.3 weight 0.870
}
host pvestorage1-platter {
 id -4 # do not change unnecessarily
 # weight 4
 alg straw
 hash 0 # rjenkins1
 item osd.4 weight 2.000
 item osd.5 weight 2.000
}
host pvestorage2-platter {
 id -5 # do not change unnecessarily
 # weight 4
 alg straw
 hash 0 # rjenkins1
 item osd.6 weight 2.000
 item osd.7 weight 2.000
}

root ssd {
 id -1 # do not change unnecessarily
 # weight 3.480
 alg straw
 hash 0 # rjenkins1
 item pvestorage1-ssd weight 1.740
 item pvestorage2-ssd weight 1.740
}

root platter {
 id -6 # do not change unnecessarily
 # weight 8
 alg straw
 hash 0 # rjenkins1
 item pvestorage1-platter weight 4.000
 item pvestorage2-platter weight 4.000
}

# rules
rule ssd {
 ruleset 0
 type replicated
 min_size 1
 max_size 10
 step take ssd
 step chooseleaf firstn 0 type host
 step emit
}

rule platter {
 ruleset 1
 type replicated
 min_size 1
 max_size 10
 step take platter
 step chooseleaf firstn 0 type host
 step emit
}
# end crush map
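
For completeness, this is roughly how I would compile and test-map the intended 
map before installing it - only a sketch, the file names and the replica count 
of 2 are my assumptions:

# crushtool -c crushmap-intended.txt -o crushmap-intended.bin
# crushtool --test -i crushmap-intended.bin --rule 0 --num-rep 2 --show-mappings | head
# crushtool --test -i crushmap-intended.bin --rule 1 --num-rep 2 --show-mappings | head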
################################################################################

This is what Ceph made of that crushmap, and the one that is actually in use 
right now - I never looked -_- :

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host pvestorage1-ssd {
 id -2 # do not change unnecessarily
 # weight 0.000
 alg straw
 hash 0 # rjenkins1
}
host pvestorage2-ssd {
 id -3 # do not change unnecessarily
 # weight 0.000
 alg straw
 hash 0 # rjenkins1
}
root ssd {
 id -1 # do not change unnecessarily
 # weight 0.000
 alg straw
 hash 0 # rjenkins1
 item pvestorage1-ssd weight 0.000
 item pvestorage2-ssd weight 0.000
}
host pvestorage1-platter {
 id -4 # do not change unnecessarily
 # weight 0.000
 alg straw
 hash 0 # rjenkins1
}
host pvestorage2-platter {
 id -5 # do not change unnecessarily
 # weight 0.000
 alg straw
 hash 0 # rjenkins1
}
root platter {
 id -6 # do not change unnecessarily
 # weight 0.000
 alg straw
 hash 0 # rjenkins1
 item pvestorage1-platter weight 0.000
 item pvestorage2-platter weight 0.000
}
host pvestorage1 {
 id -7 # do not change unnecessarily
 # weight 5.740
 alg straw
 hash 0 # rjenkins1
 item osd.5 weight 2.000
 item osd.4 weight 2.000
 item osd.1 weight 0.870
 item osd.0 weight 0.870
}
host pvestorage2 {
 id -9 # do not change unnecessarily
 # weight 5.740
 alg straw
 hash 0 # rjenkins1
 item osd.3 weight 0.870
 item osd.2 weight 0.870
 item osd.6 weight 2.000
 item osd.7 weight 2.000
}
root default {
 id -8 # do not change unnecessarily
 # weight 11.480
 alg straw
 hash 0 # rjenkins1
 item pvestorage1 weight 5.740
 item pvestorage2 weight 5.740
}

# rules
rule ssd {
 ruleset 0
 type replicated
 min_size 1
 max_size 10
 step take ssd
 step chooseleaf firstn 0 type host
 step emit
}
rule platter {
 ruleset 1
 type replicated
 min_size 1
 max_size 10
 step take platter
 step chooseleaf firstn 0 type host
 step emit
}

# end crush map
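
My first instinct would be to recompile the intended map and inject it with 
setcrushmap, roughly like this (standard workflow, file names are placeholders; 
I have not tried it here, which is part of why I am asking):

# crushtool -c crushmap-intended.txt -o crushmap-intended.bin
# ceph osd setcrushmap -i crushmap-intended.bin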
################################################################################

How do I recover from this?

Best Regards
Jonas
