Hello!
I have a 3-node Ceph cluster on Ubuntu 14.04.3 with Hammer 0.94.2 from the
ubuntu-cloud repository. My config and CRUSH map are attached below.
After attaching a Cinder volume, any of my OpenStack instances hangs after a
short period of time with an "[sda]: abort" message in the VM's kernel log.
When mapping the volume directly on my compute node with
rbd map --name client.openstack --keyfile client.openstack.key openstack-hdd/volume-da53d8d0-b361-4697-94ed-218b92c1541e
I see the same thing: a small amount of data gets written, and then the task hangs:
Sep 15 16:36:24 compute001 kernel: [ 1620.258823] Key type ceph registered
Sep 15 16:36:24 compute001 kernel: [ 1620.259143] libceph: loaded (mon/osd proto 15/24)
Sep 15 16:36:24 compute001 kernel: [ 1620.263448] rbd: loaded (major 251)
Sep 15 16:36:24 compute001 kernel: [ 1620.264948] libceph: client13757843 fsid b490cb36-ab9b-4dd1-b3bf-c022061a977e
Sep 15 16:36:24 compute001 kernel: [ 1620.265359] libceph: mon2 10.0.66.3:6789 session established
Sep 15 16:36:24 compute001 kernel: [ 1620.275268] rbd0: p1
Sep 15 16:36:24 compute001 kernel: [ 1620.275484] rbd: rbd0: added with size 0xe600000
Sep 15 16:41:24 compute001 kernel: [ 1920.445112] INFO: task fio:31185 blocked for more than 120 seconds.
Sep 15 16:41:24 compute001 kernel: [ 1920.445484] Not tainted 3.16.0-49-generic #65~14.04.1-Ubuntu
Sep 15 16:41:24 compute001 kernel: [ 1920.445835] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 15 16:41:24 compute001 kernel: [ 1920.446286] fio D ffff881fffab30c0 0 31185 1 0x00000004
Sep 15 16:41:24 compute001 kernel: [ 1920.446295] ffff881fba167b60 0000000000000046 ffff881fac8cbd20 ffff881fba167fd8
Sep 15 16:41:24 compute001 kernel: [ 1920.446302] 00000000000130c0 00000000000130c0 ffff881fd2a18a30 ffff881fba167c88
Sep 15 16:41:24 compute001 kernel: [ 1920.446308] ffff881fba167c90 7fffffffffffffff ffff881fac8cbd20 ffff881fac8cbd20
Sep 15 16:41:24 compute001 kernel: [ 1920.446315] Call Trace:
Sep 15 16:41:24 compute001 kernel: [ 1920.446333] [<ffffffff8176aa19>] schedule+0x29/0x70
Sep 15 16:41:24 compute001 kernel: [ 1920.446338] [<ffffffff81769df9>] schedule_timeout+0x229/0x2a0
Sep 15 16:41:24 compute001 kernel: [ 1920.446350] [<ffffffff810b4c54>] ? __wake_up+0x44/0x50
Sep 15 16:41:24 compute001 kernel: [ 1920.446357] [<ffffffff810d4158>] ? __call_rcu_nocb_enqueue+0xc8/0xd0
Sep 15 16:41:24 compute001 kernel: [ 1920.446363] [<ffffffff8176b516>] wait_for_completion+0xa6/0x160
Sep 15 16:41:24 compute001 kernel: [ 1920.446370] [<ffffffff810a1b30>] ? wake_up_state+0x20/0x20
Sep 15 16:41:24 compute001 kernel: [ 1920.446380] [<ffffffff8121dbb0>] exit_aio+0xe0/0xf0
Sep 15 16:41:24 compute001 kernel: [ 1920.446388] [<ffffffff8106ae40>] mmput+0x30/0x120
Sep 15 16:41:24 compute001 kernel: [ 1920.446395] [<ffffffff8107031c>] do_exit+0x26c/0xa60
Sep 15 16:41:24 compute001 kernel: [ 1920.446401] [<ffffffff810aafd2>] ? dequeue_entity+0x142/0x5c0
Sep 15 16:41:24 compute001 kernel: [ 1920.446407] [<ffffffff81070b8f>] do_group_exit+0x3f/0xa0
Sep 15 16:41:24 compute001 kernel: [ 1920.446416] [<ffffffff81080690>] get_signal_to_deliver+0x1d0/0x6f0
Sep 15 16:41:24 compute001 kernel: [ 1920.446426] [<ffffffff81012538>] do_signal+0x48/0xad0
Sep 15 16:41:24 compute001 kernel: [ 1920.446434] [<ffffffff81094a1a>] ? hrtimer_cancel+0x1a/0x30
Sep 15 16:41:24 compute001 kernel: [ 1920.446440] [<ffffffff8121d0f7>] ? read_events+0x207/0x230
Sep 15 16:41:24 compute001 kernel: [ 1920.446445] [<ffffffff81094420>] ? hrtimer_get_res+0x50/0x50
Sep 15 16:41:24 compute001 kernel: [ 1920.446451] [<ffffffff81013029>] do_notify_resume+0x69/0xb0
Sep 15 16:41:24 compute001 kernel: [ 1920.446459] [<ffffffff8176ed4a>] int_signal+0x12/0x17
At the same time I have no problems with CephFS mounted on this host via FUSE.
I rebuilt my cluster with an almost-default config and ended up with strange
behavior: with the CRUSH map named "crush-good" the cluster works fine, but as
soon as I remove the unused root "default" (or even just the OSDs from the
hosts under that root) the problem comes back. Adding the OSDs and hosts back
into the "default" root fixes it.
The hosts storage00[1-3] are listed in /etc/hosts; even [ssd|hdd]-st00[1-3]
are listed there with their public IPs, though I know that is not necessary.
All OSDs run on ext4, created with:
mkfs.ext4 -L osd-[n] -m0 -Tlargefile /dev/drive
and mounted with noatime.
All journals live on separate SSDs, two per host (one for the SSD OSDs, one
for the HDD OSDs), each journal being a 24 GB partition.
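(For reference, 24 GB is well above the usual rule of thumb from the Ceph docs, which sizes the journal as 2 * expected throughput * filestore max sync interval. A quick sketch of that calculation; the throughput figure below is an illustrative assumption, not a measurement of my hardware:)

```python
# Rule-of-thumb journal sizing from the Ceph docs:
#   osd journal size = 2 * (expected throughput * filestore max sync interval)
# The throughput value used here is an assumption for illustration only.

def journal_size_mb(throughput_mb_s, sync_interval_s=5):
    """filestore max sync interval defaults to 5 seconds."""
    return 2 * throughput_mb_s * sync_interval_s

# An SSD absorbing ~500 MB/s would need roughly a 5 GB journal,
# so a 24 GB partition leaves plenty of headroom.
print(journal_size_mb(500))  # 5000 (MB)
```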
crush-good (almost a copy of the example from the Ceph site :)):
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 20 osd.20
device 30 osd.30
device 31 osd.31
device 32 osd.32
device 40 osd.40
device 50 osd.50
device 51 osd.51
device 52 osd.52
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host ssd-st001 {
id -1 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.0 weight 1.000
}
host ssd-st002 {
id -2 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.20 weight 1.000
}
host ssd-st003 {
id -3 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.40 weight 1.000
}
host hdd-st001 {
id -4 # do not change unnecessarily
# weight 3.000
alg straw
hash 0 # rjenkins1
item osd.10 weight 1.000
item osd.11 weight 1.000
item osd.12 weight 1.000
}
host hdd-st002 {
id -5 # do not change unnecessarily
# weight 3.000
alg straw
hash 0 # rjenkins1
item osd.30 weight 1.000
item osd.31 weight 1.000
item osd.32 weight 1.000
}
host hdd-st003 {
id -6 # do not change unnecessarily
# weight 3.000
alg straw
hash 0 # rjenkins1
item osd.51 weight 1.000
item osd.52 weight 1.000
item osd.50 weight 1.000
}
root hdd {
id -7 # do not change unnecessarily
# weight 9.000
alg straw
hash 0 # rjenkins1
item hdd-st001 weight 3.000
item hdd-st002 weight 3.000
item hdd-st003 weight 3.000
}
root ssd {
id -8 # do not change unnecessarily
# weight 3.000
alg straw
hash 0 # rjenkins1
item ssd-st001 weight 1.000
item ssd-st002 weight 1.000
item ssd-st003 weight 1.000
}
host storage001 {
id -9 # do not change unnecessarily
# weight 4.000
alg straw2
hash 0 # rjenkins1
item osd.0 weight 1.000
item osd.10 weight 1.000
item osd.11 weight 1.000
item osd.12 weight 1.000
}
host storage002 {
id -11 # do not change unnecessarily
# weight 4.000
alg straw2
hash 0 # rjenkins1
item osd.20 weight 1.000
item osd.30 weight 1.000
item osd.31 weight 1.000
item osd.32 weight 1.000
}
host storage003 {
id -12 # do not change unnecessarily
# weight 4.000
alg straw2
hash 0 # rjenkins1
item osd.52 weight 1.000
item osd.51 weight 1.000
item osd.50 weight 1.000
item osd.40 weight 1.000
}
root default {
id -10 # do not change unnecessarily
# weight 12.000
alg straw2
hash 0 # rjenkins1
item storage001 weight 4.000
item storage002 weight 4.000
item storage003 weight 4.000
}
# rules
rule data {
ruleset 0
type replicated
min_size 2
max_size 2
step take hdd
step chooseleaf firstn 0 type host
step emit
}
rule metadata {
ruleset 1
type replicated
min_size 0
max_size 10
step take hdd
step chooseleaf firstn 0 type host
step emit
}
rule rbd {
ruleset 2
type replicated
min_size 0
max_size 10
step take hdd
step chooseleaf firstn 0 type host
step emit
}
rule hdd {
ruleset 3
type replicated
min_size 0
max_size 10
step take hdd
step chooseleaf firstn 0 type host
step emit
}
rule ssd {
ruleset 4
type replicated
min_size 0
max_size 4
step take ssd
step chooseleaf firstn 0 type host
step emit
}
rule ssd-primary {
ruleset 5
type replicated
min_size 5
max_size 10
step take ssd
step chooseleaf firstn 1 type host
step emit
step take hdd
step chooseleaf firstn -1 type host
step emit
}
# end crush map
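(For anyone wondering why removing or adding buckets reshuffles things at all: each straw2 bucket choice is a deterministic pseudo-random draw per item, so the mapping depends only on the input, the item set, and the weights. Below is a toy Python sketch of the straw2 idea; it is NOT the real CRUSH implementation; the hashing and scaling are simplified assumptions for illustration:)

```python
# Toy illustration of straw2-style bucket selection (not real CRUSH code):
# each item gets an independent, deterministic "draw" for a given input x,
# scaled by its weight; the item with the largest draw wins.
import hashlib
import math

def draw(x, item, weight):
    """Deterministic pseudo-uniform draw in (0, 1], scaled straw2-style."""
    h = hashlib.sha256(f"{x}:{item}".encode()).digest()
    u = (int.from_bytes(h[:8], "big") + 1) / 2**64  # uniform in (0, 1]
    # straw2: ln(u) / weight; larger weight -> less negative -> more likely to win
    return math.log(u) / weight

def choose(x, items):
    """items: dict of name -> weight; pick the winning item for input x."""
    return max(items, key=lambda i: draw(x, i, items[i]))

hosts = {"hdd-st001": 3.0, "hdd-st002": 3.0, "hdd-st003": 3.0}
placements = [choose(pg, hosts) for pg in range(512)]
# Deterministic: recomputing yields exactly the same mapping.
assert placements == [choose(pg, hosts) for pg in range(512)]
```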
My ceph.conf:
[global]
fsid = 85456792-2ded-4d61-a021-20f6038f2dee
mon_initial_members = storage001,storage002,storage003
public_network = 10.0.66.0/24
cluster_network = 10.0.65.0/24
auth cluster required = none
auth service required = none
auth client required = none
filestore_xattr_use_omap = true
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 256
osd_pool_default_pgp_num = 256
[mds]
mds_data = /var/lib/ceph/mds/mds.$id
keyring = /var/lib/ceph/mds/mds.$id/mds.$id.keyring
[mds.a]
public_addr = 10.0.66.1
cluster_addr = 10.0.65.1
host = storage001
[mds.b]
public_addr = 10.0.66.2
cluster_addr = 10.0.65.2
host = storage002
[mds.c]
public_addr = 10.0.66.3
cluster_addr = 10.0.65.3
host = storage003
[mon]
mon_host = 10.0.66.1,10.0.66.2,10.0.66.3
[mon.storage001]
mon_addr = 10.0.66.1
host = storage001
[mon.storage002]
mon_addr = 10.0.66.2
host = storage002
[mon.storage003]
mon_addr = 10.0.66.3
host = storage003
[client.openstack]
keyring = /etc/ceph/client.openstack.keyring
[osd]
osd_crush_update_on_start = false
[osd.0]
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001
[osd.10]
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001
[osd.11]
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001
[osd.12]
cluster_addr = 10.0.65.1
public_addr = 10.0.66.1
host = storage001
[osd.20]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002
[osd.30]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002
[osd.31]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002
[osd.32]
cluster_addr = 10.0.65.2
public_addr = 10.0.66.2
host = storage002
[osd.40]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003
[osd.50]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003
[osd.51]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003
[osd.52]
cluster_addr = 10.0.65.3
public_addr = 10.0.66.3
host = storage003
My pools:
pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 3 object_hash rjenkins
pg_num 64 pgp_num 64 last_change 92 flags hashpspool stripe_width 0
pool 4 'openstack-img' replicated size 2 min_size 1 crush_ruleset 3 object_hash
rjenkins pg_num 512 pgp_num 512 last_change 187 flags hashpspool stripe_width 0
pool 5 'openstack-hdd' replicated size 2 min_size 1 crush_ruleset 3 object_hash
rjenkins pg_num 512 pgp_num 512 last_change 114 flags hashpspool stripe_width 0
pool 6 'openstack-ssd' replicated size 2 min_size 1 crush_ruleset 4 object_hash
rjenkins pg_num 512 pgp_num 512 last_change 118 flags hashpspool stripe_width 0
pool 7 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 3
object_hash rjenkins pg_num 64 pgp_num 64 last_change 141 flags hashpspool
stripe_width 0
pool 8 'cephfs_data' replicated size 2 min_size 1 crush_ruleset 3 object_hash
rjenkins pg_num 128 pgp_num 128 last_change 145 flags hashpspool
crash_replay_interval 45 stripe_width 0
The first pool was created by the Ceph setup and is not used by me; I have only
changed its ruleset to 3.
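(As a side note, the pg_num values match the usual sizing guideline of roughly (number of OSDs * 100) / replica size, rounded to a power of two; a quick sketch, using the OSD count and pool size from my setup:)

```python
# Usual PG-count guideline from the Ceph docs:
#   total PGs ~ (number of OSDs * 100) / pool size, rounded to a power of two
import math

def suggested_pg_num(num_osds, pool_size, pgs_per_osd=100):
    target = num_osds * pgs_per_osd / pool_size
    return 2 ** round(math.log2(target))  # nearest power of two

# 12 OSDs with size=2 pools: 600 -> 512, matching the pools above.
print(suggested_pg_num(12, 2))  # 512
```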
So, why do I need the "default" root with OSDs in it? And why is this not
described in the docs? Or am I misunderstanding something?
--
WBR, Max A. Krasilnikov
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com