In an OpenStack (mitaka) cloud, backed by a ceph cluster (10.2.6 jewel), using
libvirt/qemu (1.3.1/2.5) hypervisors on Ubuntu 14.04.5 compute and ceph hosts,
we occasionally see hung processes in guests (usually during boot, but at
other times as well), with errors reported in the instance logs as shown
below.  Configuration is vanilla, based on the openstack/ceph docs.
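
For concreteness, the rbd bits of nova.conf on the compute hosts follow the
standard docs example, roughly this (pool/user names and the secret uuid are
placeholders here, not our actual values):

    [libvirt]
    images_type = rbd
    images_rbd_pool = vms
    images_rbd_ceph_conf = /etc/ceph/ceph.conf
    rbd_user = cinder
    rbd_secret_uuid = <libvirt secret uuid>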

Neither the compute hosts nor the ceph hosts appear to be overloaded in terms
of memory or network bandwidth; none of the 67 osds is more than 80% full, and
none of them appears to be overwhelmed by IO.  Compute hosts and the ceph
cluster are connected via a relatively quiet 1Gb network, with an IBoE net
between the ceph nodes.  Neither network appears overloaded.
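
(For reference, those observations come from the usual checks, roughly:

    ceph -s          # overall cluster health
    ceph osd df      # per-osd utilization -- nothing over 80%
    ceph osd perf    # per-osd commit/apply latency
    iostat -x 5      # per-disk IO on the osd hosts

Nothing out of the ordinary in any of them.)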

I don’t see any errors in the client or server logs that look related (to my
eye), even with 20/20 logging enabled for various components (rbd, rados,
client, objectcacher, etc.).  I’ve also increased the qemu file descriptor
limit (currently 64k... overkill for sure).
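
For the record, the 20/20 logging was turned up with client-side settings
roughly like these in ceph.conf on the compute hosts (values illustrative),
and one way to raise the qemu fd limit is the max_files knob in
/etc/libvirt/qemu.conf:

    [client]
    debug rbd = 20/20
    debug rados = 20/20
    debug objectcacher = 20/20
    debug client = 20/20
    log file = /var/log/ceph/qemu-client.$pid.log

    # /etc/libvirt/qemu.conf
    max_files = 65536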

It “feels” like a performance problem, but I can’t find any capacity issues or
constraining bottlenecks.
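
If more data would help: I can enable a client admin socket on a compute host
and dump the in-flight rados requests the next time a guest hangs.  A sketch
(socket path/name illustrative; the actual socket is per qemu process):

    # ceph.conf [client] on the compute host
    admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok

    # then, while a guest is hung (pick the socket belonging to the
    # qemu process in question):
    ceph --admin-daemon /var/run/ceph/ceph-client.cinder.<pid>.<cctid>.asok objecter_requests
    ceph --admin-daemon /var/run/ceph/ceph-client.cinder.<pid>.<cctid>.asok perf dump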

Any suggestions or insights into this situation are appreciated.  Thank you for 
your time,
--
Eric


[Fri Mar 24 20:30:40 2017] INFO: task jbd2/vda1-8:226 blocked for more than 120 seconds.
[Fri Mar 24 20:30:40 2017]       Not tainted 3.13.0-52-generic #85-Ubuntu
[Fri Mar 24 20:30:40 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Mar 24 20:30:40 2017] jbd2/vda1-8     D ffff88043fd13180     0   226      2 0x00000000
[Fri Mar 24 20:30:40 2017]  ffff88003728bbd8 0000000000000046 ffff880426900000 ffff88003728bfd8
[Fri Mar 24 20:30:40 2017]  0000000000013180 0000000000013180 ffff880426900000 ffff88043fd13a18
[Fri Mar 24 20:30:40 2017]  ffff88043ffb9478 0000000000000002 ffffffff811ef7c0 ffff88003728bc50
[Fri Mar 24 20:30:40 2017] Call Trace:
[Fri Mar 24 20:30:40 2017]  [<ffffffff811ef7c0>] ? generic_block_bmap+0x50/0x50
[Fri Mar 24 20:30:40 2017]  [<ffffffff81726d2d>] io_schedule+0x9d/0x140
[Fri Mar 24 20:30:40 2017]  [<ffffffff811ef7ce>] sleep_on_buffer+0xe/0x20
[Fri Mar 24 20:30:40 2017]  [<ffffffff817271b2>] __wait_on_bit+0x62/0x90
[Fri Mar 24 20:30:40 2017]  [<ffffffff811ef7c0>] ? generic_block_bmap+0x50/0x50
[Fri Mar 24 20:30:40 2017]  [<ffffffff81727257>] out_of_line_wait_on_bit+0x77/0x90
[Fri Mar 24 20:30:40 2017]  [<ffffffff810ab180>] ? autoremove_wake_function+0x40/0x40
[Fri Mar 24 20:30:40 2017]  [<ffffffff811f0afa>] __wait_on_buffer+0x2a/0x30
[Fri Mar 24 20:30:40 2017]  [<ffffffff8128bb4d>] jbd2_journal_commit_transaction+0x185d/0x1ab0
[Fri Mar 24 20:30:40 2017]  [<ffffffff810755df>] ? try_to_del_timer_sync+0x4f/0x70
[Fri Mar 24 20:30:40 2017]  [<ffffffff8128fe7d>] kjournald2+0xbd/0x250
[Fri Mar 24 20:30:40 2017]  [<ffffffff810ab140>] ? prepare_to_wait_event+0x100/0x100
[Fri Mar 24 20:30:40 2017]  [<ffffffff8128fdc0>] ? commit_timeout+0x10/0x10
[Fri Mar 24 20:30:40 2017]  [<ffffffff8108b5d2>] kthread+0xd2/0xf0
[Fri Mar 24 20:30:40 2017]  [<ffffffff8108b500>] ? kthread_create_on_node+0x1c0/0x1c0
[Fri Mar 24 20:30:40 2017]  [<ffffffff8173304c>] ret_from_fork+0x7c/0xb0
[Fri Mar 24 20:30:40 2017]  [<ffffffff8108b500>] ? kthread_create_on_node+0x1c0/0x1c0


