Ceph slow request & unstable issue

Chen, Xiaoxi Tue, 15 Jan 2013 16:58:57 -0800

Hi list,
        We are suffering from OSD or OS down when there is continuing high 
pressure on the Ceph rack.
        Basically we are on Ubuntu 12.04+ Ceph 0.56.1, 6 nodes, in each nodes 
with 20 * spindles + 4* SSDs as journal.(120 spindles in total)
        We create a lots of RBD volumes (say 240),mounting to 16 different 
client machines ( 15 RBD Volumes/ client) and running DD concurrently on top of 
each RBD.
        
        The issues are:
1. Slow requests 
　　From the list-archive it seems solved in 0.56.1 but we still notice such 
warning
2. OSD Down or even host down
Like the message below.Seems some OSD has been blocking there for quite a long 
time.


        Suggestions are highly appreciate.Thanks
                                                                                
                                                                                
        Xiaoxi 

_____________________________________________

Bad news:

I have  back all my Ceph machine’s OS to kernel  3.2.0-23, which Ubuntu 12.04 
use.
I run dd command (dd if=/dev/zero bs=1M count=60000 of=/dev/rbd${i} & )on Ceph 
client to create data prepare test at last night.
Now, I have one machine down (can’t be reached by ping), another two machine 
has all OSD daemon down, while the three left has some daemon down.

I have many warnings in OSD log like this:

no flag points reached
2013-01-15 19:14:22.769898 7f20a2d57700  0 log [WRN] : slow request 52.218106 
seconds old, received at 2013-01-15 19:13:30.551718: 
osd_op(client.10674.1:1002417 rb.0.27a8.6b8b4567.000000000eba [write 
3145728~524288] 2.c61810ee RETRY) currently waiting for sub ops
2013-01-15 19:14:23.770077 7f20a2d57700  0 log [WRN] : 21 slow requests, 6 
included below; oldest blocked for > 1132.138983 secs
2013-01-15 19:14:23.770086 7f20a2d57700  0 log [WRN] : slow request 53.216404 
seconds old, received at 2013-01-15 19:13:30.553616: 
osd_op(client.10671.1:1066860 rb.0.282c.6b8b4567.000000001057 [write 
2621440~524288] 2.ea7acebc) currently waiting for sub ops
2013-01-15 19:14:23.770096 7f20a2d57700  0 log [WRN] : slow request 51.442032 
seconds old, received at 2013-01-15 19:13:32.327988: 
osd_op(client.10674.1:1002418

Similar info in dmesg we have saw pervious:

[21199.036476] INFO: task ceph-osd:7788 blocked for more than 120 seconds.
[21199.037493] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[21199.038841] ceph-osd        D 0000000000000006     0  7788      1 0x00000000
[21199.038844]  ffff880fefdafcc8 0000000000000086 0000000000000000 
ffffffffffffffe0
[21199.038848]  ffff880fefdaffd8 ffff880fefdaffd8 ffff880fefdaffd8 
0000000000013780
[21199.038852]  ffff88081aa58000 ffff880f68f52de0 ffff880f68f52de0 
ffff882017556200
[21199.038856] Call Trace:
[21199.038858]  [<ffffffff8165a55f>] schedule+0x3f/0x60
[21199.038861]  [<ffffffff8106b7e5>] exit_mm+0x85/0x130
[21199.038864]  [<ffffffff8106b9fe>] do_exit+0x16e/0x420
[21199.038866]  [<ffffffff8109d88f>] ? __unqueue_futex+0x3f/0x80
[21199.038869]  [<ffffffff8107a19a>] ? __dequeue_signal+0x6a/0xb0
[21199.038872]  [<ffffffff8106be54>] do_group_exit+0x44/0xa0
[21199.038874]  [<ffffffff8107ccdc>] get_signal_to_deliver+0x21c/0x420
[21199.038877]  [<ffffffff81013865>] do_signal+0x45/0x130
[21199.038880]  [<ffffffff810a091c>] ? do_futex+0x7c/0x1b0
[21199.038882]  [<ffffffff810a0b5a>] ? sys_futex+0x10a/0x1a0
[21199.038885]  [<ffffffff81013b15>] do_notify_resume+0x65/0x80
[21199.038887]  [<ffffffff81664d50>] int_signal+0x12/0x17


Thanks.
-chen

N�Р骒r��y����b�X�肚�v�^�)藓{.n�+���z�]z鳐�{ay������,j��f＂�h���z��wア�
⒎�j:+v���w�j�m������赙zZ+�����茛j"��!�i

Ceph slow request & unstable issue

Reply via email to