On Tue, Jan 6, 2015 at 3:31 PM, Chaitanya Huilgol <[email protected]> wrote:
> Hi Ilya,
>
> The RBD crash on OSD nodes going away is routinely hit in our setups.
> We have not been able to get a good stack trace for this one due to our
> console capture issues and these don't end up in the syslogs either after the
> crash. Will get you the traces soon.
> Most of the times this happens when all the OSD nodes go away at once. This
> could have probably been fixed by one of the following commits?
>
> Ilya Dryomov
> libceph: change from BUG to WARN for __remove_osd() asserts
> idryomov authored on Nov 5
> cc9f1f5
> Ilya Dryomov
> libceph: clear r_req_lru_item in __unregister_linger_request()
> idryomov authored on Nov 5
> ba9d114
> Ilya Dryomov
> libceph: unlink from o_linger_requests when clearing r_osd
> idryomov authored on Nov 4
> a390de0
Yes, but probably others as well.

>
> Also, We have encountered a few other issues listed below
>
> (1) Soft Lockup issue
> Dec 10 11:22:28 rack3-client-1 kernel: [661597.506625] BUG: soft lockup -
> CPU#2 stuck for 22s! [java:29169] --- (vdbench process)
> .
> .
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.043935] Call Trace:
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>]
> con_work+0x298/0x640 [libceph]
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.152461] [<ffffffff810838a2>]
> process_one_work+0x182/0x450
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.206653] [<ffffffff81084641>]
> worker_thread+0x121/0x410
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ?
> rescuer_thread+0x3e0/0x3e0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.312023] [<ffffffff8108b312>]
> kthread+0xd2/0xf0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.362974] [<ffffffff8108b240>] ?
> kthread_create_on_node+0x1d0/0x1d0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.414058] [<ffffffff8172637c>]
> ret_from_fork+0x7c/0xb0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.464358] [<ffffffff8108b240>] ?
> kthread_create_on_node+0x1d0/0x1d0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.514121] Code: ff ff 48 89 df e8
> e3 f1 ff ff 48 8b 7d a8 e8 7a 8c 0e e1 48 8b 7d b0 e8 41 d8 a7 e0 48 83 c4 30
> 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 48 8b 45 b8 49 8b 0e 4c 89 f2 48 c7
> c6 d0 76 64 a0 48 c7
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.663443] RIP [<ffffffffa063340e>]
> osd_reset+0x22e/0x2c0 [libceph]
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.712105] RSP <ffff880a22b8bd80>
>
> (2) Soft lockup when OSDs are flapping
>
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.089489] BUG: soft lockup -
> CPU#4 stuck for 23s! [kworker/4:0:45012]
> .
> .
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098648] Call Trace:
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098653] [<ffffffffa030d963>]
> kick_requests+0x1e3/0x440 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098657] [<ffffffffa030df98>]
> ceph_osdc_handle_map+0x2a8/0x620 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098662] [<ffffffffa030e55b>]
> dispatch+0x24b/0xb20 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098665] [<ffffffffa0301c08>] ?
> ceph_tcp_recvmsg+0x48/0x60 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098669] [<ffffffffa030552f>]
> con_work+0x164f/0x2b60 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098672] [<ffffffff8101b7d9>] ?
> sched_clock+0x9/0x10
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098674] [<ffffffff8101b763>] ?
> native_sched_clock+0x13/0x80
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098676] [<ffffffff8101b7d9>] ?
> sched_clock+0x9/0x10
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098679] [<ffffffff8109d2d5>] ?
> sched_clock_cpu+0xb5/0x100
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098681] [<ffffffff8109df6d>] ?
> vtime_common_task_switch+0x3d/0x40
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098684] [<ffffffff810838a2>]
> process_one_work+0x182/0x450
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098686] [<ffffffff81084641>]
> worker_thread+0x121/0x410
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098688] [<ffffffff81084520>] ?
> rescuer_thread+0x3e0/0x3e0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098690] [<ffffffff8108b312>]
> kthread+0xd2/0xf0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098692] [<ffffffff8108b240>] ?
> kthread_create_on_node+0x1d0/0x1d0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098695] [<ffffffff8172637c>]
> ret_from_fork+0x7c/0xb0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098697] [<ffffffff8108b240>] ?
> kthread_create_on_node+0x1d0/0x1d0
>
> (3) BUG_ON(!list_empty(&req->r_req_lru_item));
>
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320359.828209] kernel BUG at
> /build/buildd/linux-3.13.0/net/ceph/osd_client.c:892!
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.043935] Call Trace:
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>]
> con_work+0x298/0x640 [libceph]
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.152461] [<ffffffff810838a2>]
> process_one_work+0x182/0x450
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.206653] [<ffffffff81084641>]
> worker_thread+0x121/0x410
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ?
> rescuer_thread+0x3e0/0x3e0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.312023] [<ffffffff8108b312>]
> kthread+0xd2/0xf0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.362974] [<ffffffff8108b240>] ?
> kthread_create_on_node+0x1d0/0x1d0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.414058] [<ffffffff8172637c>]
> ret_from_fork+0x7c/0xb0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.464358] [<ffffffff8108b240>] ?
> kthread_create_on_node+0x1d0/0x1d0
>
> (4) img_request null
> Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865] Assertion failure in
> rbd_img_obj_callback() at line 2127:
> Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865]
> Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865]
> rbd_assert(img_request != NULL);
>
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.257322] [<ffffffffa01a5897>]
> rbd_obj_request_complete+0x27/0x70 [rbd]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.268450] [<ffffffffa01a8d4f>]
> rbd_osd_req_callback+0xdf/0x4e0 [rbd]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.279182] [<ffffffffa039e262>]
> dispatch+0x4a2/0x900 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.289159] [<ffffffffa039494b>]
> try_read+0x4ab/0x10d0 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.299236] [<ffffffffa0396362>] ?
> try_write+0xa42/0xe30 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.309777] [<ffffffff8101b7d9>] ?
> sched_clock+0x9/0x10
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.318627] [<ffffffff8101b763>] ?
> native_sched_clock+0x13/0x80
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.332347] [<ffffffff8101b7d9>] ?
> sched_clock+0x9/0x10
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.341095] [<ffffffff8109d2d5>] ?
> sched_clock_cpu+0xb5/0x100
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.351061] [<ffffffffa0396809>]
> con_work+0xb9/0x640 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.361003] [<ffffffff810838a2>]
> process_one_work+0x182/0x450
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.370752] [<ffffffff81084641>]
> worker_thread+0x121/0x410
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.379816] [<ffffffff81084520>] ?
> rescuer_thread+0x3e0/0x3e0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.389173] [<ffffffff8108b312>]
> kthread+0xd2/0xf0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.396898] [<ffffffff8108b240>] ?
> kthread_create_on_node+0x1d0/0x1d0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.407506] [<ffffffff8172637c>]
> ret_from_fork+0x7c/0xb0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.416181] [<ffffffff8108b240>] ?
> kthread_create_on_node+0x1d0/0x1d0
> This is similar to: http://tracker.ceph.com/issues/8378
>
> Saw that the rhel7a branch has many of the latest fixes and is somewhat
> compatible with 3.13 kernels,
> For validation, we have taken the rhel7a ceph-client branch and with minor
> modification gotten it to compile with 3.13.0 headers. With this we did not
> hit any issues (expect issue-2).

What do you mean by "expect issue-2"?

(3) and (4) should be fixed in rhel7-a. Can't say anything about (1) and
(2) - please report back if you see any soft lockup splats on rhel7-a.

> We understand that is not the right approach for Ubuntu, It would be great if
> we could get the fixes into Ubuntu 14.04 kernels as well.
It may not be the right approach, but in many ways it's better than a set
of selected backports. While working on another report I found a couple of
easy-to-backport patches that are missing from the Ubuntu 3.13 series and
will forward them to the stable maintainers, but, for those who can build
their own kernels at least, branches like rhel7-a are best.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
