On Tue, Jan 6, 2015 at 3:31 PM, Chaitanya Huilgol
<[email protected]> wrote:
> Hi Ilya,
>
> The RBD crash when OSD nodes go away is routinely hit in our setups.
> We have not been able to get a good stack trace for this one due to our
> console capture issues, and these don't end up in the syslogs after the
> crash either. Will get you the traces soon.
> Most of the time this happens when all the OSD nodes go away at once.
> Could this have been fixed by one of the following commits?
>
> libceph: change from BUG to WARN for __remove_osd() asserts
>   (Ilya Dryomov, Nov 5, cc9f1f5)
> libceph: clear r_req_lru_item in __unregister_linger_request()
>   (Ilya Dryomov, Nov 5, ba9d114)
> libceph: unlink from o_linger_requests when clearing r_osd
>   (Ilya Dryomov, Nov 4, a390de0)

Yes, but probably others as well.

>
> Also, we have encountered a few other issues, listed below.
>
> (1) Soft Lockup issue
> Dec 10 11:22:28 rack3-client-1 kernel: [661597.506625] BUG: soft lockup - 
> CPU#2 stuck for 22s! [java:29169] --- (vdbench process)
> .
> .
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.043935] Call Trace:
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>] 
> con_work+0x298/0x640 [libceph]
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.152461] [<ffffffff810838a2>] 
> process_one_work+0x182/0x450
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.206653] [<ffffffff81084641>] 
> worker_thread+0x121/0x410
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ? 
> rescuer_thread+0x3e0/0x3e0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.312023] [<ffffffff8108b312>] 
> kthread+0xd2/0xf0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.362974] [<ffffffff8108b240>] ? 
> kthread_create_on_node+0x1d0/0x1d0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.414058] [<ffffffff8172637c>] 
> ret_from_fork+0x7c/0xb0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.464358] [<ffffffff8108b240>] ? 
> kthread_create_on_node+0x1d0/0x1d0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.514121] Code: ff ff 48 89 df e8 
> e3 f1 ff ff 48 8b 7d a8 e8 7a 8c 0e e1 48 8b 7d b0 e8 41 d8 a7 e0 48 83 c4 30 
> 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 48 8b 45 b8 49 8b 0e 4c 89 f2 48 c7 
> c6 d0 76 64 a0 48 c7
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.663443] RIP [<ffffffffa063340e>] 
> osd_reset+0x22e/0x2c0 [libceph]
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.712105] RSP <ffff880a22b8bd80>
>
> (2) Soft lockup when OSDs are flapping
>
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.089489] BUG: soft lockup - 
> CPU#4 stuck for 23s! [kworker/4:0:45012]
> .
> .
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098648] Call Trace:
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098653] [<ffffffffa030d963>] 
> kick_requests+0x1e3/0x440 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098657] [<ffffffffa030df98>] 
> ceph_osdc_handle_map+0x2a8/0x620 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098662] [<ffffffffa030e55b>] 
> dispatch+0x24b/0xb20 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098665] [<ffffffffa0301c08>] ? 
> ceph_tcp_recvmsg+0x48/0x60 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098669] [<ffffffffa030552f>] 
> con_work+0x164f/0x2b60 [libceph]
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098672] [<ffffffff8101b7d9>] ? 
> sched_clock+0x9/0x10
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098674] [<ffffffff8101b763>] ? 
> native_sched_clock+0x13/0x80
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098676] [<ffffffff8101b7d9>] ? 
> sched_clock+0x9/0x10
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098679] [<ffffffff8109d2d5>] ? 
> sched_clock_cpu+0xb5/0x100
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098681] [<ffffffff8109df6d>] ? 
> vtime_common_task_switch+0x3d/0x40
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098684] [<ffffffff810838a2>] 
> process_one_work+0x182/0x450
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098686] [<ffffffff81084641>] 
> worker_thread+0x121/0x410
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098688] [<ffffffff81084520>] ? 
> rescuer_thread+0x3e0/0x3e0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098690] [<ffffffff8108b312>] 
> kthread+0xd2/0xf0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098692] [<ffffffff8108b240>] ? 
> kthread_create_on_node+0x1d0/0x1d0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098695] [<ffffffff8172637c>] 
> ret_from_fork+0x7c/0xb0
> Dec 18 18:25:10 rack3-client-2 kernel: [157126.098697] [<ffffffff8108b240>] ? 
> kthread_create_on_node+0x1d0/0x1d0
>
> (3) BUG_ON(!list_empty(&req->r_req_lru_item));
>
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320359.828209] kernel BUG at 
> /build/buildd/linux-3.13.0/net/ceph/osd_client.c:892!
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.043935] Call Trace:
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>] 
> con_work+0x298/0x640 [libceph]
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.152461] [<ffffffff810838a2>] 
> process_one_work+0x182/0x450
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.206653] [<ffffffff81084641>] 
> worker_thread+0x121/0x410
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ? 
> rescuer_thread+0x3e0/0x3e0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.312023] [<ffffffff8108b312>] 
> kthread+0xd2/0xf0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.362974] [<ffffffff8108b240>] ? 
> kthread_create_on_node+0x1d0/0x1d0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.414058] [<ffffffff8172637c>] 
> ret_from_fork+0x7c/0xb0
> Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.464358] [<ffffffff8108b240>] ? 
> kthread_create_on_node+0x1d0/0x1d0
>
> (4) img_request null
> Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865] Assertion failure in 
> rbd_img_obj_callback() at line 2127:
> Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865]
> Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865]     
> rbd_assert(img_request != NULL);
>
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.257322]  [<ffffffffa01a5897>] 
> rbd_obj_request_complete+0x27/0x70 [rbd]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.268450]  [<ffffffffa01a8d4f>] 
> rbd_osd_req_callback+0xdf/0x4e0 [rbd]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.279182]  [<ffffffffa039e262>] 
> dispatch+0x4a2/0x900 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.289159]  [<ffffffffa039494b>] 
> try_read+0x4ab/0x10d0 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.299236]  [<ffffffffa0396362>] ? 
> try_write+0xa42/0xe30 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.309777]  [<ffffffff8101b7d9>] ? 
> sched_clock+0x9/0x10
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.318627]  [<ffffffff8101b763>] ? 
> native_sched_clock+0x13/0x80
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.332347]  [<ffffffff8101b7d9>] ? 
> sched_clock+0x9/0x10
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.341095]  [<ffffffff8109d2d5>] ? 
> sched_clock_cpu+0xb5/0x100
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.351061]  [<ffffffffa0396809>] 
> con_work+0xb9/0x640 [libceph]
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.361003]  [<ffffffff810838a2>] 
> process_one_work+0x182/0x450
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.370752]  [<ffffffff81084641>] 
> worker_thread+0x121/0x410
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.379816]  [<ffffffff81084520>] ? 
> rescuer_thread+0x3e0/0x3e0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.389173]  [<ffffffff8108b312>] 
> kthread+0xd2/0xf0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.396898]  [<ffffffff8108b240>] ? 
> kthread_create_on_node+0x1d0/0x1d0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.407506]  [<ffffffff8172637c>] 
> ret_from_fork+0x7c/0xb0
> Dec 12 08:07:50 rack1-ram-6 kernel: [251597.416181]  [<ffffffff8108b240>] ? 
> kthread_create_on_node+0x1d0/0x1d0
> This is similar to: http://tracker.ceph.com/issues/8378
>
> We saw that the rhel7-a branch has many of the latest fixes and is somewhat
> compatible with 3.13 kernels.
> For validation, we took the rhel7-a ceph-client branch and, with minor
> modifications, got it to compile against 3.13.0 headers. With this we did
> not hit any issues (expect issue-2).

What do you mean by "expect issue-2"?

(3) and (4) should be fixed in rhel7-a.  Can't say anything about (1)
and (2) - please report back if you see any soft lockup splats on
rhel7-a.

> We understand that this is not the right approach for Ubuntu; it would be
> great if we could get the fixes into the Ubuntu 14.04 kernels as well.

It may not be the right approach, but in many ways it's better than
a set of selected backports.  While working on another report I found
a couple of easy-to-backport patches that are missing from the Ubuntu 3.13
series and will forward them to the stable guys, but, for those who can
build their own kernels at least, branches like rhel7-a are best.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html