Hi Ilya,

We routinely hit the RBD crash in our setups when OSD nodes go away.
We have not been able to get a good stack trace for this one because of our
console capture issues, and the traces don't end up in the syslogs after the
crash either. We will get you the traces soon.
Most of the time this happens when all the OSD nodes go away at once. Could
this have been fixed by one of the following commits?

  cc9f1f5  libceph: change from BUG to WARN for __remove_osd() asserts … (Ilya Dryomov, Nov 5)
  ba9d114  libceph: clear r_req_lru_item in __unregister_linger_request() … (Ilya Dryomov, Nov 5)
  a390de0  libceph: unlink from o_linger_requests when clearing r_osd … (Ilya Dryomov, Nov 4)

Also, we have encountered a few other issues, listed below:

(1) Soft Lockup issue
Dec 10 11:22:28 rack3-client-1 kernel: [661597.506625] BUG: soft lockup - CPU#2 stuck for 22s! [java:29169] --- (vdbench process)
.
.
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.043935] Call Trace:
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>] con_work+0x298/0x640 [libceph]
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.152461] [<ffffffff810838a2>] process_one_work+0x182/0x450
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.206653] [<ffffffff81084641>] worker_thread+0x121/0x410
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.312023] [<ffffffff8108b312>] kthread+0xd2/0xf0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.362974] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.414058] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.464358] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.514121] Code: ff ff 48 89 df e8 e3 f1 ff ff 48 8b 7d a8 e8 7a 8c 0e e1 48 8b 7d b0 e8 41 d8 a7 e0 48 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 48 8b 45 b8 49 8b 0e 4c 89 f2 48 c7 c6 d0 76 64 a0 48 c7
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.663443] RIP [<ffffffffa063340e>] osd_reset+0x22e/0x2c0 [libceph]
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.712105] RSP <ffff880a22b8bd80>

(2) Soft lockup when OSDs are flapping

Dec 18 18:25:10 rack3-client-2 kernel: [157126.089489] BUG: soft lockup - CPU#4 stuck for 23s! [kworker/4:0:45012]
.
.
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098648] Call Trace:
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098653] [<ffffffffa030d963>] kick_requests+0x1e3/0x440 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098657] [<ffffffffa030df98>] ceph_osdc_handle_map+0x2a8/0x620 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098662] [<ffffffffa030e55b>] dispatch+0x24b/0xb20 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098665] [<ffffffffa0301c08>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098669] [<ffffffffa030552f>] con_work+0x164f/0x2b60 [libceph]
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098672] [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098674] [<ffffffff8101b763>] ? native_sched_clock+0x13/0x80
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098676] [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098679] [<ffffffff8109d2d5>] ? sched_clock_cpu+0xb5/0x100
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098681] [<ffffffff8109df6d>] ? vtime_common_task_switch+0x3d/0x40
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098684] [<ffffffff810838a2>] process_one_work+0x182/0x450
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098686] [<ffffffff81084641>] worker_thread+0x121/0x410
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098688] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098690] [<ffffffff8108b312>] kthread+0xd2/0xf0
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098692] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098695] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
Dec 18 18:25:10 rack3-client-2 kernel: [157126.098697] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0

(3)  BUG_ON(!list_empty(&req->r_req_lru_item));

Dec 4 17:14:33 rack6-ramp-4 kernel: [320359.828209] kernel BUG at /build/buildd/linux-3.13.0/net/ceph/osd_client.c:892!
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.043935] Call Trace:
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>] con_work+0x298/0x640 [libceph]
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.152461] [<ffffffff810838a2>] process_one_work+0x182/0x450
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.206653] [<ffffffff81084641>] worker_thread+0x121/0x410
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.312023] [<ffffffff8108b312>] kthread+0xd2/0xf0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.362974] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.414058] [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.464358] [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0

(4) img_request null
Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865] Assertion failure in rbd_img_obj_callback() at line 2127:
Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865]
Dec 12 08:07:48 rack1-ram-6 kernel: [251596.908865]     rbd_assert(img_request != NULL);

Dec 12 08:07:50 rack1-ram-6 kernel: [251597.257322]  [<ffffffffa01a5897>] rbd_obj_request_complete+0x27/0x70 [rbd]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.268450]  [<ffffffffa01a8d4f>] rbd_osd_req_callback+0xdf/0x4e0 [rbd]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.279182]  [<ffffffffa039e262>] dispatch+0x4a2/0x900 [libceph]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.289159]  [<ffffffffa039494b>] try_read+0x4ab/0x10d0 [libceph]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.299236]  [<ffffffffa0396362>] ? try_write+0xa42/0xe30 [libceph]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.309777]  [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.318627]  [<ffffffff8101b763>] ? native_sched_clock+0x13/0x80
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.332347]  [<ffffffff8101b7d9>] ? sched_clock+0x9/0x10
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.341095]  [<ffffffff8109d2d5>] ? sched_clock_cpu+0xb5/0x100
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.351061]  [<ffffffffa0396809>] con_work+0xb9/0x640 [libceph]
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.361003]  [<ffffffff810838a2>] process_one_work+0x182/0x450
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.370752]  [<ffffffff81084641>] worker_thread+0x121/0x410
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.379816]  [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.389173]  [<ffffffff8108b312>] kthread+0xd2/0xf0
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.396898]  [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.407506]  [<ffffffff8172637c>] ret_from_fork+0x7c/0xb0
Dec 12 08:07:50 rack1-ram-6 kernel: [251597.416181]  [<ffffffff8108b240>] ? kthread_create_on_node+0x1d0/0x1d0
This is similar to: http://tracker.ceph.com/issues/8378
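
Kernel traces pasted into mail are often hard-wrapped so that one stack frame
is split across two lines, which makes it awkward to compare traces from
different nodes. The following is only a minimal sketch, assuming the syslog
prefix format seen in this mail ("Mon D HH:MM:SS host kernel: [ts]"), that
rejoins wrapped lines and lists the function name of each frame. Lines that
are already on one line pass through unchanged. The names PREFIX, FRAME,
unwrap and frames are made up for this illustration and are not part of any
Ceph or kernel tooling.

```python
import re

# Start of a syslog-formatted kernel line (format assumed from this mail),
# e.g. "Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] "
PREFIX = re.compile(
    r"^[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d \S+ kernel: \[ *\d+\.\d+\] ?"
)
# A stack frame: "[<address>] [? ]symbol+0xoffset/0xsize"
FRAME = re.compile(r"\[<[0-9a-f]+>\] (\? )?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")

# Two frames copied from the traces in this mail, wrapped as the mailer left them.
SAMPLE = """\
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.097630] [<ffffffffa062d9e8>]
con_work+0x298/0x640 [libceph]
Dec 4 17:14:33 rack6-ramp-4 kernel: [320361.259860] [<ffffffff81084520>] ?
rescuer_thread+0x3e0/0x3e0
"""


def unwrap(text):
    """Join continuation lines onto the syslog line they were wrapped from."""
    joined = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if PREFIX.match(line) or not joined:
            joined.append(line.rstrip())
        else:
            joined[-1] += " " + line.strip()
    return joined


def frames(text):
    """Return (symbol, reliable) pairs; frames marked '?' are unreliable."""
    result = []
    for line in unwrap(text):
        match = FRAME.search(PREFIX.sub("", line))
        if match:
            result.append((match.group(2), match.group(1) is None))
    return result


if __name__ == "__main__":
    for symbol, reliable in frames(SAMPLE):
        print(symbol if reliable else "? " + symbol)
```

Comparing the symbol lists from two nodes (for example with a plain equality
check on the output of frames()) is usually enough to tell whether two reports
show the same code path.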

We saw that the rhel7a branch has many of the latest fixes and is somewhat
compatible with 3.13 kernels.
For validation, we took the rhel7a ceph-client branch and, with minor
modifications, got it to compile against the 3.13.0 headers. With this we did
not hit any of these issues (except issue 2).
We understand that this is not the right approach for Ubuntu. It would be great
if we could get the fixes into the Ubuntu 14.04 kernels as well.

Regards,
Chaitanya

-----Original Message-----
From: Somnath Roy
Sent: Tuesday, January 06, 2015 2:38 AM
To: Ilya Dryomov
Cc: Chaitanya Huilgol; [email protected]
Subject: RE: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)

It's happening both when idle and under load.
I don't have the trace right now but will get you one soon.

Thanks & Regards
Somnath

-----Original Message-----
From: Ilya Dryomov [mailto:[email protected]]
Sent: Monday, January 05, 2015 12:34 PM
To: Somnath Roy
Cc: Chaitanya Huilgol; [email protected]
Subject: Re: Ceph-client branch for Ubuntu 14.04.1 LTS (3.13.0-x kernels)

On Mon, Jan 5, 2015 at 11:01 PM, Somnath Roy <[email protected]> wrote:
> Ilya,
> Here are the steps:
>
> 1. You have a cluster (3 nodes) and replication is 3
>
> 2. Map a krbd image to a client.
>
> 3. Reboot or stop ceph services on one or more nodes
>
> 4. The client with the krbd image mapped crashes

Is it idle or under load?

Do you have a trace of the crash?

Thanks,

                Ilya

