Hi,
We are using DLM for clustering (two nodes, single network interface between 
these nodes). Occasionally in intermittent network failure we are getting 
following kernel crash.
We are running linux  3.0.101 (x86_64) and DLM version is 1.7.

<3>[ 2410.184636] dlm: closing connection to node 33663168
<3>[ 2421.307880] dlm: reject connect from unknown addr
<4>[ 2421.307884] 02 00 00 00 c0 a8 01 02 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00
<0>[ 2421.307951] ------------[ cut here ]------------
<2>[ 2421.307958] kernel BUG at 
/usr/src/packages/BUILD/dlm-1.6.fio/obj/default/lowcomms.c:715!
<0>[ 2421.307961] invalid opcode: 0000 [#1] SMP
<4>[ 2421.307963] CPU 14
<4>[ 2421.307964] Modules linked in: ipmi_si scst_vdisk(PFN) raid0 ip6t_REJECT 
ip6t_LOG nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT 
xt_tcpudp xt_pkttype ipt_LOG xt_limit xt_state iptable_raw ip6table_mangle 
nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_conntrack 
nf_defrag_ipv4 ip6table_filter ip6_tables drbd(FN) mlx4_en qla2x00tgt(FN) 
qla2xxx_scst(FN) scsi_transport_fc scsi_tgt scst(PFN) scst_mods(FN) dlm(FN) 
configfs sctp(F) mlx4_core mpt3sas mpt2sas scsi_transport_sas raid_class mptctl 
mptbase iptable_filter ip_tables x_tables ipmi_devintf ipmi_msghandler 
dell_rbu(X) af_packet mperf binfmt_misc iomemory_vsl4(PFN) dm_mod sr_mod cdrom 
tg3 shpchp ipv6 ptp usb_storage ahci sg pps_core joydev pci_hotplug libahci 
ipv6_lib acpi_power_meter libata rtc_cmos dcdbas(X) container button wmi usbhid 
hid ttm drm_kms_helper drm i2c_algo_bit sysimgblt sysfillrect i2c_core 
syscopyarea btrfs zlib_deflate crc32c libcrc32c sd_mod crc_t10dif processor 
thermal_sys ehci_hcd hwmon megaraid_sas scsi_dh_alua scsi_dh_emc scsi_dh_rdac 
scsi_dh_hp_sw scsi_dh scsi_mod asix usbnet usbcore usb_common mii [last 
unloaded: ipmi_si]
<4>[ 2421.308022] Supported: No, Proprietary and Unsupported modules are loaded
<4>[ 2421.308024]
<4>[ 2421.308025] Pid: 2599, comm: kworker/u:8 Tainted: PF          NX 
3.0.101-0.15.1.6651.0.PTF-default #1 Dell Inc. PowerEdge R730/0599V5
<4>[ 2421.308029] RIP: 0010:[<ffffffffa0712346>]  [<ffffffffa0712346>] 
receive_from_sock+0x376/0x380 [dlm]
<4>[ 2421.308038] RSP: 0018:ffff882f54067d40  EFLAGS: 00010246
<4>[ 2421.308039] RAX: 0000000000000158 RBX: ffff882ef3cacd98 RCX: 
0000000000000000
<4>[ 2421.308041] RDX: 0000000000000158 RSI: ffff882f3797c480 RDI: 
ffffffffa06e089d
<4>[ 2421.308043] RBP: ffff882f54067d90 R08: ffff882f54067ae0 R09: 
ffff882f54067d50
<4>[ 2421.308044] R10: ffffffffa06f7260 R11: ffffffff8120f180 R12: 
0000000000000158
<4>[ 2421.308046] R13: ffff882f54067d50 R14: 0000000000001000 R15: 
ffff882ef3cacda8
<4>[ 2421.308048] FS:  0000000000000000(0000) GS:ffff88307f4e0000(0000) 
knlGS:0000000000000000
<4>[ 2421.308050] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>[ 2421.308053] CR2: 00007f3a7bb473b0 CR3: 0000002df7c6e000 CR4: 
00000000001407e0
<4>[ 2421.308055] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
<4>[ 2421.308057] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
<4>[ 2421.308059] Process kworker/u:8 (pid: 2599, threadinfo ffff882f54066000, 
task ffff882f55360140)
<0>[ 2421.308060] Stack:
<4>[ 2421.308061]  ffff882f54066010 0000000000011800 0000000000000000 
0000000000000000
<4>[ 2421.308069]  ffff882f54067dc0 0000000000000002 ffff882f54067dc0 
0000000000000000
<4>[ 2421.308072]  0000000000000080 ffff882f5dcb4180 0000000000000030 
0000000100000084
<0>[ 2421.308075] Call Trace:
<4>[ 2421.308108]  [<ffffffffa071052e>] process_recv_sockets+0x1e/0x30 [dlm]
<4>[ 2421.308125]  [<ffffffff8107b9cc>] process_one_work+0x16c/0x350
<4>[ 2421.308133]  [<ffffffff8107e5fa>] worker_thread+0x17a/0x410
<4>[ 2421.308138]  [<ffffffff81082966>] kthread+0x96/0xa0
<4>[ 2421.308143]  [<ffffffff81469ee4>] kernel_thread_helper+0x4/0x10
<0>[ 2421.308146] Code: 00 88 ff ff 49 c1 e5 0c 49 8d 74 05 00 31 c0 e8 8b bf 
d4 e0 4c 89 ff e8 29 d5 d4 e0 31 f6 48 89 df e8 af e9 ff ff e9 04 ff ff ff <0f> 
0b eb fe 66 0f 1f 44 00 00 48 81 ec e8 00 00 00 31 ff be 50
<1>[ 2421.308170] RIP  [<ffffffffa0712346>] receive_from_sock+0x376/0x380 [dlm]
<4>[ 2421.308175]  RSP <ffff882f54067d40>


The line number lowcomms.c:715 points to the receive_from_sock() function, 
where nodeid is  being checked (and the received message is not a SCTP event).

644 /* Data received from remote end */
645 static int receive_from_sock(struct connection *con)
646 {
....
....
704
705         /* Process SCTP notifications */
706         if (msg.msg_flags & MSG_NOTIFICATION) {
707                 msg.msg_control = incmsg;
708                 msg.msg_controllen = sizeof(incmsg);
709
710                 process_sctp_notification(con, &msg,
711                                 page_address(con->rx_page) + con->cb.base);
712                 mutex_unlock(&con->sock_mutex);
713                 return 0;
714         }
715         BUG_ON(con->nodeid == 0);


I am fairly new when it comes to understanding DLM code. We are using SCTP 
protocol. If I understood correctly, nodeid = 0 points to the base connection 
(associated with the listener socket). The function receive_from_sock() has an 
assumption that if MSG_NOTIFICATION flag is not set, it got to be a peeled 
socket (which has associated nodeid > 0).  And vice versa - if MSG_NOTIFICATION 
flag is set, it is listener socket with nodeid = 0.
But when process_sctp_notification() rejects a SCTP event message due addr to 
nodeid mismatch (ie. dlm_addr-to_nodeid function returns non-zero), the 
function returns without peeling off a new socket.
The code is shown below, where the function is returning from line number 579. 
And the socket is peeled off at line number 588.
As the socket peeling off is not done, it is possible for listener socket 
receiving ordinary data (which was meant for peeled socket) from the connection 
where client already send some data (I am assuming client already sent this 
data before the socket is shutdown at server end). And if listener socket 
receives ordinary data,  DLM is going to hit the "BUG_ON()" at lowcomms.c:715.

Please let me know if my analysis is correct.

517 static void process_sctp_notification(struct connection *con,
518                                       struct msghdr *msg, char *buf)
....
....
570                         make_sockaddr(&prim.ssp_addr, 0, &addr_len);
571                         if (dlm_addr_to_nodeid(&prim.ssp_addr, &nodeid)) {
572                                 int i;
573                                 unsigned char *b=(unsigned char 
*)&prim.ssp_addr;
574                                 log_print("reject connect from unknown 
addr");
575                                 for (i=0; i<sizeof(struct 
sockaddr_storage);i++)
576                                         printk("%02x ", b[i]);
577                                 printk("\n");
578                                 sctp_send_shutdown(prim.ssp_assoc_id);
579                                 return;
580                         }
581
582                         new_con = nodeid2con(nodeid, GFP_NOFS);
583                         if (!new_con)
584                                 return;
585
586                         /* Peel off a new sock */
587                         sctp_lock_sock(con->sock->sk);
588                         ret = sctp_do_peeloff(con->sock->sk,
589                                 sn->sn_assoc_change.sac_assoc_id,
590                                 &new_con->sock);
591                         sctp_release_sock(con->sock->sk);






________________________________

PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).

Reply via email to