Hi Guozhonghua,

Can you try pulling the following two fixes from mainline to fix your problem?
1. ocfs2: o2net: set tcp user timeout to max value
   8e9801dfe37c9e68cdbfcd15988df2187191864e
2. ocfs2: o2net: don't shutdown connection when idle timeout
   c43c363def04cdaed0d9e26dae846081f55714e7

With the above, the socket connection is not closed on a network hiccup,
which should avoid loss of messages.

Thanks,
--Srini

On 09/26/2014 05:06 AM, Guozhonghua wrote:
> Hi, all,
>
> We use OCFS2 on a network that is not reliable.
> When the convert request message cannot be sent to the other node, one
> node hangs, still waiting for the DLM.
>
> CAS2/logdir/var/log/syslog.1-6778-Sep 16 20:57:16 CAS2 kernel: [516366.623623] o2net: Connection to node CAS1 (num 1) at 10.172.254.1:7100 has been idle for 30.87 secs, shutting it down.
> CAS2/logdir/var/log/syslog.1-6779-Sep 16 20:57:16 CAS2 kernel: [516366.623631] o2net_idle_timer 1621: Local and remote node is heartbeating, and try connect
> CAS2/logdir/var/log/syslog.1-6780-Sep 16 20:57:16 CAS2 kernel: [516366.623792] o2net: No longer connected to node CAS1 (num 1) at 10.172.254.1:7100
> CAS2/logdir/var/log/syslog.1:6781:Sep 16 20:57:16 CAS2 kernel: [516366.623881] (dlm_thread,5140,4):dlm_send_proxy_ast_msg:482 ERROR: B258FD07DDD64710B68EB9683FD7D1B9: res M00000000000000046e011700000000, error -112 send AST to node 1
> CAS2/logdir/var/log/syslog.1-6782-Sep 16 20:57:16 CAS2 kernel: [516366.623900] (dlm_thread,5140,4):dlm_flush_asts:596 ERROR: status = -112
> CAS2/logdir/var/log/syslog.1-6783-Sep 16 20:57:16 CAS2 kernel: [516366.623937] (dlm_thread,5140,4):dlm_send_proxy_ast_msg:482 ERROR: B258FD07DDD64710B68EB9683FD7D1B9: res M000000000000001626011000000000, error -107 send AST to node 1
> CAS2/logdir/var/log/syslog.1-6784-Sep 16 20:57:16 CAS2 kernel: [516366.623946] (dlm_thread,5140,4):dlm_flush_asts:596 ERROR: status = -107
> CAS2/logdir/var/log/syslog.1-6785-Sep 16 20:57:16 CAS2 kernel: [516366.623997] Connect node 1 OK, and set timeout 0
> CAS2/logdir/var/log/syslog.1-6786-Sep 16 20:57:17 CAS2 kernel: [516367.623592] o2net: Connected to node CAS1 (num 1) at 10.172.254.1:7100
>
> debugfs: fs_locks -B
> Lockres: M00000000000000046e011700000000  Mode: Protected Read
> Flags: Initialized Attached Busy
> RO Holders: 0  EX Holders: 0
> Pending Action: Convert  Pending Unlock Action: None
> Requested Mode: Exclusive  Blocking Mode: No Lock
> PR > Gets: 318317  Fails: 0    Waits (usec) Total: 128622  Max: 3
> EX > Gets: 706878  Fails: 0    Waits (usec) Total: 284967  Max: 2
> Disk Refreshes: 0
>
> debugfs: dlm_locks M00000000000000046e011700000000
> Lockres: M00000000000000046e011700000000   Owner: 2    State: 0x0
> Last Used: 0      ASTs Reserved: 0    Inflight: 0    Migration Pending: No
> Refs: 4    Locks: 2    On Lists: None
> Reference Map: 1
>  Lock-Queue  Node  Level  Conv  Cookie  Refs  AST  BAST  Pending-Action
>  Granted     1     PR     -1    1:195   2     No   No    None
>  Converting  2     PR     EX    2:196   2     No   No    None
>
> We reviewed the code and want to resend the DLM message to avoid the hang.
>
> The patch needs review.
> It has been tested by manually shutting the network interface down and
> bringing it back up to recreate the issue.
> If the TCP channel between the two nodes is set up again within 5 seconds,
> resending the message works well.
>
> We would appreciate any better way to avoid this.
> Thanks.
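For discussion, here is a rough userspace sketch of the bounded-retry idea in the patch below. On Linux the two errors in the log, -107 and -112, are -ENOTCONN and -EHOSTDOWN. The names send_with_retry, is_transient_link_error and flaky_send, and the callback types, are made up for illustration and are not part of o2net or the DLM; the backoff hook stands in for the msleep(1000) call in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sender callback: returns 0 on success or a negative errno. */
typedef int (*send_fn)(void *ctx);

/* Hypothetical backoff hook, called before each retry (may be NULL). */
typedef void (*backoff_fn)(void);

/* The two failures seen in the log: -ENOTCONN (-107) and -EHOSTDOWN (-112) on Linux. */
static bool is_transient_link_error(int ret)
{
	return ret == -ENOTCONN || ret == -EHOSTDOWN;
}

/*
 * Send once, then retry up to max_retries more times while the result is a
 * transient link error. Returns the result of the last attempt, so the
 * caller still sees the error if the link never comes back.
 */
static int send_with_retry(send_fn fn, void *ctx, int max_retries,
			   backoff_fn backoff)
{
	int tries = 0;
	int ret;

	assert(fn != NULL && max_retries >= 0);

	ret = fn(ctx);
	while (is_transient_link_error(ret) && tries++ < max_retries) {
		if (backoff)
			backoff();
		ret = fn(ctx);
	}
	return ret;
}

/* Demo sender: fails with -ENOTCONN until the counter in ctx reaches 0. */
static int flaky_send(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return -ENOTCONN;
	}
	return 0;
}
```

With a sender that fails three times, send_with_retry(flaky_send, &fails, 5, NULL) succeeds on the fourth call. If the link stays down past the retry budget, the last error is returned and the caller still has to handle it, which is the case the retry approach cannot cover.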
>
>
> --- ocfs2/dlm/dlmthread.c 2014-06-07 10:40:09.000000000 +0800
> +++ ocfs2/dlm/dlmthread.c 2014-09-26 16:42:36.000000000 +0800
> @@ -517,6 +517,9 @@ static void dlm_flush_asts(struct dlm_ct
>  	struct dlm_lock_resource *res;
>  	u8 hi;
>  
> +	/* resend the msg again */
> +	int send_times = 0;
> +
>  	spin_lock(&dlm->ast_lock);
>  	while (!list_empty(&dlm->pending_asts)) {
>  		lock = list_entry(dlm->pending_asts.next,
> @@ -539,9 +542,16 @@ static void dlm_flush_asts(struct dlm_ct
>  		spin_unlock(&dlm->ast_lock);
>  
>  		if (lock->ml.node != dlm->node_num) {
> -			ret = dlm_do_remote_ast(dlm, res, lock);
> -			if (ret < 0)
> +			ret = dlm_do_remote_ast(dlm, res, lock);
> +			if (ret < 0) {
>  				mlog_errno(ret);
> +				while ((ret == -112 || ret == -107) && send_times++ < 5 ) {
> +					msleep(1000);
> +					ret = dlm_do_remote_ast(dlm, res, lock);
> +					mlog(ML_NOTICE, "AST message retry send again, %d code, send_time = %d\n", ret, send_times);
> +				}
> +				send_times = 0;
> +			}
>  		} else
>  			dlm_do_local_ast(dlm, res, lock);
>  
> @@ -592,8 +602,15 @@ static void dlm_flush_asts(struct dlm_ct
>  
>  		if (lock->ml.node != dlm->node_num) {
>  			ret = dlm_send_proxy_bast(dlm, res, lock, hi);
> -			if (ret < 0)
> +			if (ret < 0) {
>  				mlog_errno(ret);
> +				while ((ret == -112 || ret == -107) && send_times++ < 5 ) {
> +					msleep(1000);
> +					ret = dlm_send_proxy_bast(dlm, res, lock, hi);
> +					mlog(ML_NOTICE, "BAST message retry send again, %d code, send_time = %d\n", ret, send_times);
> +				}
> +				send_times = 0;
> +			}
>  		} else
>  			dlm_do_local_bast(dlm, res, lock, hi);
>  
>
> -------------------------------------------------------------------------
> This e-mail and its attachments contain confidential information from H3C,
> which is intended only for the person or entity whose address is listed
> above. Any use of the information contained herein in any way (including,
> but not limited to, total or partial disclosure, reproduction, or
> dissemination) by persons other than the intended recipient(s) is
> prohibited. If you receive this e-mail in error, please notify the sender
> by phone or email immediately and delete it!
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-users