Hello Junxiao,
Thanks for the quick reply, the information is very helpful.

-Gang

> On 10/12/2017 02:37 PM, Gang He wrote:
>> Hello list,
>>
>> We got an o2cb DLM problem report from a customer, who is using the o2cb
>> stack for the OCFS2 file system on SLES12SP1 (3.12.49-11-default).
>> The problem description is as below.
>>
>> The customer has a three-node Oracle RAC cluster:
>> gal7gblr2084
>> gal7gblr2085
>> gal7gblr2086
>>
>> On each node they have configured two OCFS2 resources as file systems. The
>> two nodes gal7gblr2085 and gal7gblr2086 hung and went into a loop trying
>> to kill each other, and they want a root cause analysis.
>> Anyway, all I see in the logs is these messages flooding /var/log/messages:
>>
>> 2017-10-05T06:50:25.980773+01:00 gal7gblr2085 kernel: [16874541.314199] o2net:
>> Connection to node gal7gblr2086 (num 2) at 10.233.217.12:7777 has been idle
>> for 30.5 secs, shutting it down.
> Looks like it is an old kernel. Shutting down the connection on idle
> timeout will cause DLM messages to be lost, which may cause a hang. Please
> apply the following 3 patches:
>
> 8c7b638cece1 ocfs2: quorum: add a log for node not fenced
> 8e9801dfe37c ocfs2: o2net: set tcp user timeout to max value
> c43c363def04 ocfs2: o2net: don't shutdown connection when idle timeout
>
> Thanks,
> Junxiao.
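A side note in case it helps others hitting this: until a patched kernel can be rolled out, the o2net idle timeout that triggers the "has been idle for 30.5 secs" shutdown can be inspected through o2cb's configfs tree. A rough sketch, assuming the cluster is named `mycluster` (substitute the real name from /etc/ocfs2/cluster.conf); note that raising the timeout is only a mitigation, not a fix, that the value must be identical on all nodes, and that it is normally changed via `o2cb configure` with the cluster offline:

```shell
# Hypothetical cluster name -- take the real one from /etc/ocfs2/cluster.conf.
CLUSTER=mycluster
CFG=/sys/kernel/config/cluster/$CLUSTER

# Current o2net idle timeout in milliseconds; the "idle for 30.5 secs"
# message in the log above corresponds to the 30000 ms default.
cat "$CFG/idle_timeout_ms"

# The related o2net tunables live alongside it:
cat "$CFG/keepalive_delay_ms"
cat "$CFG/reconnect_delay_ms"

# To check whether a kernel source tree already carries the third fix
# (assumes $KSRC is a git checkout of the kernel being shipped):
# git -C "$KSRC" log --oneline --grep='shutdown connection when idle timeout'
```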
>> 2017-10-05T06:50:37.456786+01:00 gal7gblr2085 kernel: [16874552.778726] o2net:
>> No longer connected to node gal7gblr2086 (num 2) at 10.233.217.12:7777
>> 2017-10-05T06:50:45.176798+01:00 gal7gblr2085 kernel: [16874560.487834]
>> (kworker/u64:1,13245,10):dlm_send_remote_convert_request:392 ERROR: Error -107
>> when sending message 504 (key 0x4a68dd81) to node 2
>> 2017-10-05T06:50:45.176812+01:00 gal7gblr2085 kernel: [16874560.487838] o2dlm:
>> Waiting on the death of node 2 in domain 18AE08328428452BA610E7BDE26F5246
>> 2017-10-05T06:50:50.284796+01:00 gal7gblr2085 kernel: [16874565.589996]
>> (kworker/u64:1,13245,10):dlm_send_remote_convert_request:392 ERROR: Error -107
>> when sending message 504 (key 0x4a68dd81) to node 2
>> 2017-10-05T06:50:50.284811+01:00 gal7gblr2085 kernel: [16874565.590000] o2dlm:
>> Waiting on the death of node 2 in domain 18AE08328428452BA610E7BDE26F5246
>> 2017-10-05T06:50:55.400808+01:00 gal7gblr2085 kernel: [16874570.700448]
>> (kworker/u64:1,13245,10):dlm_send_remote_convert_request:392 ERROR: Error -107
>> when sending message 504 (key 0x4a68dd81) to node 2
>> 2017-10-05T06:50:55.400824+01:00 gal7gblr2085 kernel: [16874570.700452] o2dlm:
>> Waiting on the death of node 2 in domain 18AE08328428452BA610E7BDE26F5246
>> 2017-10-05T06:51:00.512766+01:00 gal7gblr2085 kernel: [16874575.808944]
>> (kworker/u64:1,13245,26):dlm_send_remote_convert_request:392 ERROR: Error -107
>> when sending message 504 (key 0x4a68dd81) to node 2
>> 2017-10-05T06:51:00.512783+01:00 gal7gblr2085 kernel: [16874575.808948] o2dlm:
>> Waiting on the death of node 2 in domain 18AE08328428452BA610E7BDE26F5246
>> 2017-10-05T06:51:02.456785+01:00 gal7gblr2085 kernel: [16874577.749286]
>> (ora_diag_rcp2,24339,0):dlm_do_master_request:1344 ERROR: link to 2 went
>> down!
>> 2017-10-05T06:51:02.456797+01:00 gal7gblr2085 kernel: [16874577.749289]
>> (ora_diag_rcp2,24339,0):dlm_get_lock_resource:929 ERROR: status = -107
>> 2017-10-05T06:51:05.632955+01:00 gal7gblr2085 kernel: [16874580.920124]
>> (kworker/u64:1,13245,26):dlm_send_remote_convert_request:392 ERROR: Error -107
>> when sending message 504 (key 0x4a68dd81) to node 2
>> 2017-10-05T06:51:05.632973+01:00 gal7gblr2085 kernel: [16874580.920132] o2dlm:
>> Waiting on the death of node 2 in domain 18AE08328428452BA610E7BDE26F5246
>> 2017-10-05T06:51:07.976787+01:00 gal7gblr2085 kernel: [16874583.262561] o2net:
>> No connection established with node 2 after 30.0 seconds, giving up.
>> 2017-10-05T10:03:38.439542+01:00 gal7gblr2084 kernel: [1911889.097543]
>> (mdb_psp0_-mgmtd,21126,0):dlm_send_remote_unlock_request:358 ERROR: Error -107
>> when sending message 506 (key 0x4a68dd81) to node 1
>> 2017-10-05T10:03:38.439543+01:00 gal7gblr2084 kernel: [1911889.097547]
>> (mdb_psp0_-mgmtd,21126,0):dlm_send_remote_unlock_request:358 ERROR: Error -107
>> when sending message 506 (key 0x4a68dd81) to node 1
>>
>> Have you encountered such a problem when using the o2cb stack? We mainly
>> focus on the pcmk stack, but I still want to help this customer find the
>> root cause.
>>
>> Thanks
>> Gang
>>
>> _______________________________________________
>> Ocfs2-devel mailing list
>> Ocfs2-devel@oss.oracle.com
>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel