Re: [Ocfs2-users] Ocfs2 clients hang

2015-12-28 Thread gjprabu
Hi Joseph,



We are facing the same issue again. Please find the logs from when the problem 
occurred.



Dec 27 21:45:44 integ-hm5 kernel: (dlm_thread,46268,24):dlm_update_lvb:206 
getting lvb from lockres for master node

Dec 27 21:45:44 integ-hm5 kernel: (dlm_thread,46268,24):ocfs2_locking_ast:1076 
AST fired for lockres M0084782202, action 1, unlock 0, 
level -1 => 3

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__ocfs2_cluster_lock:1465 
lockres N8340963d, convert from -1 to 3

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_get_lock_resource:724 get 
lockres N8340963d (len 31)

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_lookup_lockres_full:198 
N8340963d

Dec 27 21:45:44 integ-hm5 kernel: 
(nvfs,91539,0):__dlm_lockres_grab_inflight_ref:663 
A895BC216BE641A8A7E20AA89D57E051: res N8340963d, inflight++: now 1, 
dlm_lockres_grab_inflight_ref()

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock:690 type=3, flags = 0x0

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock:691 creating lock: 
lock=8801824b4500 res=88265dbf2bc0

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock_master:131 type=3

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock_master:148 I can grant 
this lock right away

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_dirty_lockres:483 
A895BC216BE641A8A7E20AA89D57E051: res N8340963d

Dec 27 21:45:44 integ-hm5 kernel: 
(nvfs,91539,0):dlm_lockres_drop_inflight_ref:684 
A895BC216BE641A8A7E20AA89D57E051: res N8340963d, inflight--: now 0, 
dlmlock()

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_dirty_lockres:483 
A895BC216BE641A8A7E20AA89D57E051: res N8340963d

Dec 27 21:45:44 integ-hm5 kernel: (dlm_thread,46268,24):dlm_flush_asts:541 
A895BC216BE641A8A7E20AA89D57E051: res N8340963d, Flush AST for lock 
5:441609912, type 3, node 5

Dec 27 21:45:44 integ-hm5 kernel: (dlm_thread,46268,24):dlm_do_local_ast:232 
A895BC216BE641A8A7E20AA89D57E051: res N8340963d, lock 5:441609912, 
Local AST

Dec 27 21:45:44 integ-hm5 kernel: (dlm_thread,46268,24):ocfs2_locking_ast:1076 
AST fired for lockres N8340963d, action 1, unlock 0, level -1 => 3

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__ocfs2_cluster_lock:1465 
lockres O0084782204, convert from -1 to 3

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_get_lock_resource:724 get 
lockres O0084782204 (len 31)

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_lookup_lockres_full:198 
O0084782204

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_get_lock_resource:778 
allocating a new resource

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_lookup_lockres_full:198 
O0084782204

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_get_lock_resource:789 no 
lockres found, allocated our own: 880717e38780

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_insert_lockres:187 
A895BC216BE641A8A7E20AA89D57E051: Hash res O0084782204

Dec 27 21:45:44 integ-hm5 kernel: 
(nvfs,91539,0):__dlm_lockres_grab_inflight_ref:663 
A895BC216BE641A8A7E20AA89D57E051: res O0084782204, 
inflight++: now 1, dlm_get_lock_resource()

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_do_master_request:1364 
node 1 not master, response=NO

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_do_master_request:1364 
node 2 not master, response=NO

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_do_master_request:1364 
node 3 not master, response=NO

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_do_master_request:1364 
node 4 not master, response=NO

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_wait_for_lock_mastery:1122 
about to master O0084782204 here, this=5

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_do_assert_master:1668 
sending assert master to 1 (O0084782204)

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_do_assert_master:1668 
sending assert master to 2 (O0084782204)

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_do_assert_master:1668 
sending assert master to 3 (O0084782204)

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_do_assert_master:1668 
sending assert master to 4 (O0084782204)

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_get_lock_resource:968 
A895BC216BE641A8A7E20AA89D57E051: res O0084782204, Mastered 
by 5

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_mle_release:436 Releasing 
mle for O0084782204, type 1

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlmlock:690 type=3, flags = 0x0

Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlmlock:691 creating lock: 
lock=8801824b4680 res=880717e38780

Dec 27 21:45:44 integ-hm5 kernel: 

Re: [Ocfs2-users] Ocfs2 clients hang

2015-12-28 Thread Joseph Qi
So which process hangs? And which lockres is it waiting for?
From the log I cannot get that information.
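
For anyone hitting the same hang, a rough sketch of how those two questions can be 
answered on the stuck node before rebooting (the PID and the device path /dev/sdX are 
placeholders, and fs_locks -B assumes a reasonably recent ocfs2-tools):

# which tasks are stuck in uninterruptible sleep (D state)?
ps -eo pid,stat,comm,wchan | awk '$2 ~ /D/'

# which lock resources is the filesystem still waiting on (busy locks only)?
debugfs.ocfs2 -R "fs_locks -B" /dev/sdX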

On 2015/12/28 16:46, gjprabu wrote:
> Hi Joseph,
> 
>  Again we are facing same issue. Please find the logs when the 
> problem occurred.
> 
> Dec 27 21:45:44 integ-hm5 kernel: (dlm_thread,46268,24):dlm_update_lvb:206 
> getting lvb from lockres for master node
> Dec 27 21:45:44 integ-hm5 kernel: 
> (dlm_thread,46268,24):ocfs2_locking_ast:1076 AST fired for lockres 
> M0084782202, action 1, unlock 0, level -1 => 3
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__ocfs2_cluster_lock:1465 
> lockres N8340963d, convert from -1 to 3
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_get_lock_resource:724 
> get lockres N8340963d (len 31)
> Dec 27 21:45:44 integ-hm5 kernel: 
> (nvfs,91539,0):__dlm_lookup_lockres_full:198 N8340963d
> Dec 27 21:45:44 integ-hm5 kernel: 
> (nvfs,91539,0):__dlm_lockres_grab_inflight_ref:663 
> A895BC216BE641A8A7E20AA89D57E051: res N8340963d, inflight++: now 1, 
> dlm_lockres_grab_inflight_ref()
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock:690 type=3, flags = 
> 0x0
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock:691 creating lock: 
> lock=8801824b4500 res=88265dbf2bc0
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock_master:131 type=3
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock_master:148 I can 
> grant this lock right away
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_dirty_lockres:483 
> A895BC216BE641A8A7E20AA89D57E051: res N8340963d
> Dec 27 21:45:44 integ-hm5 kernel: 
> (nvfs,91539,0):dlm_lockres_drop_inflight_ref:684 
> A895BC216BE641A8A7E20AA89D57E051: res N8340963d, inflight--: now 0, 
> dlmlock()
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_dirty_lockres:483 
> A895BC216BE641A8A7E20AA89D57E051: res N8340963d
> Dec 27 21:45:44 integ-hm5 kernel: (dlm_thread,46268,24):dlm_flush_asts:541 
> A895BC216BE641A8A7E20AA89D57E051: res N8340963d, Flush AST for lock 
> 5:441609912, type 3, node 5
> Dec 27 21:45:44 integ-hm5 kernel: (dlm_thread,46268,24):dlm_do_local_ast:232 
> A895BC216BE641A8A7E20AA89D57E051: res N8340963d, lock 5:441609912, 
> Local AST
> Dec 27 21:45:44 integ-hm5 kernel: 
> (dlm_thread,46268,24):ocfs2_locking_ast:1076 AST fired for lockres 
> N8340963d, action 1, unlock 0, level -1 => 3
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__ocfs2_cluster_lock:1465 
> lockres O0084782204, convert from -1 to 3
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_get_lock_resource:724 
> get lockres O0084782204 (len 31)
> Dec 27 21:45:44 integ-hm5 kernel: 
> (nvfs,91539,0):__dlm_lookup_lockres_full:198 O0084782204
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_get_lock_resource:778 
> allocating a new resource
> Dec 27 21:45:44 integ-hm5 kernel: 
> (nvfs,91539,0):__dlm_lookup_lockres_full:198 O0084782204
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_get_lock_resource:789 no 
> lockres found, allocated our own: 880717e38780
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_insert_lockres:187 
> A895BC216BE641A8A7E20AA89D57E051: Hash res O0084782204
> Dec 27 21:45:44 integ-hm5 kernel: 
> (nvfs,91539,0):__dlm_lockres_grab_inflight_ref:663 
> A895BC216BE641A8A7E20AA89D57E051: res O0084782204, 
> inflight++: now 1, dlm_get_lock_resource()
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_do_master_request:1364 
> node 1 not master, response=NO
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_do_master_request:1364 
> node 2 not master, response=NO
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_do_master_request:1364 
> node 3 not master, response=NO
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_do_master_request:1364 
> node 4 not master, response=NO
> Dec 27 21:45:44 integ-hm5 kernel: 
> (nvfs,91539,0):dlm_wait_for_lock_mastery:1122 about to master 
> O0084782204 here, this=5
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_do_assert_master:1668 
> sending assert master to 1 (O0084782204)
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_do_assert_master:1668 
> sending assert master to 2 (O0084782204)
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_do_assert_master:1668 
> sending assert master to 3 (O0084782204)
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_do_assert_master:1668 
> sending assert master to 4 (O0084782204)
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_get_lock_resource:968 
> A895BC216BE641A8A7E20AA89D57E051: res O0084782204, 
> Mastered by 5
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_mle_release:436 
> Releasing mle for 

Re: [Ocfs2-users] Ocfs2 clients hang

2015-12-28 Thread gjprabu
Joseph,





Do you see anything that looks like a kernel issue in the logs below? After a certain 
point in time, no DLM logs are found.

 



 
Dec 27 22:21:22 integ-hm8 kernel: (ocfs2rec,68213,10):dlmconvert_remote:270 
type=0, convert_type=-1, busy=0

Dec 27 22:21:22 integ-hm8 kernel: (ocfs2rec,68213,10):dlmconvert_remote:275 
bailing out early since res is RECOVERING on secondary queue

Dec 27 22:21:22 integ-hm8 kernel: (ocfs2rec,68213,10):dlmlock:652 retrying 
convert with migration/recovery/in-progress

Dec 27 22:21:22 integ-hm8 kernel: 
(kworker/u192:1,56020,17):dlm_send_one_lockres:1289 sending to 2

Dec 27 22:21:22 integ-hm8 kernel: 
(kworker/u192:1,56020,17):dlm_send_mig_lockres_msg:1138 
A895BC216BE641A8A7E20AA89D57E051:M008ec64cbc: sending mig 
lockres (recovery) to 2

Dec 27 22:21:22 integ-hm8 kernel: 
(kworker/u192:1,56020,1):dlm_send_one_lockres:1289 sending to 2

Dec 27 22:21:22 integ-hm8 kernel: 
(kworker/u192:1,56020,1):dlm_send_mig_lockres_msg:1138 
A895BC216BE641A8A7E20AA89D57E051:O006e7df720: sending mig 
lockres (recovery) to 2

Dec 27 22:21:22 integ-hm8 kernel: 
(kworker/u192:1,56020,1):dlm_send_one_lockres:1289 sending to 2

Dec 27 22:21:22 integ-hm8 kernel: 
(kworker/u192:1,56020,1):dlm_send_mig_lockres_msg:1138 
A895BC216BE641A8A7E20AA89D57E051:M006e7df726: sending mig 
lockres (recovery) to 2

Dec 27 22:21:22 integ-hm8 kernel: 
(kworker/u192:1,56020,1):dlm_send_one_lockres:1289 sending to 2

Dec 27 22:21:22 integ-hm8 kernel: 
(kworker/u192:1,56020,1):dlm_send_mig_lockres_msg:1138 
A895BC216BE641A8A7E20AA89D57E051:O006e7df726: sending mig 
lockres (recovery) to 2

Dec 27 22:21:22 integ-hm8 kernel: 
(kworker/u192:1,56020,1):dlm_send_one_lockres:1289 sending to 2

Dec 27 22:21:22 integ-hm8 kernel: 
(kworker/u192:1,56020,1):dlm_send_mig_lockres_msg:1138 
A895BC216BE641A8A7E20AA89D57E051:M006e7df729: sending mig 
lockres (recovery) to 2

Dec 27 22:21:22 integ-hm8 kernel: 
(kworker/u192:1,56020,9):dlm_send_one_lockres:1289 sending to 2

Dec 27 22:21:22 integ-hm8 kernel: 
(kworker/u192:1,56020,9):dlm_send_mig_lockres_msg:1138 
A895BC216BE641A8A7E20AA89D57E051:O006e7df729: sending mig 
lockres (recovery) to 2

Dec 27 22:21:22 integ-hm8 kernel: 
(kworker/u192:1,56020,9):dlm_send_one_lockres:1289 sending to 2

Dec 27 22:21:22 integ-hm8 kernel: 
(kworker/u192:1,56020,9):dlm_send_mig_lockres_msg:1138 
A895BC216BE641A8A7E20AA89D57E051:M006e7df72c: sending mig 
lockres (recovery) to 2

Dec 27 22:21:22 integ-hm8 kernel: 
(kworker/u192:1,56020,9):dlm_send_one_lockres:1289 sending to 2

Dec 27 22:21:22 integ-hm8 kernel: 
(kworker/u192:1,56020,9):dlm_send_mig_lockres_msg:1138 
A895BC216BE641A8A7E20AA89D57E051:O006e7df72c: sending mig 
lockres (recovery) to 2

Dec 27 22:21:22 integ-hm8 kernel: 
(kworker/u192:1,56020,9):dlm_send_one_lockres:1289 sending to 2

Dec 27 22:55:52 integ-hm8 rsyslogd: [origin software="rsyslogd" 
swVersion="7.4.7" x-pid="1331" x-info="http://www.rsyslog.com;] exiting on 
signal 15.

Dec 27 23:05:25 integ-hm8 rsyslogd: [origin software="rsyslogd" 
swVersion="7.4.7" x-pid="1346" x-info="http://www.rsyslog.com;] start

Dec 27 23:05:08 integ-hm8 journal: Runtime journal is using 8.0M (max 4.0G, 
leaving 4.0G of free 251.9G, current limit 4.0G).

Dec 27 23:05:08 integ-hm8 kernel: Initializing cgroup subsys cpuset

Dec 27 23:05:08 integ-hm8 kernel: Initializing cgroup subsys cpu

Dec 27 23:05:08 integ-hm8 kernel: Initializing cgroup subsys cpuacct

Dec 27 23:05:08 integ-hm8 kernel: Linux version 3.10.91 (root@integ-hm8) (gcc 
version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Thu Oct 29 11:52:34 IST 
2015

Dec 27 23:05:08 integ-hm8 kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.91 
root=UUID=20d0873d-8f3e-455a-ba67-f8f336b5e9a7 ro crashkernel=auto rhgb quiet 
LANG=en_US.UTF-8 systemd.debug

Dec 27 23:05:08 integ-hm8 kernel: e820: BIOS-provided physical RAM map:

Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 
0x-0x0009bfff] usable

Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 
0x0010-0xbd2c] usable

Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 
0xbd2d-0xbd2fbfff] reserved

Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 
0xbd2fc000-0xbd35afff] ACPI data

Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 
0xbd35b000-0xbfff] reserved

Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 
0xe000-0xefff] reserved

Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 
0xfe00-0x] reserved

Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 
0x0001-0x00803fff] usable

Dec 27 23:05:08 integ-hm8 kernel: NX (Execute Disable) protection: active

Dec 27 23:05:08 integ-hm8 kernel: SMBIOS 2.7 present.

Dec 27 23:05:08 integ-hm8 kernel: No AGP 

Re: [Ocfs2-users] Ocfs2 clients hang

2015-12-28 Thread gjprabu
Yes, all 5 nodes hung; after a restart everything is fine.


Regards

Prabu









 On Mon, 28 Dec 2015 15:00:57 +0530 Joseph Qi <joseph...@huawei.com> wrote:




So which process hangs? And which lockres is it waiting for? 

From the log I cannot get that information. 

 

On 2015/12/28 16:46, gjprabu wrote: 

 Hi Joseph, 

 

 Again we are facing same issue. Please find the logs when the problem 
occurred. 

 

 Dec 27 21:45:44 integ-hm5 kernel: (dlm_thread,46268,24):dlm_update_lvb:206 
getting lvb from lockres for master node 

 Dec 27 21:45:44 integ-hm5 kernel: 
(dlm_thread,46268,24):ocfs2_locking_ast:1076 AST fired for lockres 
M0084782202, action 1, unlock 0, level -1 => 3 

 Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__ocfs2_cluster_lock:1465 
lockres N8340963d, convert from -1 to 3 

 Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_get_lock_resource:724 
get lockres N8340963d (len 31) 

 Dec 27 21:45:44 integ-hm5 kernel: 
(nvfs,91539,0):__dlm_lookup_lockres_full:198 N8340963d 

 Dec 27 21:45:44 integ-hm5 kernel: 
(nvfs,91539,0):__dlm_lockres_grab_inflight_ref:663 
A895BC216BE641A8A7E20AA89D57E051: res N8340963d, inflight++: now 1, 
dlm_lockres_grab_inflight_ref() 

 Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock:690 type=3, flags 
= 0x0 

 Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock:691 creating 
lock: lock=8801824b4500 res=88265dbf2bc0 

 Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock_master:131 type=3 

 Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock_master:148 I can 
grant this lock right away 

 Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_dirty_lockres:483 
A895BC216BE641A8A7E20AA89D57E051: res N8340963d 

 Dec 27 21:45:44 integ-hm5 kernel: 
(nvfs,91539,0):dlm_lockres_drop_inflight_ref:684 
A895BC216BE641A8A7E20AA89D57E051: res N8340963d, inflight--: now 0, 
dlmlock() 

 Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_dirty_lockres:483 
A895BC216BE641A8A7E20AA89D57E051: res N8340963d 

 Dec 27 21:45:44 integ-hm5 kernel: (dlm_thread,46268,24):dlm_flush_asts:541 
A895BC216BE641A8A7E20AA89D57E051: res N8340963d, Flush AST for lock 
5:441609912, type 3, node 5 

 Dec 27 21:45:44 integ-hm5 kernel: 
(dlm_thread,46268,24):dlm_do_local_ast:232 A895BC216BE641A8A7E20AA89D57E051: 
res N8340963d, lock 5:441609912, Local AST 

 Dec 27 21:45:44 integ-hm5 kernel: 
(dlm_thread,46268,24):ocfs2_locking_ast:1076 AST fired for lockres 
N8340963d, action 1, unlock 0, level -1 => 3 

 Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__ocfs2_cluster_lock:1465 
lockres O0084782204, convert from -1 to 3 

 Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_get_lock_resource:724 
get lockres O0084782204 (len 31) 

 Dec 27 21:45:44 integ-hm5 kernel: 
(nvfs,91539,0):__dlm_lookup_lockres_full:198 O0084782204 

 Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_get_lock_resource:778 
allocating a new resource 

 Dec 27 21:45:44 integ-hm5 kernel: 
(nvfs,91539,0):__dlm_lookup_lockres_full:198 O0084782204 

 Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_get_lock_resource:789 
no lockres found, allocated our own: 880717e38780 

 Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_insert_lockres:187 
A895BC216BE641A8A7E20AA89D57E051: Hash res O0084782204 

 Dec 27 21:45:44 integ-hm5 kernel: 
(nvfs,91539,0):__dlm_lockres_grab_inflight_ref:663 
A895BC216BE641A8A7E20AA89D57E051: res O0084782204, 
inflight++: now 1, dlm_get_lock_resource() 

 Dec 27 21:45:44 integ-hm5 kernel: 
(nvfs,91539,0):dlm_do_master_request:1364 node 1 not master, response=NO 

 Dec 27 21:45:44 integ-hm5 kernel: 
(nvfs,91539,0):dlm_do_master_request:1364 node 2 not master, response=NO 

 Dec 27 21:45:44 integ-hm5 kernel: 
(nvfs,91539,0):dlm_do_master_request:1364 node 3 not master, response=NO 

 Dec 27 21:45:44 integ-hm5 kernel: 
(nvfs,91539,0):dlm_do_master_request:1364 node 4 not master, response=NO 

 Dec 27 21:45:44 integ-hm5 kernel: 
(nvfs,91539,0):dlm_wait_for_lock_mastery:1122 about to master 
O0084782204 here, this=5 

 Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_do_assert_master:1668 
sending assert master to 1 (O0084782204) 

 Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_do_assert_master:1668 
sending assert master to 2 (O0084782204) 

 Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_do_assert_master:1668 
sending assert master to 3 (O0084782204) 

 Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_do_assert_master:1668 
sending assert master to 4 (O0084782204) 

 Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_get_lock_resource:968 
A895BC216BE641A8A7E20AA89D57E051: res 

Re: [Ocfs2-users] Ocfs2 clients hang

2015-12-28 Thread Joseph Qi
If the system hangs, you should figure out which process hangs, as well as its
stack, before restarting the system.
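
A rough sketch of how that can be done before the reboot, using only standard kernel 
facilities (the PID is a placeholder; sysrq may need to be enabled first):

# dump the stacks of all blocked (D-state) tasks into the kernel log
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg | tail -n 300 > /tmp/hung-stacks.txt

# or grab the stack of a single suspect task
cat /proc/<pid>/stack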

On 2015/12/28 20:16, gjprabu wrote:
> Joseph,
> 
> 
> Do you feel anything like kernel issue in below logs. After certain 
> point of time no dlm logs found.
>  
> 
>  
> Dec 27 22:21:22 integ-hm8 kernel: (ocfs2rec,68213,10):dlmconvert_remote:270 
> type=0, convert_type=-1, busy=0
> Dec 27 22:21:22 integ-hm8 kernel: (ocfs2rec,68213,10):dlmconvert_remote:275 
> bailing out early since res is RECOVERING on secondary queue
> Dec 27 22:21:22 integ-hm8 kernel: (ocfs2rec,68213,10):dlmlock:652 retrying 
> convert with migration/recovery/in-progress
> Dec 27 22:21:22 integ-hm8 kernel: 
> (kworker/u192:1,56020,17):dlm_send_one_lockres:1289 sending to 2
> Dec 27 22:21:22 integ-hm8 kernel: 
> (kworker/u192:1,56020,17):dlm_send_mig_lockres_msg:1138 
> A895BC216BE641A8A7E20AA89D57E051:M008ec64cbc: sending mig 
> lockres (recovery) to 2
> Dec 27 22:21:22 integ-hm8 kernel: 
> (kworker/u192:1,56020,1):dlm_send_one_lockres:1289 sending to 2
> Dec 27 22:21:22 integ-hm8 kernel: 
> (kworker/u192:1,56020,1):dlm_send_mig_lockres_msg:1138 
> A895BC216BE641A8A7E20AA89D57E051:O006e7df720: sending mig 
> lockres (recovery) to 2
> Dec 27 22:21:22 integ-hm8 kernel: 
> (kworker/u192:1,56020,1):dlm_send_one_lockres:1289 sending to 2
> Dec 27 22:21:22 integ-hm8 kernel: 
> (kworker/u192:1,56020,1):dlm_send_mig_lockres_msg:1138 
> A895BC216BE641A8A7E20AA89D57E051:M006e7df726: sending mig 
> lockres (recovery) to 2
> Dec 27 22:21:22 integ-hm8 kernel: 
> (kworker/u192:1,56020,1):dlm_send_one_lockres:1289 sending to 2
> Dec 27 22:21:22 integ-hm8 kernel: 
> (kworker/u192:1,56020,1):dlm_send_mig_lockres_msg:1138 
> A895BC216BE641A8A7E20AA89D57E051:O006e7df726: sending mig 
> lockres (recovery) to 2
> Dec 27 22:21:22 integ-hm8 kernel: 
> (kworker/u192:1,56020,1):dlm_send_one_lockres:1289 sending to 2
> Dec 27 22:21:22 integ-hm8 kernel: 
> (kworker/u192:1,56020,1):dlm_send_mig_lockres_msg:1138 
> A895BC216BE641A8A7E20AA89D57E051:M006e7df729: sending mig 
> lockres (recovery) to 2
> Dec 27 22:21:22 integ-hm8 kernel: 
> (kworker/u192:1,56020,9):dlm_send_one_lockres:1289 sending to 2
> Dec 27 22:21:22 integ-hm8 kernel: 
> (kworker/u192:1,56020,9):dlm_send_mig_lockres_msg:1138 
> A895BC216BE641A8A7E20AA89D57E051:O006e7df729: sending mig 
> lockres (recovery) to 2
> Dec 27 22:21:22 integ-hm8 kernel: 
> (kworker/u192:1,56020,9):dlm_send_one_lockres:1289 sending to 2
> Dec 27 22:21:22 integ-hm8 kernel: 
> (kworker/u192:1,56020,9):dlm_send_mig_lockres_msg:1138 
> A895BC216BE641A8A7E20AA89D57E051:M006e7df72c: sending mig 
> lockres (recovery) to 2
> Dec 27 22:21:22 integ-hm8 kernel: 
> (kworker/u192:1,56020,9):dlm_send_one_lockres:1289 sending to 2
> Dec 27 22:21:22 integ-hm8 kernel: 
> (kworker/u192:1,56020,9):dlm_send_mig_lockres_msg:1138 
> A895BC216BE641A8A7E20AA89D57E051:O006e7df72c: sending mig 
> lockres (recovery) to 2
> Dec 27 22:21:22 integ-hm8 kernel: 
> (kworker/u192:1,56020,9):dlm_send_one_lockres:1289 sending to 2
> Dec 27 22:55:52 integ-hm8 rsyslogd: [origin software="rsyslogd" 
> swVersion="7.4.7" x-pid="1331" x-info="http://www.rsyslog.com;] exiting on 
> signal 15.
> Dec 27 23:05:25 integ-hm8 rsyslogd: [origin software="rsyslogd" 
> swVersion="7.4.7" x-pid="1346" x-info="http://www.rsyslog.com;] start
> Dec 27 23:05:08 integ-hm8 journal: Runtime journal is using 8.0M (max 4.0G, 
> leaving 4.0G of free 251.9G, current limit 4.0G).
> Dec 27 23:05:08 integ-hm8 kernel: Initializing cgroup subsys cpuset
> Dec 27 23:05:08 integ-hm8 kernel: Initializing cgroup subsys cpu
> Dec 27 23:05:08 integ-hm8 kernel: Initializing cgroup subsys cpuacct
> Dec 27 23:05:08 integ-hm8 kernel: Linux version 3.10.91 (root@integ-hm8) (gcc 
> version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Thu Oct 29 11:52:34 
> IST 2015
> Dec 27 23:05:08 integ-hm8 kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.91 
> root=UUID=20d0873d-8f3e-455a-ba67-f8f336b5e9a7 ro crashkernel=auto rhgb quiet 
> LANG=en_US.UTF-8 systemd.debug
> Dec 27 23:05:08 integ-hm8 kernel: e820: BIOS-provided physical RAM map:
> Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 
> 0x-0x0009bfff] usable
> Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 
> 0x0010-0xbd2c] usable
> Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 
> 0xbd2d-0xbd2fbfff] reserved
> Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 
> 0xbd2fc000-0xbd35afff] ACPI data
> Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 
> 0xbd35b000-0xbfff] reserved
> Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 
> 0xe000-0xefff] reserved
> Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 
> 

Re: [Ocfs2-users] Ocfs2 clients hang

2015-12-22 Thread gjprabu
Hi Joseph,



  Our current setup has the details below, and DLM is now set to allow (DLM 
allow). Do you suggest any other options to get more logs?



debugfs.ocfs2 -l

DLM off  ( DLM allow)

MSG off

TCP off

CONN off

VOTE off

DLM_DOMAIN off

HB_BIO off

BASTS off

DLMFS off

ERROR allow

DLM_MASTER off

KTHREAD off

NOTICE allow

QUORUM off

SOCKET off

DLM_GLUE off

DLM_THREAD off

DLM_RECOVERY off

HEARTBEAT off

CLUSTER off



Regards

Prabu










 On Wed, 23 Dec 2015 07:30:54 +0530 Joseph Qi <joseph...@huawei.com> wrote:




So you mean the four nodes are manually rebooted? If so you must 

analyze messages before you rebooted. 

If there are not enough messages, you can switch on some messages. IMO, 

mostly hang problems are caused by DLM bug, so I suggest switch on DLM 

related log and reproduce. 

You can use debugfs.ocfs2 -l to show all message switches and switch on 

you want. For example, 

# debugfs.ocfs2 -l DLM allow 

 

Thanks, 

Joseph 

 

On 2015/12/22 21:47, gjprabu wrote: 

 Hi Joseph, 

 

 We are facing ocfs2 server hang problem frequently and suddenly 4 nodes 
going to hang stat expect 1 node. After reboot everything is come to normal, 
this behavior happend many times. Do we have any debug and fix for this issue. 

 

 Regards 

 Prabu 

 

 

  On Tue, 22 Dec 2015 16:30:52 +0530 Joseph Qi <joseph...@huawei.com> wrote:

 

 Hi Prabu, 

 From the log you provided, I can only see that node 5 disconnected with 

 node 2, 3, 1 and 4. It seemed that something wrong happened on the four 

 nodes, and node 5 did recovery for them. After that, the four nodes 

 joined again. 

 

 On 2015/12/22 16:23, gjprabu wrote: 

  Hi, 

  

  Anybody please help me on this issue. 

  

  Regards 

  Prabu 

  

   On Mon, 21 Dec 2015 15:16:49 +0530 gjprabu <gjpr...@zohocorp.com> wrote:

  

  Dear Team, 

  

  Ocfs2 clients are getting hang often and unusable. Please find the 
logs. Kindly provide the solution, it will be highly appreciated. 

  

  

  [3659684.042530] o2dlm: Node 4 joins domain 
A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes 

  

  [3992993.101490] 
(kworker/u192:1,63211,24):dlm_create_lock_handler:515 ERROR: dlm status = 
DLM_IVLOCKID 

  [3993002.193285] 
(kworker/u192:1,63211,24):dlm_deref_lockres_handler:2267 ERROR: 
A895BC216BE641A8A7E20AA89D57E051:M0062d2dcd0: bad lockres 
name 

  [3993032.457220] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 
ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 2 

  [3993062.547989] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 
ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 2 

  [3993064.860776] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 2 

  [3993064.860804] o2cb: o2dlm has evicted node 2 from domain 
A895BC216BE641A8A7E20AA89D57E051 

  [3993073.280062] o2dlm: Begin recovery on domain 
A895BC216BE641A8A7E20AA89D57E051 for node 2 

  [3993094.623695] (dlm_thread,46268,8):dlm_send_proxy_ast_msg:484 
ERROR: A895BC216BE641A8A7E20AA89D57E051: res S02, 
error -112 send AST to node 4 

  [3993094.624281] (dlm_thread,46268,8):dlm_flush_asts:605 ERROR: 
status = -112 

  [3993094.687668] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 3 

  [3993094.815662] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 
ERROR: Error -112 when sending message 514 (key 0xc3460ae7) to node 1 

  [3993094.816118] 
(dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -112 

  [3993124.778525] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 
ERROR: Error -107 when sending message 514 (key 0xc3460ae7) to node 3 

  [3993124.779032] 
(dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -107 

  [3993133.332516] o2cb: o2dlm has evicted node 3 from domain 
A895BC216BE641A8A7E20AA89D57E051 

  [3993139.915122] o2cb: o2dlm has evicted node 1 from domain 
A895BC216BE641A8A7E20AA89D57E051 

  [3993147.071956] o2cb: o2dlm has evicted node 4 from domain 
A895BC216BE641A8A7E20AA89D57E051 

  [3993147.071968] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 
ERROR: Error -107 when sending message 514 (key 0xc3460ae7) to node 4 

  [3993147.071975] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4 

  [3993147.071997] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4 

  [3993147.072001] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4 

  [3993147.072005] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4 

  

Re: [Ocfs2-users] Ocfs2 clients hang

2015-12-22 Thread Joseph Qi
Please also switch on BASTS and DLM_RECOVERY.
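
For reference, a minimal sketch of the corresponding commands, following the same 
debugfs.ocfs2 -l syntax used earlier in this thread (run them on every node, and set the 
switches back to off once the hang has been reproduced):

debugfs.ocfs2 -l BASTS allow
debugfs.ocfs2 -l DLM_RECOVERY allow
# ... reproduce the hang and collect the kernel log ...
debugfs.ocfs2 -l BASTS off
debugfs.ocfs2 -l DLM_RECOVERY off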

On 2015/12/23 10:11, gjprabu wrote:
> HI Joseph,
> 
>   Our current setup is having below details and DLM is now allowed 
> (DLM allow). Do you suggest any other option to get more logs.
> 
> debugfs.ocfs2 -l
> DLM off  ( DLM allow)
> MSG off
> TCP off
> CONN off
> VOTE off
> DLM_DOMAIN off
> HB_BIO off
> BASTS off
> DLMFS off
> ERROR allow
> DLM_MASTER off
> KTHREAD off
> NOTICE allow
> QUORUM off
> SOCKET off
> DLM_GLUE off
> DLM_THREAD off
> DLM_RECOVERY off
> HEARTBEAT off
> CLUSTER off
> 
> Regards
> Prabu
> 
> 
> 
>  On Wed, 23 Dec 2015 07:30:54 +0530 Joseph Qi wrote:
> 
> So you mean the four nodes are manually rebooted? If so you must
> analyze messages before you rebooted.
> If there are not enough messages, you can switch on some messages. IMO,
> mostly hang problems are caused by DLM bug, so I suggest switch on DLM
> related log and reproduce.
> You can use debugfs.ocfs2 -l to show all message switches and switch on
> you want. For example,
> # debugfs.ocfs2 -l DLM allow
> 
> Thanks,
> Joseph
> 
> On 2015/12/22 21:47, gjprabu wrote:
> > Hi Joseph,
> >
> > We are facing ocfs2 server hang problem frequently and suddenly 4 nodes 
> going to hang stat expect 1 node. After reboot everything is come to normal, 
> this behavior happend many times. Do we have any debug and fix for this issue.
> >
> > Regards
> > Prabu
> >
> >
> >  On Tue, 22 Dec 2015 16:30:52 +0530 Joseph Qi wrote:
> >
> > Hi Prabu,
> > From the log you provided, I can only see that node 5 disconnected with
> > node 2, 3, 1 and 4. It seemed that something wrong happened on the four
> > nodes, and node 5 did recovery for them. After that, the four nodes
> > joined again.
> >
> > On 2015/12/22 16:23, gjprabu wrote:
> > > Hi,
> > >
> > > Anybody please help me on this issue.
> > >
> > > Regards
> > > Prabu
> > >
> > >  On Mon, 21 Dec 2015 15:16:49 +0530 gjprabu wrote:
> > >
> > > Dear Team,
> > >
> > > Ocfs2 clients are getting hang often and unusable. Please find the 
> logs. Kindly provide the solution, it will be highly appreciated.
> > >
> > >
> > > [3659684.042530] o2dlm: Node 4 joins domain 
> A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
> > >
> > > [3992993.101490] 
> (kworker/u192:1,63211,24):dlm_create_lock_handler:515 ERROR: dlm status = 
> DLM_IVLOCKID
> > > [3993002.193285] 
> (kworker/u192:1,63211,24):dlm_deref_lockres_handler:2267 ERROR: 
> A895BC216BE641A8A7E20AA89D57E051:M0062d2dcd0: bad lockres 
> name
> > > [3993032.457220] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 
> ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 2
> > > [3993062.547989] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 
> ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 2
> > > [3993064.860776] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
> ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 2
> > > [3993064.860804] o2cb: o2dlm has evicted node 2 from domain 
> A895BC216BE641A8A7E20AA89D57E051
> > > [3993073.280062] o2dlm: Begin recovery on domain 
> A895BC216BE641A8A7E20AA89D57E051 for node 2
> > > [3993094.623695] (dlm_thread,46268,8):dlm_send_proxy_ast_msg:484 
> ERROR: A895BC216BE641A8A7E20AA89D57E051: res S02, 
> error -112 send AST to node 4
> > > [3993094.624281] (dlm_thread,46268,8):dlm_flush_asts:605 ERROR: 
> status = -112
> > > [3993094.687668] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
> ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 3
> > > [3993094.815662] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 
> ERROR: Error -112 when sending message 514 (key 0xc3460ae7) to node 1
> > > [3993094.816118] 
> (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = 
> -112
> > > [3993124.778525] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 
> ERROR: Error -107 when sending message 514 (key 0xc3460ae7) to node 3
> > > [3993124.779032] 
> (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = 
> -107
> > > [3993133.332516] o2cb: o2dlm has evicted node 3 from domain 
> A895BC216BE641A8A7E20AA89D57E051
> > > [3993139.915122] o2cb: o2dlm has evicted node 1 from domain 
> A895BC216BE641A8A7E20AA89D57E051
> > > [3993147.071956] o2cb: o2dlm has evicted node 4 from domain 
> A895BC216BE641A8A7E20AA89D57E051
> > > [3993147.071968] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 
> ERROR: Error -107 when sending 

Re: [Ocfs2-users] Ocfs2 clients hang

2015-12-22 Thread gjprabu
Hi Joseph,



I have enabled the requested switches. Will the DLM log now be captured so we can 
analyze further? Also, do we need to enable any network-side setting to allow max packets?



debugfs.ocfs2 -l


DLM allow

MSG off

TCP off

CONN off

VOTE off

DLM_DOMAIN off

HB_BIO off

BASTS allow

DLMFS off

ERROR allow

DLM_MASTER off

KTHREAD off

NOTICE allow

QUORUM off

SOCKET off

DLM_GLUE off

DLM_THREAD off

DLM_RECOVERY allow

HEARTBEAT off

CLUSTER off



Regards

Prabu





 On Wed, 23 Dec 2015 07:51:38 +0530 Joseph Qi <joseph...@huawei.com> wrote:




Please also switch on BASTS and DLM_RECOVERY. 

 

On 2015/12/23 10:11, gjprabu wrote: 

 HI Joseph, 

 

 Our current setup is having below details and DLM is now allowed (DLM 
allow). Do you suggest any other option to get more logs. 

 

 debugfs.ocfs2 -l 

 DLM off ( DLM allow) 

 MSG off 

 TCP off 

 CONN off 

 VOTE off 

 DLM_DOMAIN off 

 HB_BIO off 

 BASTS off 

 DLMFS off 

 ERROR allow 

 DLM_MASTER off 

 KTHREAD off 

 NOTICE allow 

 QUORUM off 

 SOCKET off 

 DLM_GLUE off 

 DLM_THREAD off 

 DLM_RECOVERY off 

 HEARTBEAT off 

 CLUSTER off 

 

 Regards 

 Prabu 


 

 

 

  On Wed, 23 Dec 2015 07:30:54 +0530 Joseph Qi <joseph...@huawei.com> wrote:

 

 So you mean the four nodes are manually rebooted? If so you must 

 analyze messages before you rebooted. 

 If there are not enough messages, you can switch on some messages. IMO, 

 mostly hang problems are caused by DLM bug, so I suggest switch on DLM 

 related log and reproduce. 

 You can use debugfs.ocfs2 -l to show all message switches and switch on 

 you want. For example, 

 # debugfs.ocfs2 -l DLM allow 

 

 Thanks, 

 Joseph 

 

 On 2015/12/22 21:47, gjprabu wrote: 

  Hi Joseph, 

  

  We are facing ocfs2 server hang problem frequently and suddenly 4 
nodes going to hang stat expect 1 node. After reboot everything is come to 
normal, this behavior happend many times. Do we have any debug and fix for this 
issue. 

  

  Regards 

  Prabu 

  

  

   On Tue, 22 Dec 2015 16:30:52 +0530 Joseph Qi <joseph...@huawei.com> wrote:

  

  Hi Prabu, 

  From the log you provided, I can only see that node 5 disconnected 
with 

  node 2, 3, 1 and 4. It seemed that something wrong happened on the 
four 

  nodes, and node 5 did recovery for them. After that, the four nodes 

  joined again. 

  

  On 2015/12/22 16:23, gjprabu wrote: 

   Hi, 

   

   Anybody please help me on this issue. 

   

   Regards 

   Prabu 

   

    On Mon, 21 Dec 2015 15:16:49 +0530 gjprabu <gjpr...@zohocorp.com> wrote:

   

   Dear Team, 

   

   Ocfs2 clients are getting hang often and unusable. Please find 
the logs. Kindly provide the solution, it will be highly appreciated. 

   

   

   [3659684.042530] o2dlm: Node 4 joins domain 
A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes 

   

   [3992993.101490] 
(kworker/u192:1,63211,24):dlm_create_lock_handler:515 ERROR: dlm status = 
DLM_IVLOCKID 

   [3993002.193285] 
(kworker/u192:1,63211,24):dlm_deref_lockres_handler:2267 ERROR: 
A895BC216BE641A8A7E20AA89D57E051:M0062d2dcd0: bad lockres 
name 

   [3993032.457220] 
(kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: Error -112 when 
sending message 502 (key 0xc3460ae7) to node 2 

   [3993062.547989] 
(kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: Error -107 when 
sending message 502 (key 0xc3460ae7) to node 2 

   [3993064.860776] 
(kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when 
sending message 502 (key 0xc3460ae7) to node 2 

   [3993064.860804] o2cb: o2dlm has evicted node 2 from domain 
A895BC216BE641A8A7E20AA89D57E051 

   [3993073.280062] o2dlm: Begin recovery on domain 
A895BC216BE641A8A7E20AA89D57E051 for node 2 

   [3993094.623695] (dlm_thread,46268,8):dlm_send_proxy_ast_msg:484 
ERROR: A895BC216BE641A8A7E20AA89D57E051: res S02, 
error -112 send AST to node 4 

   [3993094.624281] (dlm_thread,46268,8):dlm_flush_asts:605 ERROR: 
status = -112 

   [3993094.687668] 
(kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -112 when 
sending message 502 (key 0xc3460ae7) to node 3 

   [3993094.815662] 
(dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: Error -112 when 
sending message 514 (key 0xc3460ae7) to node 1 

   [3993094.816118] 
(dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -112 

   [3993124.778525] 
(dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: Error -107 when 
sending message 514 (key 0xc3460ae7) to node 3 

   [3993124.779032] 
(dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -107 

   [3993133.332516] o2cb: o2dlm has evicted node 3 from domain 
A895BC216BE641A8A7E20AA89D57E051 

   [3993139.915122] o2cb: o2dlm has evicted node 1 from domain 

Re: [Ocfs2-users] Ocfs2 clients hang

2015-12-22 Thread Joseph Qi
So you mean the four nodes were manually rebooted? If so, you must
analyze the messages from before you rebooted.
If there are not enough messages, you can switch on some message categories. IMO,
most hang problems are caused by DLM bugs, so I suggest switching on the
DLM-related logs and reproducing the issue.
You can use debugfs.ocfs2 -l to show all message switches and switch on the ones
you want. For example,
# debugfs.ocfs2 -l DLM allow
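
A minimal sketch of that workflow (the grep filter and output path are only 
illustrative; the switch names are exactly those reported by debugfs.ocfs2 -l, and the 
extra messages go to the kernel log rather than to a separate file):

# show the current state of all message switches
debugfs.ocfs2 -l

# enable a category, reproduce the hang, then collect the kernel log
debugfs.ocfs2 -l DLM allow
dmesg | grep -i dlm > /tmp/dlm-messages.txt

# switch it back off afterwards to avoid flooding the log
debugfs.ocfs2 -l DLM off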

Thanks,
Joseph

On 2015/12/22 21:47, gjprabu wrote:
> Hi Joseph,
> 
>   We are facing ocfs2 server hang problem frequently and suddenly 4 
> nodes going to hang stat expect 1 node. After reboot everything is come to 
> normal, this behavior happend many times. Do we have any debug and fix for 
> this issue.
> 
> Regards
> Prabu
> 
> 
>  On Tue, 22 Dec 2015 16:30:52 +0530 Joseph Qi wrote:
> 
> Hi Prabu,
> From the log you provided, I can only see that node 5 disconnected with
> node 2, 3, 1 and 4. It seemed that something wrong happened on the four
> nodes, and node 5 did recovery for them. After that, the four nodes
> joined again.
> 
> On 2015/12/22 16:23, gjprabu wrote:
> > Hi,
> >
> > Anybody please help me on this issue.
> >
> > Regards
> > Prabu
> >
> >  On Mon, 21 Dec 2015 15:16:49 +0530 gjprabu wrote:
> >
> > Dear Team,
> >
> > Ocfs2 clients are getting hang often and unusable. Please find the 
> logs. Kindly provide the solution, it will be highly appreciated.
> >
> >
> > [3659684.042530] o2dlm: Node 4 joins domain 
> A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
> >
> > [3992993.101490] (kworker/u192:1,63211,24):dlm_create_lock_handler:515 
> ERROR: dlm status = DLM_IVLOCKID
> > [3993002.193285] 
> (kworker/u192:1,63211,24):dlm_deref_lockres_handler:2267 ERROR: 
> A895BC216BE641A8A7E20AA89D57E051:M0062d2dcd0: bad lockres 
> name
> > [3993032.457220] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 
> ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 2
> > [3993062.547989] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 
> ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 2
> > [3993064.860776] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
> ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 2
> > [3993064.860804] o2cb: o2dlm has evicted node 2 from domain 
> A895BC216BE641A8A7E20AA89D57E051
> > [3993073.280062] o2dlm: Begin recovery on domain 
> A895BC216BE641A8A7E20AA89D57E051 for node 2
> > [3993094.623695] (dlm_thread,46268,8):dlm_send_proxy_ast_msg:484 ERROR: 
> A895BC216BE641A8A7E20AA89D57E051: res S02, error 
> -112 send AST to node 4
> > [3993094.624281] (dlm_thread,46268,8):dlm_flush_asts:605 ERROR: status 
> = -112
> > [3993094.687668] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
> ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 3
> > [3993094.815662] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 
> ERROR: Error -112 when sending message 514 (key 0xc3460ae7) to node 1
> > [3993094.816118] 
> (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = 
> -112
> > [3993124.778525] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 
> ERROR: Error -107 when sending message 514 (key 0xc3460ae7) to node 3
> > [3993124.779032] 
> (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = 
> -107
> > [3993133.332516] o2cb: o2dlm has evicted node 3 from domain 
> A895BC216BE641A8A7E20AA89D57E051
> > [3993139.915122] o2cb: o2dlm has evicted node 1 from domain 
> A895BC216BE641A8A7E20AA89D57E051
> > [3993147.071956] o2cb: o2dlm has evicted node 4 from domain 
> A895BC216BE641A8A7E20AA89D57E051
> > [3993147.071968] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 
> ERROR: Error -107 when sending message 514 (key 0xc3460ae7) to node 4
> > [3993147.071975] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
> ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > [3993147.071997] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
> ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > [3993147.072001] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
> ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > [3993147.072005] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
> ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > [3993147.072009] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
> ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > [3993147.075019] 
> (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = 
> -107
> > [3993147.075353] 

Re: [Ocfs2-users] Ocfs2 clients hang

2015-12-22 Thread gjprabu
Ok, thanks





 On Wed, 23 Dec 2015 09:08:13 +0530 Joseph Qi <joseph...@huawei.com> wrote:




I don't think there is relation with packet size. 

Once reproduced, you can share the messages and I will try my best if 

free. 

 

On 2015/12/23 10:45, gjprabu wrote: 

 Hi Joseph, 

 

 I have enabled requested and Is the DLM log will capture to analyze 
further. Also do we need to enable network side setting for allow max packets. 

 

 debugfs.ocfs2 -l 

 DLM allow 

 MSG off 

 TCP off 

 CONN off 

 VOTE off 

 DLM_DOMAIN off 

 HB_BIO off 

 BASTS allow 

 DLMFS off 

 ERROR allow 

 DLM_MASTER off 

 KTHREAD off 

 NOTICE allow 

 QUORUM off 

 SOCKET off 

 DLM_GLUE off 

 DLM_THREAD off 

 DLM_RECOVERY allow 

 HEARTBEAT off 

 CLUSTER off 

 

 Regards 

 Prabu 

 

 

  On Wed, 23 Dec 2015 07:51:38 +0530 Joseph Qi <joseph...@huawei.com> wrote:

 

 Please also switch on BASTS and DLM_RECOVERY. 

 

 On 2015/12/23 10:11, gjprabu wrote: 

  HI Joseph, 

  

  Our current setup is having below details and DLM is now allowed (DLM 
allow). Do you suggest any other option to get more logs. 

  

  debugfs.ocfs2 -l 

  DLM off ( DLM allow) 

  MSG off 

  TCP off 

  CONN off 

  VOTE off 

  DLM_DOMAIN off 

  HB_BIO off 

  BASTS off 

  DLMFS off 

  ERROR allow 

  DLM_MASTER off 

  KTHREAD off 

  NOTICE allow 

  QUORUM off 

  SOCKET off 

  DLM_GLUE off 

  DLM_THREAD off 

  DLM_RECOVERY off 

  HEARTBEAT off 

  CLUSTER off 

  

  Regards 

  Prabu 


  

  

  

   On Wed, 23 Dec 2015 07:30:54 +0530 Joseph Qi <joseph...@huawei.com> wrote:

  

  So you mean the four nodes are manually rebooted? If so you must 

  analyze messages before you rebooted. 

  If there are not enough messages, you can switch on some messages. 
IMO, 

  mostly hang problems are caused by DLM bug, so I suggest switch on 
DLM 

  related log and reproduce. 

  You can use debugfs.ocfs2 -l to show all message switches and switch 
on 

  you want. For example, 

  # debugfs.ocfs2 -l DLM allow 

  

  Thanks, 

  Joseph 

  

  On 2015/12/22 21:47, gjprabu wrote: 

   Hi Joseph, 

   

   We are facing ocfs2 server hang problem frequently and suddenly 
4 nodes going to hang stat expect 1 node. After reboot everything is come to 
normal, this behavior happend many times. Do we have any debug and fix for this 
issue. 

   

   Regards 

   Prabu 

   

   

    On Tue, 22 Dec 2015 16:30:52 +0530 Joseph Qi <joseph...@huawei.com> wrote:

   

   Hi Prabu, 

   From the log you provided, I can only see that node 5 
disconnected with 

   node 2, 3, 1 and 4. It seemed that something wrong happened on 
the four 

   nodes, and node 5 did recovery for them. After that, the four 
nodes 

   joined again. 

   

   On 2015/12/22 16:23, gjprabu wrote: 

Hi, 



Anybody please help me on this issue. 



Regards 

Prabu 



     On Mon, 21 Dec 2015 15:16:49 +0530 gjprabu <gjpr...@zohocorp.com> wrote:



Dear Team, 



Ocfs2 clients are getting hang often and unusable. Please 
find the logs. Kindly provide the solution, it will be highly appreciated. 





[3659684.042530] o2dlm: Node 4 joins domain 
A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes 



[3992993.101490] 
(kworker/u192:1,63211,24):dlm_create_lock_handler:515 ERROR: dlm status = 
DLM_IVLOCKID 

[3993002.193285] 
(kworker/u192:1,63211,24):dlm_deref_lockres_handler:2267 ERROR: 
A895BC216BE641A8A7E20AA89D57E051:M0062d2dcd0: bad lockres 
name 

[3993032.457220] 
(kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: Error -112 when 
sending message 502 (key 0xc3460ae7) to node 2 

[3993062.547989] 
(kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: Error -107 when 
sending message 502 (key 0xc3460ae7) to node 2 

[3993064.860776] 
(kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when 
sending message 502 (key 0xc3460ae7) to node 2 

[3993064.860804] o2cb: o2dlm has evicted node 2 from domain 
A895BC216BE641A8A7E20AA89D57E051 

[3993073.280062] o2dlm: Begin recovery on domain 
A895BC216BE641A8A7E20AA89D57E051 for node 2 

[3993094.623695] 
(dlm_thread,46268,8):dlm_send_proxy_ast_msg:484 ERROR: 
A895BC216BE641A8A7E20AA89D57E051: res S02, error 
-112 send AST to node 4 

[3993094.624281] (dlm_thread,46268,8):dlm_flush_asts:605 
ERROR: status = -112 

[3993094.687668] 
(kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -112 when 
sending message 502 (key 0xc3460ae7) to node 3 

[3993094.815662] 

Re: [Ocfs2-users] Ocfs2 clients hang

2015-12-22 Thread Joseph Qi
I don't think this is related to packet size.
Once it is reproduced, you can share the messages and I will try my best when I am
free.

On 2015/12/23 10:45, gjprabu wrote:
> Hi Joseph,
> 
> I have enabled requested and Is the DLM log will capture to analyze 
> further. Also do we need to enable network side setting for allow max packets.
> 
> debugfs.ocfs2 -l
> DLM allow
> MSG off
> TCP off
> CONN off
> VOTE off
> DLM_DOMAIN off
> HB_BIO off
> BASTS allow
> DLMFS off
> ERROR allow
> DLM_MASTER off
> KTHREAD off
> NOTICE allow
> QUORUM off
> SOCKET off
> DLM_GLUE off
> DLM_THREAD off
> DLM_RECOVERY allow
> HEARTBEAT off
> CLUSTER off
> 
> Regards
> Prabu
> 
> 
>  On Wed, 23 Dec 2015 07:51:38 +0530 Joseph Qi wrote:
> 
> Please also switch on BASTS and DLM_RECOVERY.
> 
> On 2015/12/23 10:11, gjprabu wrote:
> > HI Joseph,
> >
> > Our current setup is having below details and DLM is now allowed (DLM 
> allow). Do you suggest any other option to get more logs.
> >
> > debugfs.ocfs2 -l
> > DLM off ( DLM allow)
> > MSG off
> > TCP off
> > CONN off
> > VOTE off
> > DLM_DOMAIN off
> > HB_BIO off
> > BASTS off
> > DLMFS off
> > ERROR allow
> > DLM_MASTER off
> > KTHREAD off
> > NOTICE allow
> > QUORUM off
> > SOCKET off
> > DLM_GLUE off
> > DLM_THREAD off
> > DLM_RECOVERY off
> > HEARTBEAT off
> > CLUSTER off
> >
> > Regards
> > Prabu
> >
> >
> >
> >  On Wed, 23 Dec 2015 07:30:54 +0530 Joseph Qi wrote:
> >
> > So you mean the four nodes are manually rebooted? If so you must
> > analyze messages before you rebooted.
> > If there are not enough messages, you can switch on some messages. IMO,
> > mostly hang problems are caused by DLM bug, so I suggest switch on DLM
> > related log and reproduce.
> > You can use debugfs.ocfs2 -l to show all message switches and switch on
> > you want. For example,
> > # debugfs.ocfs2 -l DLM allow
> >
> > Thanks,
> > Joseph
> >
> > On 2015/12/22 21:47, gjprabu wrote:
> > > Hi Joseph,
> > >
> > > We are facing ocfs2 server hang problem frequently and suddenly 4 
> nodes going to hang stat expect 1 node. After reboot everything is come to 
> normal, this behavior happend many times. Do we have any debug and fix for 
> this issue.
> > >
> > > Regards
> > > Prabu
> > >
> > >
> > >  On Tue, 22 Dec 2015 16:30:52 +0530 Joseph Qi wrote:
> > >
> > > Hi Prabu,
> > > From the log you provided, I can only see that node 5 disconnected 
> with
> > > node 2, 3, 1 and 4. It seemed that something wrong happened on the 
> four
> > > nodes, and node 5 did recovery for them. After that, the four nodes
> > > joined again.
> > >
> > > On 2015/12/22 16:23, gjprabu wrote:
> > > > Hi,
> > > >
> > > > Anybody please help me on this issue.
> > > >
> > > > Regards
> > > > Prabu
> > > >
> > > >  On Mon, 21 Dec 2015 15:16:49 +0530 gjprabu wrote:
> > > > Dear Team,
> > > >
> > > > Ocfs2 clients are getting hang often and unusable. Please find the 
> logs. Kindly provide the solution, it will be highly appreciated.
> > > >
> > > >
> > > > [3659684.042530] o2dlm: Node 4 joins domain 
> A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
> > > >
> > > > [3992993.101490] 
> (kworker/u192:1,63211,24):dlm_create_lock_handler:515 ERROR: dlm status = 
> DLM_IVLOCKID
> > > > [3993002.193285] 
> (kworker/u192:1,63211,24):dlm_deref_lockres_handler:2267 ERROR: 
> A895BC216BE641A8A7E20AA89D57E051:M0062d2dcd0: bad lockres 
> name
> > > > [3993032.457220] 
> (kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: Error -112 when 
> sending message 502 (key 0xc3460ae7) to node 2
> > > > [3993062.547989] 
> (kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: Error -107 when 
> sending message 502 (key 0xc3460ae7) to node 2
> > > > [3993064.860776] 
> (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when 
> sending message 502 (key 0xc3460ae7) to node 2
> > > > [3993064.860804] o2cb: o2dlm has evicted node 2 from domain 
> A895BC216BE641A8A7E20AA89D57E051
> > > > [3993073.280062] o2dlm: Begin recovery on domain 
> A895BC216BE641A8A7E20AA89D57E051 for node 2
> > > > [3993094.623695] (dlm_thread,46268,8):dlm_send_proxy_ast_msg:484 
> ERROR: 

Re: [Ocfs2-users] Ocfs2 clients hang

2015-12-22 Thread gjprabu
Hi,



   Can anybody please help me with this issue?



Regards

Prabu


 On Mon, 21 Dec 2015 15:16:49 +0530 gjprabu <gjpr...@zohocorp.com> wrote:




Dear Team,

 

  Ocfs2 clients are hanging often and becoming unusable. Please find the 
logs. Kindly provide a solution; it will be highly appreciated.





[3659684.042530] o2dlm: Node 4 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 
1 2 3 4 5 ) 5 nodes



[3992993.101490] (kworker/u192:1,63211,24):dlm_create_lock_handler:515 ERROR: 
dlm status = DLM_IVLOCKID

[3993002.193285] (kworker/u192:1,63211,24):dlm_deref_lockres_handler:2267 
ERROR: A895BC216BE641A8A7E20AA89D57E051:M0062d2dcd0: bad 
lockres name

[3993032.457220] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: 
Error -112 when sending message 502 (key 0xc3460ae7) to node 2

[3993062.547989] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: 
Error -107 when sending message 502 (key 0xc3460ae7) to node 2

[3993064.860776] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: 
Error -107 when sending message 502 (key 0xc3460ae7) to node 2

[3993064.860804] o2cb: o2dlm has evicted node 2 from domain 
A895BC216BE641A8A7E20AA89D57E051

[3993073.280062] o2dlm: Begin recovery on domain 
A895BC216BE641A8A7E20AA89D57E051 for node 2

[3993094.623695] (dlm_thread,46268,8):dlm_send_proxy_ast_msg:484 ERROR: 
A895BC216BE641A8A7E20AA89D57E051: res S02, error 
-112 send AST to node 4

[3993094.624281] (dlm_thread,46268,8):dlm_flush_asts:605 ERROR: status = -112

[3993094.687668] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: 
Error -112 when sending message 502 (key 0xc3460ae7) to node 3

[3993094.815662] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: 
Error -112 when sending message 514 (key 0xc3460ae7) to node 1

[3993094.816118] (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 
ERROR: status = -112

[3993124.778525] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: 
Error -107 when sending message 514 (key 0xc3460ae7) to node 3

[3993124.779032] (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 
ERROR: status = -107

[3993133.332516] o2cb: o2dlm has evicted node 3 from domain 
A895BC216BE641A8A7E20AA89D57E051

[3993139.915122] o2cb: o2dlm has evicted node 1 from domain 
A895BC216BE641A8A7E20AA89D57E051

[3993147.071956] o2cb: o2dlm has evicted node 4 from domain 
A895BC216BE641A8A7E20AA89D57E051

[3993147.071968] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: 
Error -107 when sending message 514 (key 0xc3460ae7) to node 4

[3993147.071975] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: 
Error -107 when sending message 502 (key 0xc3460ae7) to node 4

[3993147.071997] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: 
Error -107 when sending message 502 (key 0xc3460ae7) to node 4

[3993147.072001] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: 
Error -107 when sending message 502 (key 0xc3460ae7) to node 4

[3993147.072005] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: 
Error -107 when sending message 502 (key 0xc3460ae7) to node 4

[3993147.072009] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: 
Error -107 when sending message 502 (key 0xc3460ae7) to node 4

[3993147.075019] (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 
ERROR: status = -107

[3993147.075353] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: 
link to 1 went down!

[3993147.075701] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: 
status = -107

[3993147.076001] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: 
link to 3 went down!

[3993147.076329] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: 
status = -107

[3993147.076634] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: 
link to 4 went down!

[3993147.076968] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: 
status = -107

[3993147.077275] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1236 ERROR: 
node down! 1

[3993147.077591] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1229 node 3 
up while restarting

[3993147.077594] (dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053 
ERROR: status = -11

[3993155.171570] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: 
link to 3 went down!

[3993155.171874] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: 
status = -107

[3993155.172150] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: 
link to 4 went down!

[3993155.172446] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: 
status = -107

[3993155.172719] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1236 ERROR: 
node down! 3

[3993155.173001] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1229 node 4 
up while restarting

[3993155.173003] (dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053 
ERROR: status = -11

[3993155.173283] 

Re: [Ocfs2-users] Ocfs2 clients hang

2015-12-22 Thread Joseph Qi
Hi Prabu,
From the log you provided, I can only see that node 5 disconnected from
nodes 2, 3, 1 and 4. It seems that something went wrong on those four
nodes, and node 5 performed recovery for them. After that, the four nodes
joined the domain again.
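
By the way, the -112 and -107 status values in those messages are just
negative errno codes (EHOSTDOWN and ENOTCONN), so the dlm errors are a
symptom of the lost connections rather than of dlm state itself. They can
be decoded on any node, for example:

    # decode the two error numbers seen in the dlm messages above
    # (any node with python installed will do)
    python -c "import os; print(os.strerror(112)); print(os.strerror(107))"
    # Host is down
    # Transport endpoint is not connected

Both mean that o2net could no longer reach those nodes over the cluster
interconnect, which matches the evictions that follow in the log.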

On 2015/12/22 16:23, gjprabu wrote:
> Hi,
> 
>    Could anybody please help with this issue?
> 
> Regards
> Prabu
> 
>  On Mon, 21 Dec 2015 15:16:49 +0530 *gjprabu* wrote 
> 
> 
> Dear Team,
>  
>   OCFS2 clients are hanging frequently and becoming unusable. Please find 
> the logs. Kindly provide a solution; it will be highly appreciated.

Re: [Ocfs2-users] Ocfs2 clients hang

2015-12-22 Thread gjprabu
Hi Joseph,



  We are facing this OCFS2 hang problem frequently: 4 of the nodes 
suddenly go into a hung state and only 1 node stays usable. After a reboot 
everything returns to normal; this behavior has happened many times. Is there 
any way to debug and fix this issue?
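
If it helps, the next time the hang occurs we can collect the cluster and 
lock state from the surviving node before rebooting, along these lines (this 
is only a sketch: it assumes the default o2cb stack and the debugfs.ocfs2 
tool from ocfs2-tools, and /dev/<ocfs2-device> is a placeholder for our 
volume):

    service o2cb status                               # cluster / heartbeat state
    debugfs.ocfs2 -R "fs_locks" /dev/<ocfs2-device>   # cluster locks held per inode
    debugfs.ocfs2 -R "dlm_locks" /dev/<ocfs2-device>  # o2dlm lock resources
    cat /sys/kernel/config/cluster/*/idle_timeout_ms  # o2net idle timeout in use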



Regards

Prabu





 On Tue, 22 Dec 2015 16:30:52 +0530 Joseph Qi joseph...@huawei.com wrote  




Hi Prabu, 

From the log you provided, I can only see that node 5 disconnected from 

nodes 2, 3, 1 and 4. It seems that something went wrong on those four 

nodes, and node 5 performed recovery for them. After that, the four nodes 

joined the domain again. 

 

On 2015/12/22 16:23, gjprabu wrote: 

 Hi, 

 

 Could anybody please help with this issue? 

 

 Regards 

 Prabu 

 

  On Mon, 21 Dec 2015 15:16:49 +0530 *gjprabu 
gjpr...@zohocorp.com* wrote  

 

 Dear Team, 

 

 OCFS2 clients are hanging frequently and becoming unusable. Please find the 
logs. Kindly provide a solution; it will be highly appreciated. 

 

 

 [3659684.042530] o2dlm: Node 4 joins domain 
A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes 

 

 [3992993.101490] (kworker/u192:1,63211,24):dlm_create_lock_handler:515 
ERROR: dlm status = DLM_IVLOCKID 

 [3993002.193285] (kworker/u192:1,63211,24):dlm_deref_lockres_handler:2267 
ERROR: A895BC216BE641A8A7E20AA89D57E051:M0062d2dcd0: bad 
lockres name 

 [3993032.457220] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 
ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 2 

 [3993062.547989] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 
ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 2 

 [3993064.860776] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 2 

 [3993064.860804] o2cb: o2dlm has evicted node 2 from domain 
A895BC216BE641A8A7E20AA89D57E051 

 [3993073.280062] o2dlm: Begin recovery on domain 
A895BC216BE641A8A7E20AA89D57E051 for node 2 

 [3993094.623695] (dlm_thread,46268,8):dlm_send_proxy_ast_msg:484 ERROR: 
A895BC216BE641A8A7E20AA89D57E051: res S02, error 
-112 send AST to node 4 

 [3993094.624281] (dlm_thread,46268,8):dlm_flush_asts:605 ERROR: status = 
-112 

 [3993094.687668] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 3 

 [3993094.815662] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 
ERROR: Error -112 when sending message 514 (key 0xc3460ae7) to node 1 

 [3993094.816118] 
(dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -112 

 [3993124.778525] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 
ERROR: Error -107 when sending message 514 (key 0xc3460ae7) to node 3 

 [3993124.779032] 
(dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -107 

 [3993133.332516] o2cb: o2dlm has evicted node 3 from domain 
A895BC216BE641A8A7E20AA89D57E051 

 [3993139.915122] o2cb: o2dlm has evicted node 1 from domain 
A895BC216BE641A8A7E20AA89D57E051 

 [3993147.071956] o2cb: o2dlm has evicted node 4 from domain 
A895BC216BE641A8A7E20AA89D57E051 

 [3993147.071968] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 
ERROR: Error -107 when sending message 514 (key 0xc3460ae7) to node 4 

 [3993147.071975] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4 

 [3993147.071997] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4 

 [3993147.072001] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4 

 [3993147.072005] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4 

 [3993147.072009] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 
ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4 

 [3993147.075019] 
(dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -107 

 [3993147.075353] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 
ERROR: link to 1 went down! 

 [3993147.075701] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 
ERROR: status = -107 

 [3993147.076001] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 
ERROR: link to 3 went down! 

 [3993147.076329] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 
ERROR: status = -107 

 [3993147.076634] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 
ERROR: link to 4 went down! 

 [3993147.076968] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 
ERROR: status = -107 

 [3993147.077275] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1236 
ERROR: node down! 1 

 [3993147.077591] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1229 
node 3 up while restarting 

 [3993147.077594] (dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053