If the system hangs, you should figure out which process is hung, as well as its stack, before restarting the system.
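For example, a rough sketch of how to do that (assuming sysrq is enabled; <pid> and <device> are placeholders to fill in):

# list processes stuck in uninterruptible sleep (D state)
ps -eo pid,stat,comm | awk '$2 ~ /^D/'
# dump the kernel stack of a suspect process
cat /proc/<pid>/stack
# or dump all blocked tasks into the kernel log
echo w > /proc/sysrq-trigger
# if your debugfs.ocfs2 supports it, list the busy lock resources
debugfs.ocfs2 -R "fs_locks -B" <device>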
On 2015/12/28 20:16, gjprabu wrote:
> Joseph,
>
> Do you see anything like a kernel issue in the logs below? After a certain point in time, no dlm logs are found.
>
> Dec 27 22:21:22 integ-hm8 kernel: (ocfs2rec,68213,10):dlmconvert_remote:270 type=0, convert_type=-1, busy=0
> Dec 27 22:21:22 integ-hm8 kernel: (ocfs2rec,68213,10):dlmconvert_remote:275 bailing out early since res is RECOVERING on secondary queue
> Dec 27 22:21:22 integ-hm8 kernel: (ocfs2rec,68213,10):dlmlock:652 retrying convert with migration/recovery/in-progress
> Dec 27 22:21:22 integ-hm8 kernel: (kworker/u192:1,56020,17):dlm_send_one_lockres:1289 sending to 2
> Dec 27 22:21:22 integ-hm8 kernel: (kworker/u192:1,56020,17):dlm_send_mig_lockres_msg:1138 A895BC216BE641A8A7E20AA89D57E051:M000000000000008ec64cbc00000000: sending mig lockres (recovery) to 2
> Dec 27 22:21:22 integ-hm8 kernel: (kworker/u192:1,56020,1):dlm_send_one_lockres:1289 sending to 2
> Dec 27 22:21:22 integ-hm8 kernel: (kworker/u192:1,56020,1):dlm_send_mig_lockres_msg:1138 A895BC216BE641A8A7E20AA89D57E051:O000000000000006e7df72000000000: sending mig lockres (recovery) to 2
> Dec 27 22:21:22 integ-hm8 kernel: (kworker/u192:1,56020,1):dlm_send_one_lockres:1289 sending to 2
> Dec 27 22:21:22 integ-hm8 kernel: (kworker/u192:1,56020,1):dlm_send_mig_lockres_msg:1138 A895BC216BE641A8A7E20AA89D57E051:M000000000000006e7df72600000000: sending mig lockres (recovery) to 2
> Dec 27 22:21:22 integ-hm8 kernel: (kworker/u192:1,56020,1):dlm_send_one_lockres:1289 sending to 2
> Dec 27 22:21:22 integ-hm8 kernel: (kworker/u192:1,56020,1):dlm_send_mig_lockres_msg:1138 A895BC216BE641A8A7E20AA89D57E051:O000000000000006e7df72600000000: sending mig lockres (recovery) to 2
> Dec 27 22:21:22 integ-hm8 kernel: (kworker/u192:1,56020,1):dlm_send_one_lockres:1289 sending to 2
> Dec 27 22:21:22 integ-hm8 kernel: (kworker/u192:1,56020,1):dlm_send_mig_lockres_msg:1138 A895BC216BE641A8A7E20AA89D57E051:M000000000000006e7df72900000000: sending mig lockres (recovery) to 2
> Dec 27 22:21:22 integ-hm8 kernel: (kworker/u192:1,56020,9):dlm_send_one_lockres:1289 sending to 2
> Dec 27 22:21:22 integ-hm8 kernel: (kworker/u192:1,56020,9):dlm_send_mig_lockres_msg:1138 A895BC216BE641A8A7E20AA89D57E051:O000000000000006e7df72900000000: sending mig lockres (recovery) to 2
> Dec 27 22:21:22 integ-hm8 kernel: (kworker/u192:1,56020,9):dlm_send_one_lockres:1289 sending to 2
> Dec 27 22:21:22 integ-hm8 kernel: (kworker/u192:1,56020,9):dlm_send_mig_lockres_msg:1138 A895BC216BE641A8A7E20AA89D57E051:M000000000000006e7df72c00000000: sending mig lockres (recovery) to 2
> Dec 27 22:21:22 integ-hm8 kernel: (kworker/u192:1,56020,9):dlm_send_one_lockres:1289 sending to 2
> Dec 27 22:21:22 integ-hm8 kernel: (kworker/u192:1,56020,9):dlm_send_mig_lockres_msg:1138 A895BC216BE641A8A7E20AA89D57E051:O000000000000006e7df72c00000000: sending mig lockres (recovery) to 2
> Dec 27 22:21:22 integ-hm8 kernel: (kworker/u192:1,56020,9):dlm_send_one_lockres:1289 sending to 2
> Dec 27 22:55:52 integ-hm8 rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="1331" x-info="http://www.rsyslog.com"] exiting on signal 15.
> Dec 27 23:05:25 integ-hm8 rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="1346" x-info="http://www.rsyslog.com"] start
> Dec 27 23:05:08 integ-hm8 journal: Runtime journal is using 8.0M (max 4.0G, leaving 4.0G of free 251.9G, current limit 4.0G).
> Dec 27 23:05:08 integ-hm8 kernel: Initializing cgroup subsys cpuset
> Dec 27 23:05:08 integ-hm8 kernel: Initializing cgroup subsys cpu
> Dec 27 23:05:08 integ-hm8 kernel: Initializing cgroup subsys cpuacct
> Dec 27 23:05:08 integ-hm8 kernel: Linux version 3.10.91 (root@integ-hm8) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Thu Oct 29 11:52:34 IST 2015
> Dec 27 23:05:08 integ-hm8 kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.91 root=UUID=20d0873d-8f3e-455a-ba67-f8f336b5e9a7 ro crashkernel=auto rhgb quiet LANG=en_US.UTF-8 systemd.debug
> Dec 27 23:05:08 integ-hm8 kernel: e820: BIOS-provided physical RAM map:
> Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009bfff] usable
> Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bd2cffff] usable
> Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 0x00000000bd2d0000-0x00000000bd2fbfff] reserved
> Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 0x00000000bd2fc000-0x00000000bd35afff] ACPI data
> Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 0x00000000bd35b000-0x00000000bfffffff] reserved
> Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved
> Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 0x00000000fe000000-0x00000000ffffffff] reserved
> Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem 0x0000000100000000-0x000000803fffffff] usable
> Dec 27 23:05:08 integ-hm8 kernel: NX (Execute Disable) protection: active
> Dec 27 23:05:08 integ-hm8 kernel: SMBIOS 2.7 present.
> Dec 27 23:05:08 integ-hm8 kernel: No AGP bridge found
> Dec 27 23:05:08 integ-hm8 kernel: e820: last_pfn = 0x8040000 max_arch_pfn = 0x400000000
> Dec 27 23:05:08 integ-hm8 kernel: x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
> Dec 27 23:05:08 integ-hm8 kernel: total RAM covered: 524288M
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K  chunk_size: 64K  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K  chunk_size: 128K  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K  chunk_size: 256K  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K  chunk_size: 512K  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K  chunk_size: 1M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K  chunk_size: 2M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K  chunk_size: 4M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K  chunk_size: 8M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K  chunk_size: 16M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K  chunk_size: 32M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K  chunk_size: 64M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K  chunk_size: 128M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K  chunk_size: 256M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K  chunk_size: 512M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K  chunk_size: 1G  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: *BAD* gran_size: 64K  chunk_size: 2G  num_reg: 10  lose cover RAM: -1G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K  chunk_size: 128K  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K  chunk_size: 256K  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K  chunk_size: 512K  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K  chunk_size: 1M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K  chunk_size: 2M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K  chunk_size: 4M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K  chunk_size: 8M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K  chunk_size: 16M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K  chunk_size: 32M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K  chunk_size: 64M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K  chunk_size: 128M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K  chunk_size: 256M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K  chunk_size: 512M  num_reg: 10  lose cover RAM: 0G
> Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K  chunk_size: 1G  num_reg: 10  lose cover RAM: 0G
>
> Regards
> Prabu
>
> ---- On Mon, 28 Dec 2015 15:18:02 +0530 gjprabu <gjpr...@zohocorp.com> wrote ----
>
> Yes, it hung on all 5 nodes; after a restart everything is fine.
>
> Regards
> Prabu
>
> ---- On Mon, 28 Dec 2015 15:00:57 +0530 Joseph Qi <joseph...@huawei.com> wrote ----
>
> So which process hangs? And which lockres is it waiting for?
> From the log I cannot get that information.
>
> On 2015/12/28 16:46, gjprabu wrote:
> > Hi Joseph,
> >
> > Again we are facing the same issue. Please find the logs from when the problem occurred.
> > Dec 27 21:45:44 integ-hm5 kernel: (dlm_thread,46268,24):dlm_update_lvb:206 getting lvb from lockres for master node
> > Dec 27 21:45:44 integ-hm5 kernel: (dlm_thread,46268,24):ocfs2_locking_ast:1076 AST fired for lockres M000000000000008478220200000000, action 1, unlock 0, level -1 => 3
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__ocfs2_cluster_lock:1465 lockres N000000008340963d, convert from -1 to 3
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_get_lock_resource:724 get lockres N000000008340963d (len 31)
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_lookup_lockres_full:198 N000000008340963d
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_lockres_grab_inflight_ref:663 A895BC216BE641A8A7E20AA89D57E051: res N000000008340963d, inflight++: now 1, dlm_lockres_grab_inflight_ref()
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock:690 type=3, flags = 0x0
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock:691 creating lock: lock=ffff8801824b4500 res=ffff88265dbf2bc0
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock_master:131 type=3
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock_master:148 I can grant this lock right away
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_dirty_lockres:483 A895BC216BE641A8A7E20AA89D57E051: res N000000008340963d
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_lockres_drop_inflight_ref:684 A895BC216BE641A8A7E20AA89D57E051: res N000000008340963d, inflight--: now 0, dlmlock()
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_dirty_lockres:483 A895BC216BE641A8A7E20AA89D57E051: res N000000008340963d
> > Dec 27 21:45:44 integ-hm5 kernel: (dlm_thread,46268,24):dlm_flush_asts:541 A895BC216BE641A8A7E20AA89D57E051: res N000000008340963d, Flush AST for lock 5:441609912, type 3, node 5
> > Dec 27 21:45:44 integ-hm5 kernel: (dlm_thread,46268,24):dlm_do_local_ast:232 A895BC216BE641A8A7E20AA89D57E051: res N000000008340963d, lock 5:441609912, Local AST
> > Dec 27 21:45:44 integ-hm5 kernel: (dlm_thread,46268,24):ocfs2_locking_ast:1076 AST fired for lockres N000000008340963d, action 1, unlock 0, level -1 => 3
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__ocfs2_cluster_lock:1465 lockres O000000000000008478220400000000, convert from -1 to 3
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_get_lock_resource:724 get lockres O000000000000008478220400000000 (len 31)
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_lookup_lockres_full:198 O000000000000008478220400000000
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_get_lock_resource:778 allocating a new resource
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_lookup_lockres_full:198 O000000000000008478220400000000
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_get_lock_resource:789 no lockres found, allocated our own: ffff880717e38780
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_insert_lockres:187 A895BC216BE641A8A7E20AA89D57E051: Hash res O000000000000008478220400000000
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):__dlm_lockres_grab_inflight_ref:663 A895BC216BE641A8A7E20AA89D57E051: res O000000000000008478220400000000, inflight++: now 1, dlm_get_lock_resource()
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_do_master_request:1364 node 1 not master, response=NO
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_do_master_request:1364 node 2 not master, response=NO
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_do_master_request:1364 node 3 not master, response=NO
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_do_master_request:1364 node 4 not master, response=NO
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_wait_for_lock_mastery:1122 about to master O000000000000008478220400000000 here, this=5
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlm_do_assert_master:1668 sending assert master to 1 (O000000000000008478220400000000)
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_do_assert_master:1668 sending assert master to 2 (O000000000000008478220400000000)
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_do_assert_master:1668 sending assert master to 3 (O000000000000008478220400000000)
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_do_assert_master:1668 sending assert master to 4 (O000000000000008478220400000000)
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_get_lock_resource:968 A895BC216BE641A8A7E20AA89D57E051: res O000000000000008478220400000000, Mastered by 5
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_mle_release:436 Releasing mle for O000000000000008478220400000000, type 1
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlmlock:690 type=3, flags = 0x0
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlmlock:691 creating lock: lock=ffff8801824b4680 res=ffff880717e38780
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlmlock_master:131 type=3
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlmlock_master:148 I can grant this lock right away
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):__dlm_dirty_lockres:483 A895BC216BE641A8A7E20AA89D57E051: res O000000000000008478220400000000
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_lockres_drop_inflight_ref:684 A895BC216BE641A8A7E20AA89D57E051: res O000000000000008478220400000000, inflight--: now 0, dlmlock()
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):__dlm_dirty_lockres:483 A895BC216BE641A8A7E20AA89D57E051: res O000000000000008478220400000000
> > Dec 27 21:45:44 integ-hm5 kernel: (dlm_thread,46268,24):dlm_flush_asts:541 A895BC216BE641A8A7E20AA89D57E051: res O000000000000008478220400000000, Flush AST for lock 5:441609913, type 3, node 5
> > Dec 27 21:45:44 integ-hm5 kernel: (dlm_thread,46268,24):dlm_do_local_ast:232 A895BC216BE641A8A7E20AA89D57E051: res O000000000000008478220400000000, lock 5:441609913, Local AST
> > Dec 27 21:45:44 integ-hm5 kernel: (dlm_thread,46268,24):ocfs2_locking_ast:1076 AST fired for lockres O000000000000008478220400000000, action 1, unlock 0, level -1 => 3
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):__ocfs2_cluster_lock:1465 lockres M000000000000008478220400000000, convert from -1 to 3
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_get_lock_resource:724 get lockres M000000000000008478220400000000 (len 31)
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):__dlm_lookup_lockres_full:198 M000000000000008478220400000000
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_get_lock_resource:778 allocating a new resource
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):__dlm_lookup_lockres_full:198 M000000000000008478220400000000
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_get_lock_resource:789 no lockres found, allocated our own: ffff8803ba843e80
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):__dlm_insert_lockres:187 A895BC216BE641A8A7E20AA89D57E051: Hash res M000000000000008478220400000000
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):__dlm_lockres_grab_inflight_ref:663 A895BC216BE641A8A7E20AA89D57E051: res M000000000000008478220400000000, inflight++: now 1, dlm_get_lock_resource()
> > Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_do_master_request:1364 node 1 not master, response=NO
> >
> > Regards
> > Prabu GJ
> >
> > ---- On Wed, 23 Dec 2015 10:05:10 +0530 gjprabu <gjpr...@zohocorp.com> wrote ----
> >
> > Ok, thanks
> >
> > ---- On Wed, 23 Dec 2015 09:08:13 +0530 Joseph Qi <joseph...@huawei.com> wrote ----
> >
> > I don't think there is a relation with packet size.
> > Once it is reproduced, you can share the messages and I will try my best if free.
> >
> > On 2015/12/23 10:45, gjprabu wrote:
> > > Hi Joseph,
> > >
> > > I have enabled the requested switches. Will the DLM log be captured for further analysis? Also, do we need any network-side setting to allow max packets?
> > >
> > > debugfs.ocfs2 -l
> > > DLM allow
> > > MSG off
> > > TCP off
> > > CONN off
> > > VOTE off
> > > DLM_DOMAIN off
> > > HB_BIO off
> > > BASTS allow
> > > DLMFS off
> > > ERROR allow
> > > DLM_MASTER off
> > > KTHREAD off
> > > NOTICE allow
> > > QUORUM off
> > > SOCKET off
> > > DLM_GLUE off
> > > DLM_THREAD off
> > > DLM_RECOVERY allow
> > > HEARTBEAT off
> > > CLUSTER off
> > >
> > > Regards
> > > Prabu
> > >
> > > ---- On Wed, 23 Dec 2015 07:51:38 +0530 Joseph Qi <joseph...@huawei.com> wrote ----
> > >
> > > Please also switch on BASTS and DLM_RECOVERY.
> > >
> > > On 2015/12/23 10:11, gjprabu wrote:
> > > > Hi Joseph,
> > > >
> > > > Our current setup has the details below, and DLM is now allowed (DLM allow). Do you suggest any other option to get more logs?
> > > >
> > > > debugfs.ocfs2 -l
> > > > DLM off ( DLM allow)
> > > > MSG off
> > > > TCP off
> > > > CONN off
> > > > VOTE off
> > > > DLM_DOMAIN off
> > > > HB_BIO off
> > > > BASTS off
> > > > DLMFS off
> > > > ERROR allow
> > > > DLM_MASTER off
> > > > KTHREAD off
> > > > NOTICE allow
> > > > QUORUM off
> > > > SOCKET off
> > > > DLM_GLUE off
> > > > DLM_THREAD off
> > > > DLM_RECOVERY off
> > > > HEARTBEAT off
> > > > CLUSTER off
> > > >
> > > > Regards
> > > > Prabu
> > > >
> > > > ---- On Wed, 23 Dec 2015 07:30:54 +0530 Joseph Qi <joseph...@huawei.com> wrote ----
> > > >
> > > > So you mean the four nodes were manually rebooted? If so, you must analyze the messages from before you rebooted.
> > > > If there are not enough messages, you can switch on some message types. IMO, most hang problems are caused by a DLM bug, so I suggest switching on the DLM-related logs and reproducing. You can use debugfs.ocfs2 -l to show all message switches and switch on the ones you want. For example,
> > > > # debugfs.ocfs2 -l DLM allow
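> > > > As a rough sketch, the other switches from this thread could be enabled the same way (assuming one switch per invocation):
> > > > # debugfs.ocfs2 -l BASTS allow
> > > > # debugfs.ocfs2 -l DLM_RECOVERY allow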
> > > > Thanks,
> > > > Joseph
> > > >
> > > > On 2015/12/22 21:47, gjprabu wrote:
> > > > > Hi Joseph,
> > > > >
> > > > > We are facing an ocfs2 server hang problem frequently: suddenly 4 nodes go into a hung state, all except 1 node. After a reboot everything comes back to normal, and this behavior has happened many times. Do we have any way to debug and fix this issue?
> > > > >
> > > > > Regards
> > > > > Prabu
> > > > >
> > > > > ---- On Tue, 22 Dec 2015 16:30:52 +0530 Joseph Qi <joseph...@huawei.com> wrote ----
> > > > >
> > > > > Hi Prabu,
> > > > > From the log you provided, I can only see that node 5 disconnected from nodes 2, 3, 1 and 4. It seems that something went wrong on those four nodes, and node 5 did recovery for them. After that, the four nodes joined again.
> > > > >
> > > > > On 2015/12/22 16:23, gjprabu wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Can anybody please help me with this issue?
> > > > > >
> > > > > > Regards
> > > > > > Prabu
> > > > > >
> > > > > > ---- On Mon, 21 Dec 2015 15:16:49 +0530 gjprabu <gjpr...@zohocorp.com> wrote ----
> > > > > >
> > > > > > Dear Team,
> > > > > >
> > > > > > Ocfs2 clients are hanging often and becoming unusable. Please find the logs below. Kindly provide a solution; it will be highly appreciated.
> > > > > >
> > > > > > [3659684.042530] o2dlm: Node 4 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
> > > > > >
> > > > > > [3992993.101490] (kworker/u192:1,63211,24):dlm_create_lock_handler:515 ERROR: dlm status = DLM_IVLOCKID
> > > > > > [3993002.193285] (kworker/u192:1,63211,24):dlm_deref_lockres_handler:2267 ERROR: A895BC216BE641A8A7E20AA89D57E051:M0000000000000062d2dcd000000000: bad lockres name
> > > > > > [3993032.457220] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 2
> > > > > > [3993062.547989] (kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 2
> > > > > > [3993064.860776] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 2
> > > > > > [3993064.860804] o2cb: o2dlm has evicted node 2 from domain A895BC216BE641A8A7E20AA89D57E051
> > > > > > [3993073.280062] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 2
> > > > > > [3993094.623695] (dlm_thread,46268,8):dlm_send_proxy_ast_msg:484 ERROR: A895BC216BE641A8A7E20AA89D57E051: res S000000000000000000000200000000, error -112 send AST to node 4
> > > > > > [3993094.624281] (dlm_thread,46268,8):dlm_flush_asts:605 ERROR: status = -112
> > > > > > [3993094.687668] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 3
> > > > > > [3993094.815662] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: Error -112 when sending message 514 (key 0xc3460ae7) to node 1
> > > > > > [3993094.816118] (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -112
> > > > > > [3993124.778525] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: Error -107 when sending message 514 (key 0xc3460ae7) to node 3
> > > > > > [3993124.779032] (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -107
> > > > > > [3993133.332516] o2cb: o2dlm has evicted node 3 from domain A895BC216BE641A8A7E20AA89D57E051
> > > > > > [3993139.915122] o2cb: o2dlm has evicted node 1 from domain A895BC216BE641A8A7E20AA89D57E051
> > > > > > [3993147.071956] o2cb: o2dlm has evicted node 4 from domain A895BC216BE641A8A7E20AA89D57E051
> > > > > > [3993147.071968] (dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: Error -107 when sending message 514 (key 0xc3460ae7) to node 4
> > > > > > [3993147.071975] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > > > > > [3993147.071997] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > > > > > [3993147.072001] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > > > > > [3993147.072005] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > > > > > [3993147.072009] (kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when sending message 502 (key 0xc3460ae7) to node 4
> > > > > > [3993147.075019] (dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -107
> > > > > > [3993147.075353] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 1 went down!
> > > > > > [3993147.075701] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > > > > > [3993147.076001] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 3 went down!
> > > > > > [3993147.076329] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > > > > > [3993147.076634] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 4 went down!
> > > > > > [3993147.076968] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > > > > > [3993147.077275] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1236 ERROR: node down! 1
> > > > > > [3993147.077591] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1229 node 3 up while restarting
> > > > > > [3993147.077594] (dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053 ERROR: status = -11
> > > > > > [3993155.171570] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 3 went down!
> > > > > > [3993155.171874] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > > > > > [3993155.172150] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 4 went down!
> > > > > > [3993155.172446] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > > > > > [3993155.172719] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1236 ERROR: node down! 3
> > > > > > [3993155.173001] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1229 node 4 up while restarting
> > > > > > [3993155.173003] (dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053 ERROR: status = -11
> > > > > > [3993155.173283] (dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 4 went down!
> > > > > > [3993155.173581] (dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > > > > > [3993155.173858] (dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1236 ERROR: node down! 4
> > > > > > [3993155.174135] (dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053 ERROR: status = -11
> > > > > > [3993155.174458] o2dlm: Node 5 (me) is the Recovery Master for the dead node 2 in domain A895BC216BE641A8A7E20AA89D57E051
> > > > > > [3993158.361220] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> > > > > > [3993158.361228] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
> > > > > > [3993158.361305] o2dlm: Node 5 (me) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
> > > > > > [3993161.833543] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> > > > > > [3993161.833551] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 3
> > > > > > [3993161.833620] o2dlm: Node 5 (me) is the Recovery Master for the dead node 3 in domain A895BC216BE641A8A7E20AA89D57E051
> > > > > > [3993165.188817] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> > > > > > [3993165.188826] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 4
> > > > > > [3993165.188907] o2dlm: Node 5 (me) is the Recovery Master for the dead node 4 in domain A895BC216BE641A8A7E20AA89D57E051
> > > > > > [3993168.551610] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
> > > > > >
> > > > > > [3996486.869628] o2dlm: Node 4 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 4 5 ) 2 nodes
> > > > > > [3996778.703664] o2dlm: Node 4 leaves domain A895BC216BE641A8A7E20AA89D57E051 ( 5 ) 1 nodes
> > > > > > [3997012.295536] o2dlm: Node 2 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 2 5 ) 2 nodes
> > > > > > [3997099.498157] o2dlm: Node 4 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 2 4 5 ) 3 nodes
> > > > > > [3997783.633140] o2dlm: Node 1 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 4 5 ) 4 nodes
> > > > > > [3997864.039868] o2dlm: Node 3 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
> > > > > >
> > > > > > Regards
> > > > > > Prabu
> > > > > >
> > > > > > _______________________________________________
> > > > > > Ocfs2-users mailing list
> > > > > > Ocfs2-users@oss.oracle.com
> > > > > > https://oss.oracle.com/mailman/listinfo/ocfs2-users
_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users