Hello,

We have finally created the new cluster, made up of 3 identical KVM guests:

- 8 vCPUs
- 10GB RAM per node
- Custom kernel 4.13.2OCFS
- All 3 VMs run on one Dell host server that has more than enough
  resources, so the network connection between the VMs cannot be an
  issue yet (we will move them to separate physical servers once they
  are rock solid)

It ran fine for 9 days, until today one of the webservers crashed on
OCFS2 again.

Here is the picture of the crashed server:

https://ibb.co/kxSqLm


And the log from the other nodes:

Oct 27 13:11:06 webserver2 kernel: [789844.406061] o2net: Connection to node webserver3 (num 2) at 10.0.0.247:7777 has been idle for 30.688 secs.
Oct 27 13:11:36 webserver2 kernel: [789875.125863] o2net: Connection to node webserver3 (num 2) at 10.0.0.247:7777 has been idle for 30.720 secs.
Oct 27 13:11:40 webserver2 kernel: [789878.935510] o2net: No longer connected to node webserver3 (num 2) at 10.0.0.247:7777
Oct 27 13:11:40 webserver2 kernel: [789878.935924] o2cb: o2dlm has evicted node 2 from domain 428503AACBAA492D84DFA48C5CF305B4
Oct 27 13:11:40 webserver2 kernel: [789879.050040] o2cb: o2dlm has evicted node 2 from domain E6CEF44C077640538468D6FCD1E27C5F
Oct 27 13:11:41 webserver2 kernel: [789880.245846] o2dlm: Begin recovery on domain 428503AACBAA492D84DFA48C5CF305B4 for node 2
Oct 27 13:11:41 webserver2 kernel: [789880.246863] o2dlm: Node 1 (me) is the Recovery Master for the dead node 2 in domain 428503AACBAA492D84DFA48C5CF305B4
Oct 27 13:11:41 webserver2 kernel: [789880.325817] o2dlm: End recovery on domain 428503AACBAA492D84DFA48C5CF305B4
Oct 27 13:11:42 webserver2 kernel: [789880.501802] o2dlm: Begin recovery on domain E6CEF44C077640538468D6FCD1E27C5F for node 2
Oct 27 13:11:42 webserver2 kernel: [789880.502841] o2dlm: Node 1 (me) is the Recovery Master for the dead node 2 in domain E6CEF44C077640538468D6FCD1E27C5F
Oct 27 13:11:47 webserver2 kernel: [789885.629843] o2dlm: End recovery on domain E6CEF44C077640538468D6FCD1E27C5F
Oct 27 13:11:47 webserver2 kernel: [789885.684062] ocfs2: Begin replay journal (node 2, slot 1) on device (254,64)
Oct 27 13:11:47 webserver2 kernel: [789885.707354] ocfs2: End replay journal (node 2, slot 1) on device (254,64)
Oct 27 13:11:47 webserver2 kernel: [789885.737907] ocfs2: Beginning quota recovery on device (254,64) for slot 1
Oct 27 13:11:47 webserver2 kernel: [789885.757285] ocfs2: Finishing quota recovery on device (254,64) for slot 1
Oct 27 13:19:40 webserver2 kernel: [790358.453142] php-fpm7.0      D    0  8659   8654 0x00000000
Oct 27 13:19:40 webserver2 kernel: [790358.453145] Call Trace:
Oct 27 13:19:40 webserver2 kernel: [790358.453153]  ? __schedule+0x3c8/0x860
Oct 27 13:19:40 webserver2 kernel: [790358.453155]  ? schedule+0x32/0x80
Oct 27 13:19:40 webserver2 kernel: [790358.453158]  ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:19:40 webserver2 kernel: [790358.453160]  ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.453164]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.453165]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.453167]  ? down_write+0x29/0x40
Oct 27 13:19:40 webserver2 kernel: [790358.453170]  ? path_openat+0x3dc/0x1440
Oct 27 13:19:40 webserver2 kernel: [790358.453227]  ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:19:40 webserver2 kernel: [790358.453230]  ? do_filp_open+0x99/0x110
Oct 27 13:19:40 webserver2 kernel: [790358.453232]  ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:19:40 webserver2 kernel: [790358.453234]  ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.453236]  ? __check_object_size+0xb3/0x190
Oct 27 13:19:40 webserver2 kernel: [790358.453238]  ? __alloc_fd+0x44/0x170
Oct 27 13:19:40 webserver2 kernel: [790358.453240]  ? do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.453241]  ? do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.453243]  ? entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:19:40 webserver2 kernel: [790358.455597] php-fpm7.0      D    0  8662   8654 0x00000000
Oct 27 13:19:40 webserver2 kernel: [790358.455624] Call Trace:
Oct 27 13:19:40 webserver2 kernel: [790358.455628]  ? __schedule+0x3c8/0x860
Oct 27 13:19:40 webserver2 kernel: [790358.455630]  ? schedule+0x32/0x80
Oct 27 13:19:40 webserver2 kernel: [790358.455632]  ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:19:40 webserver2 kernel: [790358.455634]  ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.455637]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.455639]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.455640]  ? down_write+0x29/0x40
Oct 27 13:19:40 webserver2 kernel: [790358.455642]  ? path_openat+0x3dc/0x1440
Oct 27 13:19:40 webserver2 kernel: [790358.455678]  ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:19:40 webserver2 kernel: [790358.455680]  ? do_filp_open+0x99/0x110
Oct 27 13:19:40 webserver2 kernel: [790358.455682]  ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:19:40 webserver2 kernel: [790358.455696]  ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.455698]  ? __check_object_size+0xb3/0x190
Oct 27 13:19:40 webserver2 kernel: [790358.455700]  ? __alloc_fd+0x44/0x170
Oct 27 13:19:40 webserver2 kernel: [790358.455702]  ? do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.455704]  ? do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.455706]  ? entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:19:40 webserver2 kernel: [790358.458274] php-fpm7.0      D    0  8700   8654 0x00000000
Oct 27 13:19:40 webserver2 kernel: [790358.458277] Call Trace:
Oct 27 13:19:40 webserver2 kernel: [790358.458280]  ? __schedule+0x3c8/0x860
Oct 27 13:19:40 webserver2 kernel: [790358.458282]  ? schedule+0x32/0x80
Oct 27 13:19:40 webserver2 kernel: [790358.458284]  ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:19:40 webserver2 kernel: [790358.458286]  ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.458289]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.458290]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.458292]  ? down_write+0x29/0x40
Oct 27 13:19:40 webserver2 kernel: [790358.458294]  ? path_openat+0x3dc/0x1440
Oct 27 13:19:40 webserver2 kernel: [790358.458330]  ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:19:40 webserver2 kernel: [790358.458332]  ? do_filp_open+0x99/0x110
Oct 27 13:19:40 webserver2 kernel: [790358.458334]  ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:19:40 webserver2 kernel: [790358.458336]  ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.458337]  ? __check_object_size+0xb3/0x190
Oct 27 13:19:40 webserver2 kernel: [790358.458339]  ? __alloc_fd+0x44/0x170
Oct 27 13:19:40 webserver2 kernel: [790358.458341]  ? do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.458342]  ? do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.458344]  ? entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:19:40 webserver2 kernel: [790358.461224] php-fpm7.0      D    0  8703   8654 0x00000000
Oct 27 13:19:40 webserver2 kernel: [790358.461226] Call Trace:
Oct 27 13:19:40 webserver2 kernel: [790358.461230]  ? __schedule+0x3c8/0x860
Oct 27 13:19:40 webserver2 kernel: [790358.461233]  ? schedule+0x32/0x80
Oct 27 13:19:40 webserver2 kernel: [790358.461235]  ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:19:40 webserver2 kernel: [790358.461237]  ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.461239]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.461241]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.461243]  ? down_write+0x29/0x40
Oct 27 13:19:40 webserver2 kernel: [790358.461245]  ? path_openat+0x3dc/0x1440
Oct 27 13:19:40 webserver2 kernel: [790358.461280]  ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:19:40 webserver2 kernel: [790358.461282]  ? do_filp_open+0x99/0x110
Oct 27 13:19:40 webserver2 kernel: [790358.461284]  ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:19:40 webserver2 kernel: [790358.461286]  ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.461287]  ? __check_object_size+0xb3/0x190
Oct 27 13:19:40 webserver2 kernel: [790358.461289]  ? __alloc_fd+0x44/0x170
Oct 27 13:19:40 webserver2 kernel: [790358.461291]  ? do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.461292]  ? do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.461294]  ? entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:21:40 webserver2 kernel: [790479.282565] php-fpm7.0      D    0  8659   8654 0x00000000
Oct 27 13:21:40 webserver2 kernel: [790479.282568] Call Trace:
Oct 27 13:21:40 webserver2 kernel: [790479.282580]  ? __schedule+0x3c8/0x860
Oct 27 13:21:40 webserver2 kernel: [790479.282583]  ? schedule+0x32/0x80
Oct 27 13:21:40 webserver2 kernel: [790479.282587]  ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:21:40 webserver2 kernel: [790479.282590]  ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.282594]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.282596]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.282598]  ? down_write+0x29/0x40
Oct 27 13:21:40 webserver2 kernel: [790479.282601]  ? path_openat+0x3dc/0x1440
Oct 27 13:21:40 webserver2 kernel: [790479.282692]  ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:21:40 webserver2 kernel: [790479.282695]  ? do_filp_open+0x99/0x110
Oct 27 13:21:40 webserver2 kernel: [790479.282698]  ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:21:40 webserver2 kernel: [790479.282700]  ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.282702]  ? __check_object_size+0xb3/0x190
Oct 27 13:21:40 webserver2 kernel: [790479.282705]  ? __alloc_fd+0x44/0x170
Oct 27 13:21:40 webserver2 kernel: [790479.282707]  ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.282709]  ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.282711]  ? entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:21:40 webserver2 kernel: [790479.282852] php-fpm7.0      D    0  8661   8654 0x00000000
Oct 27 13:21:40 webserver2 kernel: [790479.282854] Call Trace:
Oct 27 13:21:40 webserver2 kernel: [790479.282857]  ? __schedule+0x3c8/0x860
Oct 27 13:21:40 webserver2 kernel: [790479.282859]  ? schedule+0x32/0x80
Oct 27 13:21:40 webserver2 kernel: [790479.282861]  ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:21:40 webserver2 kernel: [790479.282862]  ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.282865]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.282867]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.282869]  ? down_write+0x29/0x40
Oct 27 13:21:40 webserver2 kernel: [790479.282871]  ? path_openat+0x3dc/0x1440
Oct 27 13:21:40 webserver2 kernel: [790479.282895]  ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:21:40 webserver2 kernel: [790479.282897]  ? do_filp_open+0x99/0x110
Oct 27 13:21:40 webserver2 kernel: [790479.282899]  ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:21:40 webserver2 kernel: [790479.282901]  ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.282903]  ? __check_object_size+0xb3/0x190
Oct 27 13:21:40 webserver2 kernel: [790479.282904]  ? __alloc_fd+0x44/0x170
Oct 27 13:21:40 webserver2 kernel: [790479.282906]  ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.282907]  ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.282909]  ? entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:21:40 webserver2 kernel: [790479.283060] php-fpm7.0      D    0  8662   8654 0x00000000
Oct 27 13:21:40 webserver2 kernel: [790479.283062] Call Trace:
Oct 27 13:21:40 webserver2 kernel: [790479.283065]  ? __schedule+0x3c8/0x860
Oct 27 13:21:40 webserver2 kernel: [790479.283067]  ? schedule+0x32/0x80
Oct 27 13:21:40 webserver2 kernel: [790479.283069]  ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:21:40 webserver2 kernel: [790479.283071]  ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.283073]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.283077]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.283079]  ? down_write+0x29/0x40
Oct 27 13:21:40 webserver2 kernel: [790479.283081]  ? path_openat+0x3dc/0x1440
Oct 27 13:21:40 webserver2 kernel: [790479.283109]  ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:21:40 webserver2 kernel: [790479.283111]  ? do_filp_open+0x99/0x110
Oct 27 13:21:40 webserver2 kernel: [790479.283113]  ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:21:40 webserver2 kernel: [790479.283114]  ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.283116]  ? __check_object_size+0xb3/0x190
Oct 27 13:21:40 webserver2 kernel: [790479.283118]  ? __alloc_fd+0x44/0x170
Oct 27 13:21:40 webserver2 kernel: [790479.283119]  ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.283121]  ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.283122]  ? entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:21:40 webserver2 kernel: [790479.284496] php-fpm7.0      D    0  8700   8654 0x00000000
Oct 27 13:21:40 webserver2 kernel: [790479.284499] Call Trace:
Oct 27 13:21:40 webserver2 kernel: [790479.284503]  ? __schedule+0x3c8/0x860
Oct 27 13:21:40 webserver2 kernel: [790479.284505]  ? schedule+0x32/0x80
Oct 27 13:21:40 webserver2 kernel: [790479.284507]  ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:21:40 webserver2 kernel: [790479.284509]  ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.284512]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.284514]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.284516]  ? down_write+0x29/0x40
Oct 27 13:21:40 webserver2 kernel: [790479.284518]  ? path_openat+0x3dc/0x1440
Oct 27 13:21:40 webserver2 kernel: [790479.284557]  ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:21:40 webserver2 kernel: [790479.284559]  ? do_filp_open+0x99/0x110
Oct 27 13:21:40 webserver2 kernel: [790479.284561]  ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:21:40 webserver2 kernel: [790479.284563]  ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.284565]  ? __check_object_size+0xb3/0x190
Oct 27 13:21:40 webserver2 kernel: [790479.284566]  ? __alloc_fd+0x44/0x170
Oct 27 13:21:40 webserver2 kernel: [790479.284568]  ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.284569]  ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.284571]  ? entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:21:40 webserver2 kernel: [790479.288370] php-fpm7.0      D    0  8703   8654 0x00000000
Oct 27 13:21:40 webserver2 kernel: [790479.288372] Call Trace:
Oct 27 13:21:40 webserver2 kernel: [790479.288377]  ? __schedule+0x3c8/0x860
Oct 27 13:21:40 webserver2 kernel: [790479.288380]  ? schedule+0x32/0x80
Oct 27 13:21:40 webserver2 kernel: [790479.288382]  ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:21:40 webserver2 kernel: [790479.288384]  ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.288387]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.288389]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.288392]  ? down_write+0x29/0x40
Oct 27 13:21:40 webserver2 kernel: [790479.288394]  ? path_openat+0x3dc/0x1440
Oct 27 13:21:40 webserver2 kernel: [790479.288433]  ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:21:40 webserver2 kernel: [790479.288436]  ? do_filp_open+0x99/0x110
Oct 27 13:21:40 webserver2 kernel: [790479.288439]  ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:21:40 webserver2 kernel: [790479.288440]  ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.288442]  ? __check_object_size+0xb3/0x190
Oct 27 13:21:40 webserver2 kernel: [790479.288445]  ? __alloc_fd+0x44/0x170
Oct 27 13:21:40 webserver2 kernel: [790479.288447]  ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.288449]  ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.288450]  ? entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:23:41 webserver2 kernel: [790600.113898] php-fpm7.0      D    0  8659   8654 0x00000000
Oct 27 13:23:41 webserver2 kernel: [790600.113901] Call Trace:
Oct 27 13:23:41 webserver2 kernel: [790600.113912]  ? __schedule+0x3c8/0x860
Oct 27 13:23:41 webserver2 kernel: [790600.113915]  ? schedule+0x32/0x80
Oct 27 13:23:41 webserver2 kernel: [790600.113918]  ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:23:41 webserver2 kernel: [790600.113922]  ? dput+0x2f/0x1f0
Oct 27 13:23:41 webserver2 kernel: [790600.113926]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:23:41 webserver2 kernel: [790600.113928]  ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:23:41 webserver2 kernel: [790600.113929]  ? down_write+0x29/0x40
Oct 27 13:23:41 webserver2 kernel: [790600.113933]  ? path_openat+0x3dc/0x1440
Oct 27 13:23:41 webserver2 kernel: [790600.114006]  ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:23:41 webserver2 kernel: [790600.114008]  ? do_filp_open+0x99/0x110
Oct 27 13:23:41 webserver2 kernel: [790600.114012]  ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:23:41 webserver2 kernel: [790600.114013]  ? dput+0x2f/0x1f0
Oct 27 13:23:41 webserver2 kernel: [790600.114016]  ? __check_object_size+0xb3/0x190
Oct 27 13:23:41 webserver2 kernel: [790600.114019]  ? __alloc_fd+0x44/0x170
Oct 27 13:23:41 webserver2 kernel: [790600.114021]  ? do_sys_open+0x12e/0x210
Oct 27 13:23:41 webserver2 kernel: [790600.114023]  ? do_sys_open+0x12e/0x210
Oct 27 13:23:41 webserver2 kernel: [790600.114025]  ? entry_SYSCALL_64_fastpath+0x1e/0xa9
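
To see at a glance which processes are stuck and how often they get reported, the task header lines in a log like the one above can be grouped by PID. The Python below is only a rough sketch, not a tested tool: the names `join_wrapped` and `hung_tasks` are made up for illustration, and it assumes the usual kernel task-dump column order (command, state, free stack, PID, parent PID, flags). It also re-joins lines that a mail client has wrapped, so it should work on the log whether or not it was wrapped in transit.

```python
import re
from collections import defaultdict

# A syslog record starts with a timestamp such as "Oct 27 13:19:40 ".
RECORD_START = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} ")

# Task header printed for a task in uninterruptible sleep (state D).
# Assumed column order: comm, state, free stack, pid, ppid, flags.
TASK_HEADER = re.compile(
    r"^(?P<stamp>\w{3} [ \d]\d \d{2}:\d{2}:\d{2}) \S+ kernel: \[\s*[\d.]+\] "
    r"(?P<comm>\S+)\s+D\s+\d+\s+(?P<pid>\d+)\s+(?P<ppid>\d+)\s+0x[0-9a-f]+"
)


def join_wrapped(lines):
    """Re-join records that were wrapped onto several lines."""
    records = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if RECORD_START.match(line) or not records:
            records.append(line.rstrip())
        else:
            # Continuation of the previous record.
            records[-1] = records[-1].rstrip() + " " + line.strip()
    return records


def hung_tasks(lines):
    """Map (comm, pid) to the timestamps at which the task was reported in D state."""
    seen = defaultdict(list)
    for record in join_wrapped(lines):
        match = TASK_HEADER.match(record)
        if match:
            seen[(match["comm"], int(match["pid"]))].append(match["stamp"])
    return dict(seen)


# Short excerpt from the log above, in its wrapped form.
SAMPLE = """\
Oct 27 13:19:40 webserver2 kernel: [790358.453142] php-fpm7.0      D
0  8659   8654 0x00000000
Oct 27 13:19:40 webserver2 kernel: [790358.453145] Call Trace:
Oct 27 13:21:40 webserver2 kernel: [790479.282565] php-fpm7.0      D
0  8659   8654 0x00000000
Oct 27 13:21:40 webserver2 kernel: [790479.282852] php-fpm7.0      D
0  8661   8654 0x00000000
"""

if __name__ == "__main__":
    for (comm, pid), stamps in sorted(hung_tasks(SAMPLE.splitlines()).items()):
        print(f"{comm} pid={pid}: reported {len(stamps)}x ({', '.join(stamps)})")
```

In the log above, the blocked php-fpm7.0 tasks (PIDs 8659, 8661, 8662, 8700 and 8703) all have the same parent, 8654, and every trace shows the same call chain.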


Any clues what is causing this?

Thanks!

On 2017-09-29 08:46, Gang He wrote:
> Hello netbsd,
> 
> Could you work out a way to trigger this crash in a normal ocfs2
> cluster, e.g. reproduction steps or a shell script?
> 
> Thanks
> Gang
> 
> 
>>>> 
>> Hello,
>> 
>> Find the full log below:
>> 
>> https://paste.ubuntu.com/25625787/
>> 
>> The VM was restarted at 9:27 and there has been no problem since then.
>> We are rsyncing about 2TB of data (a lot of small files) between 2
>> OCFS2 shares on the same VM:
>> 
>> 
>> /dev/vdc                      4.8T  2.8T  2.1T  58% /mnt/s1
>> /dev/vdf                      4.8T  985G  3.9T  21% /mnt/s2
>> 
>> rsync -av --numeric-ids --delete /mnt/s1/ /mnt/s2/
>> 
>> 
>> On 2017-09-27 10:53, Gang He wrote:
>>> Hello netbsd,
>>> 
>>> The ocfs2 project is still being developed by us (from SUSE, Huawei,
>>> Oracle, H3C, etc.).
>>> If you encounter a problem, please send mail to the ocfs2-devel
>>> mailing list; we usually watch that list for ocfs2 kernel-related
>>> issues.
>>> 
>>> 
>>> 
>>> 
>>>>>> 
>>>> Hello All,
>>>> 
>>>> I wrote earlier about our OCFS2 crash issue in KVM due to a bug in
>>>> the SMP code.
>>>> 
>>>> For this we came up with a solution:
>>>> 
>>>> Instead of using multiple vcpus
>>>>    <vcpu placement='static'>8</vcpu>
>>>> 
>>>> using a single one and multiple cores instead:
>>>>      <topology sockets='8' cores='8' threads='1'/>
>>>> 
>>>> And applying key tuning options in sysctl.conf:
>>>> 
>>>> vm.min_free_kbytes=131072
>>>> vm.zone_reclaim_mode=1
>>>> 
>>>> This seemed to help: the fs did not crash right away when we were
>>>> hammering it with Apache benchmarks of 10000 requests. However, last
>>>> night I started a large rsync operation from a 5TB OCFS2 FS mounted
>>>> in the VM to another OCFS2 FS mounted in the same VM and ended up
>>>> with:
>>>> 
>>>> https://ibb.co/gFeGg5
>>> From the kernel crash backtrace, the problem appears to be that
>>> taking too long to acquire a spin_lock triggers an NMI.
>>> Could you give detailed reproduction steps? We want to reproduce
>>> this issue locally and then try to fix it.
>>> 
>>> 
>>> Thanks
>>> Gang
>>> 
>>>> 
>>>> After trying a lot of different kernels, starting from the 3.x
>>>> series, we are now using the latest 4.13.2 kernel with the default
>>>> configuration, but these issues are still present. Is this OCFS2
>>>> project still being developed?
>>>> With this crashing and unreliability it cannot be used in production
>>>> unless you put in place a bunch of safeguards to reset the whole
>>>> virtual machine when it crashes.
>>>> 
>>>> Thanks
>>>> 
>>>> _______________________________________________
>>>> Ocfs2-users mailing list
>>>> Ocfs2-users@oss.oracle.com
>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-users
