Hello,

We have finally created the new cluster, 3 identical KVMs:
- 8 vCPUs
- 10 GB RAM per node
- Custom kernel 4.13.2 with OCFS2
- All 3 VMs run on a Dell host server which has more than enough resources, so the network connection between the VMs cannot be an issue yet (we will move them to separate physical servers once they become rock solid)

It ran fine for 9 days, until today one of the webservers crashed on OCFS2 again.

Here is the picture of the crashed server:
https://urldefense.proofpoint.com/v2/url?u=https-3A__ibb.co_kxSqLm&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PQcxBKCX5YTpkKY057SbK10&r=QxGl6UoyzTJm_1fAz5ZR9izvWJhWcqbtYn-0afBpa7A&m=LIe0FuKdHS00KQDpalNr3sC8x4IUbJAxr9ZbKkaVVRU&s=CU8iQ7bMjz3onn2KVgChw_n06syWA6OAYpbd1hl6mfw&e=

And the log from the other nodes:

Oct 27 13:11:06 webserver2 kernel: [789844.406061] o2net: Connection to node webserver3 (num 2) at 10.0.0.247:7777 has been idle for 30.688 secs.
Oct 27 13:11:36 webserver2 kernel: [789875.125863] o2net: Connection to node webserver3 (num 2) at 10.0.0.247:7777 has been idle for 30.720 secs.
Oct 27 13:11:40 webserver2 kernel: [789878.935510] o2net: No longer connected to node webserver3 (num 2) at 10.0.0.247:7777
Oct 27 13:11:40 webserver2 kernel: [789878.935924] o2cb: o2dlm has evicted node 2 from domain 428503AACBAA492D84DFA48C5CF305B4
Oct 27 13:11:40 webserver2 kernel: [789879.050040] o2cb: o2dlm has evicted node 2 from domain E6CEF44C077640538468D6FCD1E27C5F
Oct 27 13:11:41 webserver2 kernel: [789880.245846] o2dlm: Begin recovery on domain 428503AACBAA492D84DFA48C5CF305B4 for node 2
Oct 27 13:11:41 webserver2 kernel: [789880.246863] o2dlm: Node 1 (me) is the Recovery Master for the dead node 2 in domain 428503AACBAA492D84DFA48C5CF305B4
Oct 27 13:11:41 webserver2 kernel: [789880.325817] o2dlm: End recovery on domain 428503AACBAA492D84DFA48C5CF305B4
Oct 27 13:11:42 webserver2 kernel: [789880.501802] o2dlm: Begin recovery on domain E6CEF44C077640538468D6FCD1E27C5F for node 2
Oct 27 13:11:42 webserver2 kernel: [789880.502841] o2dlm: Node 1 (me) is the Recovery Master for the dead node 2 in domain E6CEF44C077640538468D6FCD1E27C5F
Oct 27 13:11:47 webserver2 kernel: [789885.629843] o2dlm: End recovery on domain E6CEF44C077640538468D6FCD1E27C5F
Oct 27 13:11:47 webserver2 kernel: [789885.684062] ocfs2: Begin replay journal (node 2, slot 1) on device (254,64)
Oct 27 13:11:47 webserver2 kernel: [789885.707354] ocfs2: End replay journal (node 2, slot 1) on device (254,64)
Oct 27 13:11:47 webserver2 kernel: [789885.737907] ocfs2: Beginning quota recovery on device (254,64) for slot 1
Oct 27 13:11:47 webserver2 kernel: [789885.757285] ocfs2: Finishing quota recovery on device (254,64) for slot 1
Oct 27 13:19:40 webserver2 kernel: [790358.453142] php-fpm7.0 D 0 8659 8654 0x00000000
Oct 27 13:19:40 webserver2 kernel: [790358.453145] Call Trace:
Oct 27 13:19:40 webserver2 kernel: [790358.453153] ? __schedule+0x3c8/0x860
Oct 27 13:19:40 webserver2 kernel: [790358.453155] ? schedule+0x32/0x80
Oct 27 13:19:40 webserver2 kernel: [790358.453158] ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:19:40 webserver2 kernel: [790358.453160] ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.453164] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.453165] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.453167] ? down_write+0x29/0x40
Oct 27 13:19:40 webserver2 kernel: [790358.453170] ? path_openat+0x3dc/0x1440
Oct 27 13:19:40 webserver2 kernel: [790358.453227] ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:19:40 webserver2 kernel: [790358.453230] ? do_filp_open+0x99/0x110
Oct 27 13:19:40 webserver2 kernel: [790358.453232] ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:19:40 webserver2 kernel: [790358.453234] ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.453236] ? __check_object_size+0xb3/0x190
Oct 27 13:19:40 webserver2 kernel: [790358.453238] ? __alloc_fd+0x44/0x170
Oct 27 13:19:40 webserver2 kernel: [790358.453240] ? do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.453241] ? do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.453243] ? entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:19:40 webserver2 kernel: [790358.455597] php-fpm7.0 D 0 8662 8654 0x00000000
Oct 27 13:19:40 webserver2 kernel: [790358.455624] Call Trace:
Oct 27 13:19:40 webserver2 kernel: [790358.455628] ? __schedule+0x3c8/0x860
Oct 27 13:19:40 webserver2 kernel: [790358.455630] ? schedule+0x32/0x80
Oct 27 13:19:40 webserver2 kernel: [790358.455632] ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:19:40 webserver2 kernel: [790358.455634] ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.455637] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.455639] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.455640] ? down_write+0x29/0x40
Oct 27 13:19:40 webserver2 kernel: [790358.455642] ? path_openat+0x3dc/0x1440
Oct 27 13:19:40 webserver2 kernel: [790358.455678] ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:19:40 webserver2 kernel: [790358.455680] ? do_filp_open+0x99/0x110
Oct 27 13:19:40 webserver2 kernel: [790358.455682] ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:19:40 webserver2 kernel: [790358.455696] ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.455698] ? __check_object_size+0xb3/0x190
Oct 27 13:19:40 webserver2 kernel: [790358.455700] ? __alloc_fd+0x44/0x170
Oct 27 13:19:40 webserver2 kernel: [790358.455702] ? do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.455704] ? do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.455706] ? entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:19:40 webserver2 kernel: [790358.458274] php-fpm7.0 D 0 8700 8654 0x00000000
Oct 27 13:19:40 webserver2 kernel: [790358.458277] Call Trace:
Oct 27 13:19:40 webserver2 kernel: [790358.458280] ? __schedule+0x3c8/0x860
Oct 27 13:19:40 webserver2 kernel: [790358.458282] ? schedule+0x32/0x80
Oct 27 13:19:40 webserver2 kernel: [790358.458284] ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:19:40 webserver2 kernel: [790358.458286] ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.458289] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.458290] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.458292] ? down_write+0x29/0x40
Oct 27 13:19:40 webserver2 kernel: [790358.458294] ? path_openat+0x3dc/0x1440
Oct 27 13:19:40 webserver2 kernel: [790358.458330] ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:19:40 webserver2 kernel: [790358.458332] ? do_filp_open+0x99/0x110
Oct 27 13:19:40 webserver2 kernel: [790358.458334] ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:19:40 webserver2 kernel: [790358.458336] ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.458337] ? __check_object_size+0xb3/0x190
Oct 27 13:19:40 webserver2 kernel: [790358.458339] ? __alloc_fd+0x44/0x170
Oct 27 13:19:40 webserver2 kernel: [790358.458341] ? do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.458342] ? do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.458344] ? entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:19:40 webserver2 kernel: [790358.461224] php-fpm7.0 D 0 8703 8654 0x00000000
Oct 27 13:19:40 webserver2 kernel: [790358.461226] Call Trace:
Oct 27 13:19:40 webserver2 kernel: [790358.461230] ? __schedule+0x3c8/0x860
Oct 27 13:19:40 webserver2 kernel: [790358.461233] ? schedule+0x32/0x80
Oct 27 13:19:40 webserver2 kernel: [790358.461235] ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:19:40 webserver2 kernel: [790358.461237] ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.461239] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.461241] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.461243] ? down_write+0x29/0x40
Oct 27 13:19:40 webserver2 kernel: [790358.461245] ? path_openat+0x3dc/0x1440
Oct 27 13:19:40 webserver2 kernel: [790358.461280] ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:19:40 webserver2 kernel: [790358.461282] ? do_filp_open+0x99/0x110
Oct 27 13:19:40 webserver2 kernel: [790358.461284] ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:19:40 webserver2 kernel: [790358.461286] ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.461287] ? __check_object_size+0xb3/0x190
Oct 27 13:19:40 webserver2 kernel: [790358.461289] ? __alloc_fd+0x44/0x170
Oct 27 13:19:40 webserver2 kernel: [790358.461291] ? do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.461292] ? do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.461294] ? entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:21:40 webserver2 kernel: [790479.282565] php-fpm7.0 D 0 8659 8654 0x00000000
Oct 27 13:21:40 webserver2 kernel: [790479.282568] Call Trace:
Oct 27 13:21:40 webserver2 kernel: [790479.282580] ? __schedule+0x3c8/0x860
Oct 27 13:21:40 webserver2 kernel: [790479.282583] ? schedule+0x32/0x80
Oct 27 13:21:40 webserver2 kernel: [790479.282587] ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:21:40 webserver2 kernel: [790479.282590] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.282594] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.282596] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.282598] ? down_write+0x29/0x40
Oct 27 13:21:40 webserver2 kernel: [790479.282601] ? path_openat+0x3dc/0x1440
Oct 27 13:21:40 webserver2 kernel: [790479.282692] ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:21:40 webserver2 kernel: [790479.282695] ? do_filp_open+0x99/0x110
Oct 27 13:21:40 webserver2 kernel: [790479.282698] ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:21:40 webserver2 kernel: [790479.282700] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.282702] ? __check_object_size+0xb3/0x190
Oct 27 13:21:40 webserver2 kernel: [790479.282705] ? __alloc_fd+0x44/0x170
Oct 27 13:21:40 webserver2 kernel: [790479.282707] ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.282709] ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.282711] ? entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:21:40 webserver2 kernel: [790479.282852] php-fpm7.0 D 0 8661 8654 0x00000000
Oct 27 13:21:40 webserver2 kernel: [790479.282854] Call Trace:
Oct 27 13:21:40 webserver2 kernel: [790479.282857] ? __schedule+0x3c8/0x860
Oct 27 13:21:40 webserver2 kernel: [790479.282859] ? schedule+0x32/0x80
Oct 27 13:21:40 webserver2 kernel: [790479.282861] ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:21:40 webserver2 kernel: [790479.282862] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.282865] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.282867] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.282869] ? down_write+0x29/0x40
Oct 27 13:21:40 webserver2 kernel: [790479.282871] ? path_openat+0x3dc/0x1440
Oct 27 13:21:40 webserver2 kernel: [790479.282895] ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:21:40 webserver2 kernel: [790479.282897] ? do_filp_open+0x99/0x110
Oct 27 13:21:40 webserver2 kernel: [790479.282899] ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:21:40 webserver2 kernel: [790479.282901] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.282903] ? __check_object_size+0xb3/0x190
Oct 27 13:21:40 webserver2 kernel: [790479.282904] ? __alloc_fd+0x44/0x170
Oct 27 13:21:40 webserver2 kernel: [790479.282906] ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.282907] ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.282909] ? entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:21:40 webserver2 kernel: [790479.283060] php-fpm7.0 D 0 8662 8654 0x00000000
Oct 27 13:21:40 webserver2 kernel: [790479.283062] Call Trace:
Oct 27 13:21:40 webserver2 kernel: [790479.283065] ? __schedule+0x3c8/0x860
Oct 27 13:21:40 webserver2 kernel: [790479.283067] ? schedule+0x32/0x80
Oct 27 13:21:40 webserver2 kernel: [790479.283069] ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:21:40 webserver2 kernel: [790479.283071] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.283073] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.283077] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.283079] ? down_write+0x29/0x40
Oct 27 13:21:40 webserver2 kernel: [790479.283081] ? path_openat+0x3dc/0x1440
Oct 27 13:21:40 webserver2 kernel: [790479.283109] ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:21:40 webserver2 kernel: [790479.283111] ? do_filp_open+0x99/0x110
Oct 27 13:21:40 webserver2 kernel: [790479.283113] ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:21:40 webserver2 kernel: [790479.283114] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.283116] ? __check_object_size+0xb3/0x190
Oct 27 13:21:40 webserver2 kernel: [790479.283118] ? __alloc_fd+0x44/0x170
Oct 27 13:21:40 webserver2 kernel: [790479.283119] ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.283121] ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.283122] ? entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:21:40 webserver2 kernel: [790479.284496] php-fpm7.0 D 0 8700 8654 0x00000000
Oct 27 13:21:40 webserver2 kernel: [790479.284499] Call Trace:
Oct 27 13:21:40 webserver2 kernel: [790479.284503] ? __schedule+0x3c8/0x860
Oct 27 13:21:40 webserver2 kernel: [790479.284505] ? schedule+0x32/0x80
Oct 27 13:21:40 webserver2 kernel: [790479.284507] ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:21:40 webserver2 kernel: [790479.284509] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.284512] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.284514] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.284516] ? down_write+0x29/0x40
Oct 27 13:21:40 webserver2 kernel: [790479.284518] ? path_openat+0x3dc/0x1440
Oct 27 13:21:40 webserver2 kernel: [790479.284557] ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:21:40 webserver2 kernel: [790479.284559] ? do_filp_open+0x99/0x110
Oct 27 13:21:40 webserver2 kernel: [790479.284561] ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:21:40 webserver2 kernel: [790479.284563] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.284565] ? __check_object_size+0xb3/0x190
Oct 27 13:21:40 webserver2 kernel: [790479.284566] ? __alloc_fd+0x44/0x170
Oct 27 13:21:40 webserver2 kernel: [790479.284568] ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.284569] ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.284571] ? entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:21:40 webserver2 kernel: [790479.288370] php-fpm7.0 D 0 8703 8654 0x00000000
Oct 27 13:21:40 webserver2 kernel: [790479.288372] Call Trace:
Oct 27 13:21:40 webserver2 kernel: [790479.288377] ? __schedule+0x3c8/0x860
Oct 27 13:21:40 webserver2 kernel: [790479.288380] ? schedule+0x32/0x80
Oct 27 13:21:40 webserver2 kernel: [790479.288382] ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:21:40 webserver2 kernel: [790479.288384] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.288387] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.288389] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.288392] ? down_write+0x29/0x40
Oct 27 13:21:40 webserver2 kernel: [790479.288394] ? path_openat+0x3dc/0x1440
Oct 27 13:21:40 webserver2 kernel: [790479.288433] ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:21:40 webserver2 kernel: [790479.288436] ? do_filp_open+0x99/0x110
Oct 27 13:21:40 webserver2 kernel: [790479.288439] ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:21:40 webserver2 kernel: [790479.288440] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.288442] ? __check_object_size+0xb3/0x190
Oct 27 13:21:40 webserver2 kernel: [790479.288445] ? __alloc_fd+0x44/0x170
Oct 27 13:21:40 webserver2 kernel: [790479.288447] ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.288449] ? do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.288450] ? entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:23:41 webserver2 kernel: [790600.113898] php-fpm7.0 D 0 8659 8654 0x00000000
Oct 27 13:23:41 webserver2 kernel: [790600.113901] Call Trace:
Oct 27 13:23:41 webserver2 kernel: [790600.113912] ? __schedule+0x3c8/0x860
Oct 27 13:23:41 webserver2 kernel: [790600.113915] ? schedule+0x32/0x80
Oct 27 13:23:41 webserver2 kernel: [790600.113918] ? rwsem_down_write_failed+0x232/0x410
Oct 27 13:23:41 webserver2 kernel: [790600.113922] ? dput+0x2f/0x1f0
Oct 27 13:23:41 webserver2 kernel: [790600.113926] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:23:41 webserver2 kernel: [790600.113928] ? call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:23:41 webserver2 kernel: [790600.113929] ? down_write+0x29/0x40
Oct 27 13:23:41 webserver2 kernel: [790600.113933] ? path_openat+0x3dc/0x1440
Oct 27 13:23:41 webserver2 kernel: [790600.114006] ? ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:23:41 webserver2 kernel: [790600.114008] ? do_filp_open+0x99/0x110
Oct 27 13:23:41 webserver2 kernel: [790600.114012] ? kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:23:41 webserver2 kernel: [790600.114013] ? dput+0x2f/0x1f0
Oct 27 13:23:41 webserver2 kernel: [790600.114016] ? __check_object_size+0xb3/0x190
Oct 27 13:23:41 webserver2 kernel: [790600.114019] ? __alloc_fd+0x44/0x170
Oct 27 13:23:41 webserver2 kernel: [790600.114021] ? do_sys_open+0x12e/0x210
Oct 27 13:23:41 webserver2 kernel: [790600.114023] ? do_sys_open+0x12e/0x210
Oct 27 13:23:41 webserver2 kernel: [790600.114025] ? entry_SYSCALL_64_fastpath+0x1e/0xa9

Any clues what is causing this?

Thanks!

On 2017-09-29 08:46, Gang He wrote:
> Hello netbsd,
>
> Could you conclude to a way to trigger this crash happen in a normal
> ocfs2 cluster?
> e.g. reproduce steps, or a shell script.
>
> Thanks
> Gang
>
>
>>>>
>> Hello,
>>
>> Find the full log below:
>>
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__paste.ubuntu.com_25625787_&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PQcxBKCX5YTpkKY057SbK10&r=QxGl6UoyzTJm_1fAz5ZR9izvWJhWcqbtYn-0afBpa7A&m=LIe0FuKdHS00KQDpalNr3sC8x4IUbJAxr9ZbKkaVVRU&s=VPI6eV6Mfe3WqNRd1ik-Qgx2TrRcv_1mICCopkeXvm4&e=
>>
>>
>> VM was restarted at 9:27 and no problem since then.
>> We are rsyncing about 2TB of data (a lot of small files) between
>> 2 OCFS2 shares on the same VM:
>>
>>
>> /dev/vdc        4.8T  2.8T  2.1T  58% /mnt/s1
>> /dev/vdf        4.8T  985G  3.9T  21% /mnt/s2
>>
>> rsync -av --numeric-ids --delete /mnt/s1/ /mnt/s2/
>>
>>
>> On 2017-09-27 10:53, Gang He wrote:
>>> Hello netbsd,
>>>
>>> The ocfs2 project is still being developed by us (from SUSE, Huawei,
>>> Oracle, H3C, etc.).
>>> If you encounter a problem, please send a mail to the ocfs2-devel
>>> mailing list; we usually watch that list for ocfs2 kernel related
>>> issues.
>>>
>>>
>>>
>>>
>>>>>>
>>>> Hello All,
>>>>
>>>> I wrote earlier about our OCFS2 crash issue in KVM due to a bug in
>>>> the SMP code.
>>>>
>>>> For this we came up with a solution:
>>>>
>>>> Instead of using multiple vcpus
>>>> <vcpu placement='static'>8</vcpu>
>>>>
>>>> using a single one and multiple cores instead:
>>>> <topology sockets='8' cores='8' threads='1'/>
>>>>
>>>> And applying key tuning options to sysctl.conf:
>>>>
>>>> vm.min_free_kbytes=131072
>>>> vm.zone_reclaim_mode=1
>>>>
>>>> This seemed to help: the fs did not crash right away when we were
>>>> hammering it with Apache benchmarks of 10000 requests. However, last
>>>> night I started a large rsync operation from a 5TB OCFS2 FS mounted
>>>> in the VM to another OCFS2 FS mounted in the same VM and ended up
>>>> with:
>>>>
>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__ibb.co_gFeGg5&d=DwICAg&c=RoP1YumCXCgaWHvlZYR8PQcxBKCX5YTpkKY057SbK10&r=QxGl6UoyzTJm_1fAz5ZR9izvWJhWcqbtYn-0afBpa7A&m=cYprGRHz-oQmhnx4HIke8sTdCG_tf8Jb-rF6sHnYLnk&s=ajWfQIlUZOpElFWxoKcmvTIk7J3PpuCJITcnXfJQHrc&e=
>>> From the kernel crash backtrace, the problem should be that taking
>>> a long time to acquire a spin_lock triggers an NMI.
>>> Could you give detailed steps to reproduce it? We want to reproduce
>>> this issue locally and then try to fix it.
>>>
>>>
>>> Thanks
>>> Gang
>>>
>>>>
>>>> After trying a lot of different kernels, starting from the 3.x
>>>> series, we are now using the latest 4.13.2 kernel with the default
>>>> configuration, but these issues are still present. Is this OCFS2
>>>> project still being developed?
>>>> With this crashing and unreliability it cannot be used in production
>>>> unless you put in place a bunch of safeguards to reset the whole
>>>> virtual machine when it crashes.
>>>>
>>>> Thanks
>>>>
>>>> _______________________________________________
>>>> Ocfs2-users mailing list
>>>> Ocfs2-users@oss.oracle.com
>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-users