Hi Martin,

On 11/18/15 16:31, Martin Lund wrote:
> Hello List,
>
> Earlier I was writing about my 3-node web cluster running from OCFS2. I
> experimented with different 3.x kernels, including 3.2, 3.13, and 3.16,
> but so far 4.1.1

How do you draw this conclusion? AFAIK, a lower version is normally more
likely to be stable. The newest kernel version we have tested so far is
3.12.28.

> proved to be the most stable. Nevertheless, it had major crashes again in
> the last couple of days (while the R/W load is relatively low for my
> setup).
> I have all 3 nodes running in KVM machines on the same server,
> communicating with each other through the libvirt-net driver (which, as
> far as I understand, is just a memory copy; no packets are sent out on
> the wire, so in theory this should provide a reliable, low-latency,
> gigabit-class link between the VMs). The logs suggest that the nodes
> sometimes lose the connection to one another, but this might not be a
> network issue; something may be holding the CPU (the host server, which
> runs the same 4.1.1 kernel, has more than enough resources: 48 CPUs +
> 256 GB RAM). The only meaningful line for me in the log is:
>
> NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s!

I have no idea why CPU#0 was stuck for so long. I think this is not a
crash in itself, just a warning with a stack trace, logged because the
CPU spent too long in the kernel without rescheduling.
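For reference, the soft-lockup detector fires when a CPU stays in the
kernel without rescheduling for roughly twice kernel.watchdog_thresh
seconds (2 x 10s by default, which lines up with the "stuck for 23s" in
your log). If the warnings are drowning out other messages while you
debug, you can raise that threshold. A minimal sketch, assuming the
standard procfs knob (the helper names are mine):

#!/usr/bin/env python3
# Inspect/raise the soft-lockup watchdog threshold; writing needs root.
from pathlib import Path

THRESH = Path("/proc/sys/kernel/watchdog_thresh")

def read_thresh() -> int:
    # Threshold in seconds; the soft-lockup warning fires after ~2x this.
    return int(THRESH.read_text())

def raise_thresh(seconds: int) -> None:
    # e.g. raise_thresh(30) to tolerate ~60s stalls while debugging
    THRESH.write_text(f"{seconds}\n")

if __name__ == "__main__":
    t = read_thresh()
    print(f"watchdog_thresh = {t}s (soft lockup reported after ~{2 * t}s)")

Raising it only hides the symptom, of course; the interesting question is
why __dlm_lookup_lockres_full spins for that long.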
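Also, the connection drops you mention may be a symptom rather than the
cause: the stuck task in your trace is the o2net worker
(o2net_rx_until_empty), so while CPU#0 spins in __dlm_lookup_lockres_full
the node cannot service cluster messages, and its peers eventually hit the
30-second o2net idle timeout you quote below. To rule out a simple timing
mismatch, it is worth checking that the o2net knobs are identical on all
three nodes. A rough sketch, assuming the usual o2cb configfs layout under
/sys/kernel/config/cluster (verify the attribute names on your
distribution):

#!/usr/bin/env python3
# Dump the o2net timing knobs for every configured o2cb cluster.
from pathlib import Path

CLUSTERS = Path("/sys/kernel/config/cluster")
KNOBS = ("idle_timeout_ms", "keepalive_delay_ms", "reconnect_delay_ms")

for cluster in sorted(CLUSTERS.iterdir()):
    print(f"cluster {cluster.name}:")
    for knob in KNOBS:
        attr = cluster / knob
        if attr.is_file():
            print(f"  {knob} = {attr.read_text().strip()}")

These values have to match on every node; they are normally set through
the o2cb init configuration (e.g. /etc/default/o2cb or
/etc/sysconfig/o2cb) rather than written into configfs by hand.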
Thanks,
Eric

> Any ideas? Maybe I should use a different I/O scheduler or something
> inside the VM? Would upgrading the guest kernel from 4.1.1 to the latest
> stable improve anything?
>
> Nov 14 19:05:17 webserver1 kernel: [2004352.064040] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u2:1:16601]
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] Modules linked in: ocfs2 quota_tree nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs loop psmouse pcspkr joydev evdev serio_raw acpi_cpufreq i2c_piix4 processor i2c_core button virtio_balloon thermal_sys hid_generic usbhid dm_mod ata_generic virtio_net virtio_blk uhci_hcd ehci_hcd ata_piix libata usbcore virtio_pci virtio_ring virtio usb_common
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] CPU: 0 PID: 16601 Comm: kworker/u2:1 Not tainted 4.1.1
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] Workqueue: o2net o2net_rx_until_empty [ocfs2_nodemanager]
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] task: ffff880033084290 ti: ffff88002b498000 task.ti: ffff88002b498000
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] RIP: 0010:[<ffffffffa0193ca1>]  [<ffffffffa0193ca1>] __dlm_lookup_lockres_full+0xa3/0xe9 [ocfs2_dlm]
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] RSP: 0018:ffff88002b49bc28  EFLAGS: 00000286
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] RAX: 00000000ffffffc1 RBX: ffffffff81547380 RCX: 0000000000000017
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] RDX: 000000000000001e RSI: ffff88006e9eb029 RDI: ffff880007c447a0
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] RBP: ffff88006e9eb028 R08: 0000000000000066 R09: 000000000000002a
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] R10: 000000000000002a R11: dead000000200200 R12: 0000000000000246
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] R13: 0000000000000050 R14: 0000000000000000 R15: 0000000000000000
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] FS:  00007f098602e7a0(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] CR2: ffffffffff600400 CR3: 000000007ca91000 CR4: 00000000000006f0
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] Stack:
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  ffff88002b49bcf0 0000000000000040 ffff880033869000 ffff88006e9eb028
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  000000000000001f 00000000c6232e1b ffff88006e9eb000 ffffffffa0193d73
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  ffff88007fc00000 ffff88006ec768e8 0000000000000082 ffff880033869000
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] Call Trace:
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffffa0193d73>] ? __dlm_lookup_lockres+0x8c/0xd3 [ocfs2_dlm]
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffffa0193df9>] ? dlm_lookup_lockres+0x3f/0x5c [ocfs2_dlm]
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffffa01ae6e4>] ? dlm_unlock_lock_handler+0x2af/0x663 [ocfs2_dlm]
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffffa0157315>] ? o2net_handler_tree_lookup+0x5b/0xa8 [ocfs2_nodemanager]
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffffa0159552>] ? o2net_rx_until_empty+0xc2c/0xc7c [ocfs2_nodemanager]
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff81001623>] ? __switch_to+0x1d4/0x457
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff81065260>] ? pick_next_task_fair+0x174/0x320
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff8105215e>] ? process_one_work+0x179/0x283
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff81052445>] ? worker_thread+0x1b8/0x292
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff8105228d>] ? process_scheduled_works+0x25/0x25
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff81056175>] ? kthread+0x99/0xa1
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff810560dc>] ? __kthread_parkme+0x58/0x58
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff815d2b52>] ? ret_from_fork+0x42/0x70
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff810560dc>] ? __kthread_parkme+0x58/0x58
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] Code: c0 75 02 0f 0b 48 8b 7b 10 44 89 ee 31 db e8 8d d4 ff ff 48 8b 00 48 85 c0 74 46 45 8d 6c 24 ff 4c 8d 75 01 48 89 c3 48 8b 7b 18 <0f> be 45 00 0f b6 17 39 c2 75 23 44 39 63 14 75 1d 48 ff c7 4c
> Nov 14 19:05:17 webserver1 kernel: [2004356.576077] o2net: Connection to node webserver3 (num 2) at 10.0.0.247:7777 has been idle for 30.32 secs.
>
> ...
> [Tue Nov 17 12:52:30 2015] o2net: Connection to node webserver1 (num 0) at 10.0.0.245:7777 shutdown, state 7
> [Tue Nov 17 12:52:32 2015] o2net: Connection to node webserver1 (num 0) at 10.0.0.245:7777 shutdown, state 7
> [Tue Nov 17 12:52:34 2015] o2net: Connection to node webserver1 (num 0) at 10.0.0.245:7777 shutdown, state 7
> [Tue Nov 17 12:52:36 2015] o2net: Connection to node webserver1 (num 0) at 10.0.0.245:7777 shutdown, state 7
> [Tue Nov 17 12:52:38 2015] o2net: Connection to node webserver1 (num 0) at 10.0.0.245:7777 shutdown, state 7
> [Tue Nov 17 12:52:40 2015] o2net: Connection to node webserver1 (num 0) at 10.0.0.245:7777 shutdown, state 7
> [Tue Nov 17 12:52:42 2015] o2net: Connection to node webserver1 (num 0) at 10.0.0.245:7777 shutdown, state 7
> [Tue Nov 17 12:52:44 2015] o2net: Connection to node webserver1 (num 0) at 10.0.0.245:7777 shutdown, state 7
> [Tue Nov 17 12:52:46 2015] o2net: Connection to node webserver1 (num 0) at 10.0.0.245:7777 shutdown, state 7
> [Tue Nov 17 12:52:46 2015] o2net: Accepted connection from node webserver3 (num 2) at 10.0.0.247:7777
> [Tue Nov 17 12:52:48 2015] o2net: Connected to node webserver1 (num 0) at 10.0.0.245:7777
> [Tue Nov 17 12:52:49 2015] o2dlm: Joining domain 1CA770B625644665B7677546DCC1211C ( 0 1 2 ) 3 nodes
> [Tue Nov 17 12:52:49 2015] ocfs2: Mounting device (252,16) on (node 1, slot 1) with writeback data mode.
> [Tue Nov 17 12:52:54 2015] o2dlm: Joining domain 0DE1B15CBA5340F09A7908313FCB3680 ( 0 1 2 ) 3 nodes
> [Tue Nov 17 12:52:54 2015] ocfs2: Mounting device (252,32) on (node 1, slot 0) with writeback data mode.
> [Tue Nov 17 12:52:58 2015] o2dlm: Joining domain 573D9BC0E98B47BE8EBB8FE7F1CB5281 ( 0 1 2 ) 3 nodes
> [Tue Nov 17 12:52:58 2015] ocfs2: Mounting device (252,48) on (node 1, slot 1) with writeback data mode.
>
> Thank you!
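P.S. On the I/O scheduler question above: for virtio disks inside a guest
the usual advice is the simplest elevator (noop, or "none" on newer
kernels), letting the host do the real scheduling, although I would not
expect that alone to cure a DLM soft lockup. A quick sketch to see what
each device is using, assuming the standard sysfs interface:

#!/usr/bin/env python3
# Print the elevator for every block device; the active scheduler
# is the one shown in [brackets].
from pathlib import Path

for sched in sorted(Path("/sys/block").glob("*/queue/scheduler")):
    dev = sched.parent.parent.name  # e.g. "vda" for a virtio disk
    print(f"{dev}: {sched.read_text().strip()}")

To switch, write the scheduler name back into the same sysfs file for the
device in question (vda here is only an example).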
_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users