Hi Martin,

On 11/18/15 16:31, Martin Lund wrote:
> Hello List,
>
> Earlier I was writing about my 3 node web cluster running from OCFS2. I 
> experimented with different 3.x kernels including 3.2, 3.13, 3.16 but so far 
> the 4.1.1
How did you draw this conclusion? AFAIK, an older, longer-maintained kernel 
is usually more stable.
The newest kernel version we have tested so far is 3.12.28.
> proved to be the most stable. Nevertheless it had major crashes in the last 
> couple of days again (while the R/W operations are relatively low for my 
> setup).
> I have all 3 nodes running in KVM machines on the same server, communicating 
> with each other through the libvirt-net driver (which, as far as I understand, 
> is just a memory copy; no packets get sent out on the wire, so theoretically 
> this should provide a reliable, low-latency, gigabit-class link between the VMs). Now 
> the logs suggest that the nodes lose connection between one another sometimes 
> but this might not be a network issue but something is holding the cpu (the 
> host server which is running the same 4.1.1 kernel has more than enough 
> resources, 48 CPUs + 256GB ram). The only meaningful line for me in the log 
> is:
>
> NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s!
I have no idea why CPU#0 was stuck for so long. I don't think this is a 
crash, though; it is just a stack trace being logged.
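
As a side note, the "stuck for 23s" figure lines up with the soft-lockup 
watchdog, and the "idle for 30.32 secs" o2net message matches the usual o2cb 
idle timeout, so both messages may just be reporting the same long stall. A 
rough sketch of that arithmetic (the values below are common defaults, not 
read from your system; on a real node check /proc/sys/kernel/watchdog_thresh 
and your o2cb configuration):

```shell
# Assumed defaults; verify them on your own nodes.
watchdog_thresh_s=10     # kernel.watchdog_thresh; soft lockup fires past 2x this
idle_timeout_ms=30000    # typical o2cb network idle timeout (30 s)

stall_s=23               # the stall length from your log

# A CPU stall longer than 2 * watchdog_thresh triggers the
# "BUG: soft lockup - CPU#N stuck for Ns!" message.
if [ "$stall_s" -gt $((2 * watchdog_thresh_s)) ]; then
    echo "soft lockup would fire"
fi

# A stall longer than the o2net idle timeout makes the peers
# report the connection as idle and eventually drop it.
if [ $((stall_s * 1000)) -gt "$idle_timeout_ms" ]; then
    echo "o2net would report the node idle"
else
    echo "o2net idle timeout not yet reached"
fi
```

If the stall is real, raising the o2cb idle timeout (O2CB_IDLE_TIMEOUT_MS in 
/etc/default/o2cb on Debian-style installs) only hides the symptom; the CPU 
stall itself still needs explaining.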

Thanks,
Eric
>
> Any ideas? Maybe I should use a different IO scheduler or something inside 
> the VM? Would upgrading the guest kernel from 4.1.1 to the latest stable 
> improve anything?
>
>
> Nov 14 19:05:17 webserver1 kernel: [2004352.064040] NMI watchdog: BUG: soft 
> lockup - CPU#0 stuck for 23s! [kworker/u2:1:16601]
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] Modules linked in: ocfs2 
> quota_tree nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace 
> fscache sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager 
> ocfs2_stackglue configfs loop psmouse pcspkr joydev evdev serio_raw 
> acpi_cpufreq i2c_piix4 processor i2c_core button virtio_balloon thermal_sys 
> hid_generic usbhid dm_mod ata_generic virtio_net virtio_blk uhci_hcd 
> ehci_hcd ata_piix libata usbcore virtio_pci virtio_ring virtio usb_common
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] CPU: 0 PID: 16601 Comm: 
> kworker/u2:1 Not tainted 4.1.1
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] Hardware name: Bochs 
> Bochs, BIOS Bochs 01/01/2011
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] Workqueue: o2net 
> o2net_rx_until_empty [ocfs2_nodemanager]
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] task: ffff880033084290 
> ti: ffff88002b498000 task.ti: ffff88002b498000
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] RIP: 
> 0010:[<ffffffffa0193ca1>]  [<ffffffffa0193ca1>] 
> __dlm_lookup_lockres_full+0xa3/0xe9 [ocfs2_dlm]
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] RSP: 
> 0018:ffff88002b49bc28  EFLAGS: 00000286
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] RAX: 00000000ffffffc1 
> RBX: ffffffff81547380 RCX: 0000000000000017
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] RDX: 000000000000001e 
> RSI: ffff88006e9eb029 RDI: ffff880007c447a0
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] RBP: ffff88006e9eb028 
> R08: 0000000000000066 R09: 000000000000002a
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] R10: 000000000000002a 
> R11: dead000000200200 R12: 0000000000000246
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] R13: 0000000000000050 
> R14: 0000000000000000 R15: 0000000000000000
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] FS:  
> 00007f098602e7a0(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] CS:  0010 DS: 0000 ES: 
> 0000 CR0: 000000008005003b
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] CR2: ffffffffff600400 
> CR3: 000000007ca91000 CR4: 00000000000006f0
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] Stack:
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  ffff88002b49bcf0 
> 0000000000000040 ffff880033869000 ffff88006e9eb028
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  000000000000001f 
> 00000000c6232e1b ffff88006e9eb000 ffffffffa0193d73
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  ffff88007fc00000 
> ffff88006ec768e8 0000000000000082 ffff880033869000
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] Call Trace:
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffffa0193d73>] ? 
> __dlm_lookup_lockres+0x8c/0xd3 [ocfs2_dlm]
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffffa0193df9>] ? 
> dlm_lookup_lockres+0x3f/0x5c [ocfs2_dlm]
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffffa01ae6e4>] ? 
> dlm_unlock_lock_handler+0x2af/0x663 [ocfs2_dlm]
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffffa0157315>] ? 
> o2net_handler_tree_lookup+0x5b/0xa8 [ocfs2_nodemanager]
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffffa0159552>] ? 
> o2net_rx_until_empty+0xc2c/0xc7c [ocfs2_nodemanager]
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff81001623>] ? 
> __switch_to+0x1d4/0x457
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff81065260>] ? 
> pick_next_task_fair+0x174/0x320
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff8105215e>] ? 
> process_one_work+0x179/0x283
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff81052445>] ? 
> worker_thread+0x1b8/0x292
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff8105228d>] ? 
> process_scheduled_works+0x25/0x25
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff81056175>] ? 
> kthread+0x99/0xa1
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff810560dc>] ? 
> __kthread_parkme+0x58/0x58
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff815d2b52>] ? 
> ret_from_fork+0x42/0x70
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff810560dc>] ? 
> __kthread_parkme+0x58/0x58
> Nov 14 19:05:17 webserver1 kernel: [2004352.064042] Code: c0 75 02 0f 0b 48 8b 
> 7b 10 44 89 ee 31 db e8 8d d4 ff ff 48 8b 00 48 85 c0 74 46 45 
> 8d 6c 24 ff 4c 8d 75 01 48 89 c3 48 8b 7b 18 <0f> be 45 00 0f b6 17 39 c2 75 
> 23 44 39 63 14 75 1d 48 ff c7 4c
> Nov 14 19:05:17 webserver1 kernel: [2004356.576077] o2net: Connection to node 
> webserver3 (num 2) at 10.0.0.247:7777 has been idle for 30.32 secs.
>
>
> ...
> [Tue Nov 17 12:52:30 2015] o2net: Connection to node webserver1 (num 0) at 
> 10.0.0.245:7777 shutdown, state 7
> [Tue Nov 17 12:52:32 2015] o2net: Connection to node webserver1 (num 0) at 
> 10.0.0.245:7777 shutdown, state 7
> [Tue Nov 17 12:52:34 2015] o2net: Connection to node webserver1 (num 0) at 
> 10.0.0.245:7777 shutdown, state 7
> [Tue Nov 17 12:52:36 2015] o2net: Connection to node webserver1 (num 0) at 
> 10.0.0.245:7777 shutdown, state 7
> [Tue Nov 17 12:52:38 2015] o2net: Connection to node webserver1 (num 0) at 
> 10.0.0.245:7777 shutdown, state 7
> [Tue Nov 17 12:52:40 2015] o2net: Connection to node webserver1 (num 0) at 
> 10.0.0.245:7777 shutdown, state 7
> [Tue Nov 17 12:52:42 2015] o2net: Connection to node webserver1 (num 0) at 
> 10.0.0.245:7777 shutdown, state 7
> [Tue Nov 17 12:52:44 2015] o2net: Connection to node webserver1 (num 0) at 
> 10.0.0.245:7777 shutdown, state 7
> [Tue Nov 17 12:52:46 2015] o2net: Connection to node webserver1 (num 0) at 
> 10.0.0.245:7777 shutdown, state 7
> [Tue Nov 17 12:52:46 2015] o2net: Accepted connection from node webserver3 
> (num 2) at 10.0.0.247:7777
> [Tue Nov 17 12:52:48 2015] o2net: Connected to node webserver1 (num 0) at 
> 10.0.0.245:7777
> [Tue Nov 17 12:52:49 2015] o2dlm: Joining domain 
> 1CA770B625644665B7677546DCC1211C ( 0 1 2 ) 3 nodes
> [Tue Nov 17 12:52:49 2015] ocfs2: Mounting device (252,16) on (node 1, slot 
> 1) with writeback data mode.
> [Tue Nov 17 12:52:54 2015] o2dlm: Joining domain 
> 0DE1B15CBA5340F09A7908313FCB3680 ( 0 1 2 ) 3 nodes
> [Tue Nov 17 12:52:54 2015] ocfs2: Mounting device (252,32) on (node 1, slot 
> 0) with writeback data mode.
> [Tue Nov 17 12:52:58 2015] o2dlm: Joining domain 
> 573D9BC0E98B47BE8EBB8FE7F1CB5281 ( 0 1 2 ) 3 nodes
> [Tue Nov 17 12:52:58 2015] ocfs2: Mounting device (252,48) on (node 1, slot 
> 1) with writeback data mode.
>
>
> Thank you!
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-users
>

