On 25/11/13 10:39, Michał Margula wrote:
> On 25.11.2013 15:44, Digimer wrote:
>> My first thought is that the network is congested. That is a lot of
>> servers to have on the system. Do you or can you isolate the corosync
>> traffic from the drbd traffic?
>>
>> Personally, I always set up a dedicated network for corosync, another
>> for drbd and a third for all traffic to/from the servers. With this, I
>> have never had a congestion-based problem.
>>
>> If possible, please paste all logs from both nodes, starting just
>> before the stonith occurred until recovery completed.
>>
>
> Hello,
>
> DRBD and CRM go over a dedicated link (two gigabit links bonded into
> one). It is never saturated or congested; it barely reaches 300 Mbps at
> its highest points. I have a separate link for traffic to/from the
> virtual machines and another separate link for managing the nodes (just
> SSH and SNMP). I can isolate corosync onto a separate link, but that
> could take some time to do.
>
> Now the logs...
>
> The trouble started on November 23 at 15:14.
> Here is the log from node "A": http://pastebin.com/yM1fqvQ6
> Node B: http://pastebin.com/nwbctcgg
>
> Node B is the one that got hit by STONITH. It was killed at 15:18:50. I
> have some trouble understanding the reasons for that.
>
> Is the reason for the STONITH that these operations took too long to
> finish?
>
> Nov 23 15:14:49 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
> operation stop[114] on XEN-piaskownica for client 9529 stayed in
> operation list for 24760 ms (longer than 10000 ms)
> Nov 23 15:14:50 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
> operation stop[115] on XEN-acsystemy01 for client 9529 stayed in
> operation list for 25760 ms (longer than 10000 ms)
> Nov 23 15:15:15 rivendell-B lrmd: [9526]: WARN: perform_ra_op: the
> operation stop[116] on XEN-frodo for client 9529 stayed in operation
> list for 50760 ms (longer than 10000 ms)
>
> But what made it stop those virtual machines in the first place?
> Another clue is here:
>
> Nov 23 15:15:43 rivendell-B lrmd: [9526]: WARN: configuration advice:
> reduce operation contention either by increasing lrmd max_children or by
> increasing intervals of monitor operations
>
> And here:
>
> coro-A.log:Nov 23 15:14:19 rivendell-A pengine: [8839]: WARN:
> unpack_rsc_op: Processing failed op primitive-LVM:1_last_failure_0 on
> rivendell-B: not running (7)
>
> But why "not running"? That is not really true. There is also some
> trouble with fencing:
>
> coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN:
> unpack_rsc_op: Processing failed op fencing-of-B_last_failure_0 on
> rivendell-A: unknown error (1)
> coro-A.log:Nov 23 15:22:25 rivendell-A pengine: [8839]: WARN:
> common_apply_stickiness: Forcing fencing-of-B away from rivendell-A
> after 1000000 failures (max=1000000)
>
> Thank you!
>
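[Editor's note: the lrmd "configuration advice" warning quoted above can usually be acted on in two ways. The snippet below is only an illustrative sketch for clusters of this vintage (heartbeat/cluster-glue lrmd plus the crm shell); the numeric values are examples, not taken from this cluster.]

    # Raise the number of operations the lrmd may run in parallel
    # (the usual default is 4; 8 here is just an example value).
    lrmadmin -p max-children 8

    # And/or lengthen the monitor interval of busy resources, e.g. the
    # XEN-frodo resource seen in the logs (interval values illustrative):
    crm configure edit XEN-frodo
    #   change:   op monitor interval="10s"
    #   to e.g.:  op monitor interval="30s"
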
I'd like to see the full logs, starting from a little before the issue
began. It looks, though, like a stop was called for whatever reason, it
failed, and so the node was fenced. That would mean that congestion, as
you suggested, is not the likely cause.

Out of curiosity, though: what bonding mode are you using? My testing
showed that only mode=1 was reliable. Since I tested, corosync has added
support for mode=0 and mode=2, but I have not re-tested them. When I did
my bonding tests, I found that all the other modes broke communication
in some manner of use or failure/recovery testing.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
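[Editor's note: for reference, a minimal sketch of the active-backup
(mode=1) bond Digimer describes as reliable, assuming Debian-style
ifupdown/ifenslave; the interface names and addresses are hypothetical
and should be adapted to the actual cluster.]

    # /etc/network/interfaces (Debian/Ubuntu ifenslave syntax)
    auto bond0
    iface bond0 inet static
        address 192.168.100.1      # hypothetical corosync/DRBD address
        netmask 255.255.255.0
        bond-slaves eth2 eth3      # hypothetical slave NICs
        bond-mode active-backup    # mode=1
        bond-miimon 100            # link-check interval in ms
        bond-primary eth2          # preferred active slave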