Andrew Beekhof wrote:
> On 4/19/07, Peter Kruse <[EMAIL PROTECTED]> wrote:
>> Andrew Beekhof wrote:
>> > then i'm afraid your use of the "dont fence nodes on startup" option
>> > has come back to haunt you
>> >
>> > beosrv-c-1 came up but was not able to find beosrv-c-2 (even though it
>> > _was_ running) and because of that option beosrv-c-1 just pretended
>> > beosrv-c-2 wasn't running and happily started activating resources.
>> >
>> > remember how we said that option wasn't a good idea :-)
>>
>> Hm, I don't understand, beosrv-c-2 fenced beosrv-c-1 in order
>> to take over. Now you say, that as soon as beosrv-c-1 came back
>> up again, it should fence beosrv-c-2, because it "thought" it
>> was not there, but it was there? How can this happen?
>
> usually an enduring communications failure (be it physical or in our
> software) but i'm no expert regarding the membership and
> communications layers
>
> But I see a lot of messages like:
> Apr 19 09:49:47 beosrv-c-1 heartbeat: [4495]: WARN: Rexmit of seq
> 3553687 requested. 141 is max.
>
> so _something_ isn't right.
>
> probably worthy of a bug report.

There have been some bugs in this code in the last year or so. I've
forgotten what they were, unfortunately.

A hint is the string "ERROR:". We don't use that word lightly. If you
get an ERROR: from one of our pieces of code, the chances are 99% that
it shouldn't _ever_ happen. Getting it hundreds of times like you did
is a really bad sign.

Apr 19 09:48:27 beosrv-c-2 heartbeat: [10763]: ERROR: Message hist queue is filling up (200 messages in queue)
Apr 19 09:48:27 beosrv-c-2 heartbeat: [10763]: ERROR: Message hist queue is filling up (200 messages in queue)

What this message normally means is that you have a half-duplex
communication failure. That is, one node can transmit but not receive,
or vice versa...

Are both systems version 2.0.5? [I'm guessing not]

Is there a chance that you installed a 2.0.5 pre-release? There was a
bug fix which went in just as 2.0.5 was coming out. There is also this
fix, which could have affected you:
http://hg.linux-ha.org/dev/rev/6b8bdf5332c3

How long was this node down? It looks to me like it had been down
either a very long time or a very short time. Which is it? If it was a
very short time, then I believe we have fixed the problem...

This particular sequence of messages is interesting...

Apr 19 09:48:31 beosrv-c-2 heartbeat: [10763]: WARN: 1 lost packet(s) for [beosrv-c-1] [17:19]
Apr 19 09:48:31 beosrv-c-2 cib: [10790]: info: mask(callbacks.c:cib_client_status_callback): Status update: Client beosrv-c-1/cib now has status [join]
Apr 19 09:48:32 beosrv-c-2 heartbeat: [10763]: info: No pkts missing from beosrv-c-1!
Apr 19 09:48:32 beosrv-c-2 heartbeat: [10763]: ERROR: Message hist queue is filling up (200 messages in queue)

Here is what these messages mean:

We received messages 17 and 19 from beosrv-c-1. We didn't receive
message 18 from beosrv-c-1. The code would then ask beosrv-c-1 to
retransmit that packet.

The CIB received a message from the CIB on beosrv-c-1, indicating that
the CIB process on beosrv-c-1 is now running.

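As a rough illustration of the gap detection behind the "lost
packet(s)" warning, here is a minimal sketch, assuming a simple
per-peer tracker of sequence numbers. GapTracker and receive are
hypothetical names made up for this example; this is not heartbeat's
actual code, which does considerably more.

```python
class GapTracker:
    """Tracks which sequence numbers from one peer are still missing.

    Hypothetical illustration only; not taken from the heartbeat source.
    """

    def __init__(self):
        self.highest_seen = None  # highest sequence number received so far
        self.missing = set()      # sequence numbers we are still waiting on

    def receive(self, seq):
        """Record packet `seq` and return the numbers to ask for again."""
        if self.highest_seen is None:
            # First packet from this peer: nothing to compare against yet.
            self.highest_seen = seq
            return []
        if seq in self.missing:
            # A retransmitted packet arrived and fills a known gap.
            self.missing.discard(seq)
            return []
        if seq <= self.highest_seen:
            # Duplicate of something we already have.
            return []
        # Everything between the old high-water mark and `seq` was skipped.
        new_gaps = list(range(self.highest_seen + 1, seq))
        self.missing.update(new_gaps)
        self.highest_seen = seq
        return new_gaps


tracker = GapTracker()
tracker.receive(17)
to_request = tracker.receive(19)  # packet 18 was skipped
print("request retransmit of:", to_request)
```

With packets 17 and 19 received, the tracker reports 18 as the one to
request; once 18 arrives, its missing set is empty again, which
corresponds to the "No pkts missing" line in the log.
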
beosrv-c-1 retransmitted packet 18. We received packet 18, and now no
packets are missing.

The "Message hist queue is filling up" message means we have sent 200
packets without receiving a flow-control ack from someone. If there
are only two nodes, that would mean beosrv-c-2.

HOWEVER, we can definitely send and receive packets to and from both
machines, as witnessed by the "lost packet" followed by the "No pkts
missing" sequence. This could not have happened if we had a
half-duplex comm failure.

I know we fixed a couple of bugs in this area, but I'm not sure when
the last one was fixed. I looked at bugzilla, and assuming a bugzilla
entry was made for every fix, I don't see an obvious fix which was
made after 2.0.5.

--
    Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
