I'm 99% sure this is something we've fixed since 2.0.8 (which was some time ago now).
Lars has posted RPMs of the latest Heartbeat RC for SP2 at:
http://download.opensuse.org/repositories/home:/LarsMB/SLE_10/

You could try those or, if that makes you uncomfortable, contact Novell
support directly and indicate that the issue is more than likely already
fixed upstream.

On Tue, Apr 15, 2008 at 9:18 PM, Mario Peter <[EMAIL PROTECTED]> wrote:
> Hi everybody,
>
> I have a strange problem with our cluster configuration (SLES10,
> 2.6.16.46-0.12-smp i686, heartbeat 2.0.8). There are 3 separate cluster
> pairs on an ESX (VMware) server cluster. Each node of a cluster resides
> on an ESX server and is connected through 3 NICs (2 client networks and
> one separate heartbeat cable). Everything went fine for half a year;
> however, once or twice a year all cluster nodes on one ESX server side
> seem to freeze and the cluster nodes fail over.
>
> However, this time we ran into problems. We had some kind of split-brain
> behavior, and when I noticed this after 3 hours, I reset all nodes on
> the first (previously frozen) ESX server. And now the nodes coming back
> up won't connect to the running node anymore. I attached the logs from one
> of the smaller cluster configs - umd160 (rebooted) and umd161 (still
> alive). The action began at 2008/04/15_05:46.
>
> This is what crm_mon says about the state right now:
>
> ,----[ rebooted umd160: crm_mon -1r ]
> | ============
> | Last updated: Tue Apr 15 15:46:27 2008
> | Current DC: umd161 (a9f89d0e-678c-42da-8e25-7cfb2ecff774)
> | 2 Nodes configured.
> | 3 Resources configured.
> | ============
> |
> | Node: umd161 (a9f89d0e-678c-42da-8e25-7cfb2ecff774): OFFLINE
> | Node: umd160 (e4c1054d-b8c4-43dd-8827-66c5876b6cf1): OFFLINE
> |
> | Full list of resources:
> |
> | Resource Group: rg_ip_saprsireo
> |     rs_ip_saprsireo_d  (heartbeat::ocf:IPaddr):  Stopped
> |     rs_ip_saprsireo_k  (heartbeat::ocf:IPaddr):  Stopped
> | Resource Group: rg_ip_saprftp
> |     rs_ip_saprftp_k    (heartbeat::ocf:IPaddr):  Stopped
> | Resource Group: rg_ip_saprproxy
> |     rs_ip_saprproxy_k  (heartbeat::ocf:IPaddr):  Stopped
> |
> `----
>
> ,----[ running umd161: crm_mon -1r ]
> | ============
> | Last updated: Tue Apr 15 15:50:24 2008
> | Current DC: umd160 (e4c1054d-b8c4-43dd-8827-66c5876b6cf1)
> | 2 Nodes configured.
> | 3 Resources configured.
> | ============
> |
> | Node: umd161 (a9f89d0e-678c-42da-8e25-7cfb2ecff774): online
> | Node: umd160 (e4c1054d-b8c4-43dd-8827-66c5876b6cf1): online
> |
> | Full list of resources:
> |
> | Resource Group: rg_ip_saprsireo
> |     rs_ip_saprsireo_d  (heartbeat::ocf:IPaddr):  Started umd161
> |     rs_ip_saprsireo_k  (heartbeat::ocf:IPaddr):  Started umd161
> | Resource Group: rg_ip_saprftp
> |     rs_ip_saprftp_k    (heartbeat::ocf:IPaddr):  Started umd161
> | Resource Group: rg_ip_saprproxy
> |     rs_ip_saprproxy_k  (heartbeat::ocf:IPaddr):  Started umd161
> `----
>
> Note the entries about the current DC. It seems that the crmd on
> umd161 doesn't work properly anymore either. When umd160 tries to
> connect to the cluster, it gets denied because of
>
> ,----
> | Ignoring HA message (op=join_offer) from umd160: not in our membership
> | list
> `----
>
> The resources are up and running, but I won't touch the systems because
> of their productive state... The other clusters have had the same
> freezes and identical behavior.
>
> What is the problem here? Is there a possibility to reactivate the
> living side (umd161) to let the other node reconnect? How can I avoid
> this in the future?
> It is curious that this problem happened to all three
> (totally separate) servers in the same manner, isn't it?
>
> Btw, this situation didn't occur in the past; the only thing that changed
> recently was the Heartbeat version upgrade (from vanilla SLES10, heartbeat
> 2.0.5, up to SLES10 SP1, heartbeat 2.0.8).
>
> Btw2, there are a lot of entries in the log about missing NICs - I
> think you can ignore these because they appeared during the first reboot,
> when the network didn't come up.
>
> Thanks in advance for looking into it!
>
> Regards,
> Mario Peter
> --
> Mario Peter
> de,pl,en
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
