Hi. I'm having very basic problems in getting Corosync to work for a 2-node
cluster using unicast UDP transport.
I'm running Red Hat Enterprise Linux 6.2 on the two nodes, with Red Hat's
Pacemaker 1.1.6-3 and Corosync 1.4.1-4 RPM packages installed.
I had followed the steps in the (excellent) 'Pacemaker 1.1 - Clusters from
Scratch' manual, got to the end of Chapter 5, which deals with testing failover
of a simple virtual IP resource from one node to the other upon shutting down
Corosync & Pacemaker on the first node, and then thought I'd stop at that point
and repeat the exercise with the transport changed from the default UDP
multicast transport to 'udpu', as I will be forced to use unicast UDP in my
final configuration. I modified my corosync.conf files on the two nodes and I
am having significant problems.
Both test nodes are KVM virtual machines sitting 'side by side' on the same
hypervisor. Node H has the address 10.198.156.47; node I has address
10.198.156.48. I've appended the corosync.conf files to the end of this
message. The Pacemaker 'pcmk' plugin is set to version '1'.
My first problem is that the cluster won't start until *both* nodes are
started. When I start the corosync service on node H it goes through the
regular startup sequence in the log file with no problems but then - instead of
forming a one-node cluster and establishing the virtual IP resource
(no-quorum-policy is set to "ignore") - it goes into a loop, producing these
messages every two seconds:
corosync [TOTEM ] Totem is unable to form a cluster because of an operating
system or network fault. The most common cause of this message is that the
local firewall is configured improperly.
When I start the pacemaker service only the 'pacemakerd' daemon starts; it
doesn't fire off the various (including crmd) heartbeat processes.
It's only when I start corosync and packemaker on the second node, node I, that
both machines form a cluster, the heartbeat processes start on both, and the
virtual IP resource is configured on node H.
So that's my first problem; on starting a single node cold the cluster doesn't
form, like it does for multicast. The first node goes into its error message
loop until the second node is also started.
The problem is compounded further by what happens when I test a shutdown of the
second node, machine I. When pacemaker and corosync are shut down on node I
node H again starts printing the same 'network fault?' message every two
seconds. However the pacemaker/heartbeat processes remain up on H and crm_mon
correctly reports that node I is offline, etc.
BUT ... when I restart corosync and pacemaker once more on node I immediately
afterwards *both* nodes go into the 'network fault' loop, with node I just like
H at the start (no heartbeat processes spawned by pacemakerd) and crm_mon still
reporting node I as offline. The pacemakerd process ultimately exits on node I
about 9 minutes later saying:
pacemakerd: [15668]: ERROR: main: Couldn't connect to Corosync's CFG service
There seems to be something quite wrong here -
1. The first node won't form a cluster of its own; it only starts the
heartbeat process once the second node starts up;
2. Both nodes fail to form a cluster when one of the nodes is stopped and
restarted.
The 'no-quorum-policy' is set to 'ignore' (as per chapter 5).
There are no firewall (iptables) rules at all on either of the VM nodes.
UDP was tested as working fine between the two (with netcat (nc)).
And if I put back the original multicast corosync.conf files on both nodes
everything goes back to working as expected. So no other files or settings
seem to be involved.
(As an aside I'll mention that I couldn't get node I to work at all - it
produced a group of pcmk error messages every few seconds - until I modified
its corosync.conf to have itself, node I, as the first 'member' in the
totem.interface block. It woudn't work at all when listed second.)
Can anyone help? I won't be able to use multicast in my final production
configuration, so I desperately need corosync to be able to work properly with
udpu. I'd be most grateful for any assistance.
Thanks,
Brad
This is the output of 'crm configure show':
node node_h
node node_i
primitive ClusterIP ocf:heartbeat:IPaddr2 \
params ip="10.198.156.50" cidr_netmask="32" \
op monitor interval="30s"
property $id="cib-bootstrap-options" \
dc-version="1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore"
This is the corosync.conf of node 'H':
compatibility: whitetank
totem {
version: 2
secauth: off
threads: 0
interface {
member {
mamberaddr: 10.198.156.47
}
member {
memberaddr: 10.198.156.48
}
ringnumber: 0
bindnetaddr: 10.198.156.0
mcastport: 5405
ttl: 1
}
transport: udpu
}
logging {
fileline: off
to_stderr: no
to_logfile: yes
to_syslog: yes
logfile: /var/log/cluster/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}
amf {
mode: disabled
}
This is the file for node I:
compatibility: whitetank
totem {
version: 2
secauth: off
threads: 0
interface {
member {
mamberaddr: 10.198.156.48
}
member {
memberaddr: 10.198.156.47
}
ringnumber: 0
bindnetaddr: 10.198.156.0
mcastport: 5405
ttl: 1
}
transport: udpu
}
logging {
fileline: off
to_stderr: no
to_logfile: yes
to_syslog: yes
logfile: /var/log/cluster/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}
amf {
mode: disabled
}
Thanks again.
*************************************************************************************
This email message (including any file attachments transmitted with it) is for
the sole use of the intended recipient(s) and may contain confidential and
legally privileged information. Any unauthorised review, use, alteration,
disclosure or distribution of this email (including any attachments) by an
unintended recipient is prohibited. If you have received this email in error,
please notify the sender by return email and destroy all copies of the original
message. Any confidential or legal professional privilege is not waived or lost
by any mistaken delivery of the email. SPARQ Solutions accepts no
responsibility for the content of any email which is sent by an employee which
is of a personal nature.
Sender Details:
SPARQ Solutions
PO Box 15760 City East, Brisbane QLD Australia 4002
+61 7 4931 2222
SPARQ Solutions policy is to not send unsolicited electronic messages.
Suspected breaches of this policy can be reported by replying to this message
including the original message and the word "UNSUBSCRIBE" in the subject.
*************************************************************************************
_______________________________________________
Openais mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/openais