[Openais] Can't get udpu to work with basic 2-node Corosync cluster.

ROSSER Brad (SPARQ) Fri, 14 Dec 2012 00:31:56 -0800

Hi.  I'm having very basic problems in getting Corosync to work for a 2-node 
cluster using unicast UDP transport.


I'm running Red Hat Enterprise Linux 6.2 on the two nodes, with Red Hat's 
Pacemaker 1.1.6-3 and Corosync 1.4.1-4 RPM packages installed.

I had followed the steps in the (excellent) 'Pacemaker 1.1 - Clusters from 
Scratch' manual, got to the end of Chapter 5, which deals with testing failover 
of a simple virtual IP resource from one node to the other upon shutting down 
Corosync & Pacemaker on the first node, and then thought I'd stop at that point 
and repeat the exercise with the transport changed from the default UDP 
multicast transport to 'udpu', as I will be forced to use unicast UDP in my 
final configuration.  I modified my corosync.conf files on the two nodes and I 
am having significant problems.

Both test nodes are KVM virtual machines sitting 'side by side' on the same 
hypervisor.  Node H has the address 10.198.156.47; node I has address 
10.198.156.48.  I've appended the corosync.conf files to the end of this 
message.  The Pacemaker 'pcmk' plugin is set to version '1'.

My first problem is that the cluster won't start until *both* nodes are 
started.  When I start the corosync service on node H it goes through the 
regular startup sequence in the log file with no problems but then - instead of 
forming a one-node cluster and establishing the virtual IP resource 
(no-quorum-policy is set to "ignore") - it goes into a loop, producing these 
messages every two seconds:

 corosync [TOTEM ] Totem is unable to form a cluster because of an operating 
system or network fault. The most common cause of this message is that the 
local firewall is configured improperly.

When I start the pacemaker service only the 'pacemakerd' daemon starts; it 
doesn't fire off the various (including crmd) heartbeat processes.

It's only when I start corosync and packemaker on the second node, node I, that 
both machines form a cluster, the heartbeat processes start on both, and the 
virtual IP resource is configured on node H.

So that's my first problem; on starting a single node cold the cluster doesn't 
form, like it does for multicast.  The first node goes into its error message 
loop until the second node is also started.

The problem is compounded further by what happens when I test a shutdown of the 
second node, machine I.  When pacemaker and corosync are shut down on node I 
node H again starts printing the same 'network fault?' message every two 
seconds.  However the pacemaker/heartbeat processes remain up on H and crm_mon 
correctly reports that node I is offline, etc.

BUT ... when I restart corosync and pacemaker once more on node I immediately 
afterwards *both* nodes go into the 'network fault' loop, with node I just like 
H at the start (no heartbeat processes spawned by pacemakerd) and crm_mon still 
reporting node I as offline.  The pacemakerd process ultimately exits on node I 
about 9 minutes later saying:

  pacemakerd: [15668]: ERROR: main: Couldn't connect to Corosync's CFG service

There seems to be something quite wrong here -

1.  The first node won't form a cluster of its own; it only starts the 
heartbeat process once the second node starts up;

2.  Both nodes fail to form a cluster when one of the nodes is stopped and 
restarted.

The 'no-quorum-policy' is set to 'ignore' (as per chapter 5).

There are no firewall (iptables) rules at all on either of the VM nodes.

UDP was tested as working fine between the two (with netcat (nc)).

And if I put back the original multicast corosync.conf files on both nodes 
everything goes back to working as expected.  So no other files or settings 
seem to be involved.

(As an aside I'll mention that I couldn't get node I to work at all - it 
produced a group of pcmk error messages every few seconds - until I modified 
its corosync.conf to have itself, node I, as the first 'member' in the 
totem.interface block.  It woudn't work at all when listed second.)

Can anyone help?  I won't be able to use multicast in my final production 
configuration, so I desperately need corosync to be able to work properly with 
udpu.  I'd be most grateful for any assistance.

Thanks,


Brad


This is the output of 'crm configure show':

  node node_h
  node node_i
  primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="10.198.156.50" cidr_netmask="32" \
        op monitor interval="30s"
  property $id="cib-bootstrap-options" \
        dc-version="1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"


This is the corosync.conf of node 'H':

  compatibility: whitetank

  totem {
    version: 2
    secauth: off
    threads: 0

    interface {
      member {
        mamberaddr: 10.198.156.47
      }
      member {
        memberaddr: 10.198.156.48
      }

      ringnumber: 0
      bindnetaddr: 10.198.156.0
      mcastport: 5405
      ttl: 1
    }

    transport: udpu
  }

  logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
      subsys: AMF
      debug: off
    }
  }

  amf {
    mode: disabled
  }


This is the file for node I:

  compatibility: whitetank

  totem {
    version: 2
    secauth: off
    threads: 0

    interface {
      member {
        mamberaddr: 10.198.156.48
      }
      member {
        memberaddr: 10.198.156.47
      }

      ringnumber: 0
      bindnetaddr: 10.198.156.0
      mcastport: 5405
      ttl: 1
    }

    transport: udpu
  }

  logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
      subsys: AMF
      debug: off
    }
  }

  amf {
    mode: disabled
  }


Thanks again.

 
*************************************************************************************
This email message (including any file attachments transmitted with it) is for 
the sole use of the intended recipient(s) and may contain confidential and 
legally privileged information. Any unauthorised review, use, alteration, 
disclosure or distribution of this email (including any attachments) by an 
unintended recipient is prohibited. If you have received this email in error, 
please notify the sender by return email and destroy all copies of the original 
message. Any confidential or legal professional privilege is not waived or lost 
by any mistaken delivery of the email. SPARQ Solutions accepts no 
responsibility for the content of any email which is sent by an employee which 
is of a personal nature.

Sender Details:
  SPARQ Solutions
  PO Box 15760 City East, Brisbane QLD Australia 4002
  +61 7 4931 2222

SPARQ Solutions policy is to not send unsolicited electronic messages. 
Suspected breaches of this policy can be reported by replying to this message 
including the original message and the word "UNSUBSCRIBE" in the subject. 

*************************************************************************************

_______________________________________________
Openais mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/openais

[Openais] Can't get udpu to work with basic 2-node Corosync cluster.

Reply via email to