I just came back from a trip and made some changes to my cluster.conf, and now I am getting a clearer error:
May 10 20:27:23 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor
May 10 20:27:23 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed

I also got more information telling me that cluster services on node 1 are down; when I restart rgmanager, it starts working. More details:

[r...@vmapache2 ~]# service rgmanager status
clurgmgrd (pid 1866) is running...
[r...@vmapache2 ~]# cman_tool status
Version: 6.2.0
Config Version: 60
Cluster Name: clusterapache01
Cluster Id: 38965
Cluster Member: Yes
Cluster Generation: 300
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Quorum device votes: 1
Total votes: 3
Quorum: 2
Active subsystems: 10
Flags: Dirty
Ports Bound: 0 11 177
Node name: vmapache2.foo.com
Node ID: 2
Multicast addresses: 225.0.0.1
Node addresses: 172.19.168.122

From /var/log/messages:
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] got nodejoin message 172.19.168.121
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] got nodejoin message 172.19.168.122
May 10 20:27:07 vmapache2 openais[1562]: [CPG ] got joinlist message from node 2
May 10 20:27:23 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:27:23 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (35940).
May 10 20:27:23 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor
May 10 20:27:23 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:27:29 vmapache2 kernel: dlm: connecting to 1
May 10 20:27:29 vmapache2 kernel: dlm: got connection from 1
May 10 20:27:41 vmapache2 clurgmgrd[1867]: <info> State change: vmapache1.foo.com UP

[r...@vmapache2 ~]# tail -n 100 /var/log/messages
May 10 20:24:25 vmapache2 openais[1562]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
May 10 20:24:25 vmapache2 openais[1562]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
May 10 20:24:25 vmapache2 openais[1562]: [TOTEM] entering GATHER state from 2.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] entering GATHER state from 0.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Creating commit token because I am the rep.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Saving state aru 49 high seq received 49
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Storing new sequence id for ring 128
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] entering COMMIT state.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] entering RECOVERY state.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] position [0] member 172.19.168.122:
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] previous ring seq 292 rep 172.19.168.121
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] aru 49 high delivered 49 received flag 1
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Did not need to originate any messages in recovery.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Sending initial ORF token
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] CLM CONFIGURATION CHANGE
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] New Configuration:
May 10 20:24:30 vmapache2 fenced[1620]: vmapache1.foo.com not a cluster member after 0 sec post_fail_delay
May 10 20:24:30 vmapache2 kernel: dlm: closing connection to node 1
May 10 20:24:30 vmapache2 clurgmgrd[1867]: <info> State change: vmapache1.foo.com DOWN
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] r(0) ip(172.19.168.122)
May 10 20:24:30 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com"
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] Members Left:
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] r(0) ip(172.19.168.121)
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] Members Joined:
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] CLM CONFIGURATION CHANGE
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] New Configuration:
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] r(0) ip(172.19.168.122)
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] Members Left:
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] Members Joined:
May 10 20:24:30 vmapache2 openais[1562]: [SYNC ] This node is within the primary component and will provide service.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] entering OPERATIONAL state.
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] got nodejoin message 172.19.168.122
May 10 20:24:30 vmapache2 openais[1562]: [CPG ] got joinlist message from node 2
May 10 20:24:35 vmapache2 clurgmgrd[1867]: <info> Waiting for node #1 to be fenced
May 10 20:24:47 vmapache2 qdiskd[1604]: <info> Assuming master role
May 10 20:24:49 vmapache2 openais[1562]: [CMAN ] lost contact with quorum device
May 10 20:24:49 vmapache2 openais[1562]: [CMAN ] quorum lost, blocking activity
May 10 20:24:49 vmapache2 clurgmgrd[1867]: <emerg> #1: Quorum Dissolved
May 10 20:24:49 vmapache2 qdiskd[1604]: <notice> Writing eviction notice for node 1
May 10 20:24:49 vmapache2 openais[1562]: [CMAN ] quorum regained, resuming activity
May 10 20:24:49 vmapache2 clurgmgrd: [1867]: <info> Stopping Service apache:web1
May 10 20:24:49 vmapache2 clurgmgrd: [1867]: <err> Checking Existence Of File /var/run/cluster/apache/apache:web1.pid [apache:web1] > Failed - File Doesn't Exist
May 10 20:24:49 vmapache2 clurgmgrd: [1867]: <info> Stopping Service apache:web1 > Succeed
May 10 20:24:49 vmapache2 clurgmgrd[1867]: <notice> Quorum Regained
May 10 20:24:49 vmapache2 clurgmgrd[1867]: <info> State change: Local UP
May 10 20:24:51 vmapache2 qdiskd[1604]: <notice> Node 1 evicted
May 10 20:25:00 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:25:00 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (32130).
May 10 20:25:00 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor
May 10 20:25:00 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:25:05 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com"
May 10 20:25:36 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:25:36 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (33270).
May 10 20:25:36 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor
May 10 20:25:36 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:25:41 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com"
May 10 20:26:11 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:26:11 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:26:16 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com"
May 10 20:26:47 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:26:47 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (35010).
May 10 20:26:47 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor
May 10 20:26:47 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:26:52 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com"
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] entering GATHER state from 11.
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] Saving state aru 10 high seq received 10
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] Storing new sequence id for ring 12c
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] entering COMMIT state.
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] entering RECOVERY state.
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] position [0] member 172.19.168.121:
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] previous ring seq 296 rep 172.19.168.121
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] aru a high delivered a received flag 1
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] position [1] member 172.19.168.122:
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] previous ring seq 296 rep 172.19.168.122
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] aru 10 high delivered 10 received flag 1
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] Did not need to originate any messages in recovery.
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] CLM CONFIGURATION CHANGE
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] New Configuration:
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] r(0) ip(172.19.168.122)
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] Members Left:
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] Members Joined:
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] CLM CONFIGURATION CHANGE
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] New Configuration:
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] r(0) ip(172.19.168.121)
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] r(0) ip(172.19.168.122)
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] Members Left:
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] Members Joined:
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] r(0) ip(172.19.168.121)
May 10 20:27:07 vmapache2 openais[1562]: [SYNC ] This node is within the primary component and will provide service.
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] entering OPERATIONAL state.
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] got nodejoin message 172.19.168.121
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] got nodejoin message 172.19.168.122
May 10 20:27:07 vmapache2 openais[1562]: [CPG ] got joinlist message from node 2
May 10 20:27:23 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:27:23 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (35940).
May 10 20:27:23 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor
May 10 20:27:23 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:27:29 vmapache2 kernel: dlm: connecting to 1
May 10 20:27:29 vmapache2 kernel: dlm: got connection from 1
May 10 20:27:41 vmapache2 clurgmgrd[1867]: <info> State change: vmapache1.foo.com UP

Here is my cluster.conf file:

<?xml version="1.0"?>
<cluster alias="clusterapache01" config_version="60" name="clusterapache01">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="60"/>
  <clusternodes>
    <clusternode name="vmapache1.foo.com" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device domain="vmapache1" name="xenfence1"/>
        </method>
      </fence>
      <multicast addr="225.0.0.1" interface="eth1"/>
    </clusternode>
    <clusternode name="vmapache2.foo.com" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device domain="vmapache2" name="xenfence2"/>
        </method>
      </fence>
      <multicast addr="225.0.0.1" interface="eth1"/>
    </clusternode>
  </clusternodes>
  <cman expected_votes="3">
    <multicast addr="225.0.0.1"/>
  </cman>
  <fencedevices>
    <fencedevice agent="fence_xvm" key_file="/etc/cluster/fence_xvm-host1.key" name="xenfence1"/>
    <fencedevice agent="fence_xvm" key_file="/etc/cluster/fence_xvm-host2.key" name="xenfence2"/>
  </fencedevices>
  <rm log_level="7">
    <failoverdomains>
      <failoverdomain name="prefer_node1" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="vmapache1.foo.com" priority="1"/>
        <failoverdomainnode name="vmapache2.foo.com" priority="2"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="172.19.52.120" monitor_link="1"/>
      <netfs export="/data" force_unmount="0" fstype="nfs4" host="172.19.50.114" mountpoint="/var/www/html" name="htdoc" options="rw,no_root_squash"/>
      <apache config_file="conf/httpd.conf" name="web1" server_root="/etc/httpd" shutdown_wait="0"/>
    </resources>
    <service autostart="1" domain="prefer_node1" exclusive="0" name="web-scs" recovery="relocate">
      <ip ref="172.19.52.120"/>
      <apache ref="web1"/>
    </service>
  </rm>
  <fence_xvmd/>
  <totem consensus="4800" join="60" token="10000" token_retransmits_before_loss_const="20"/>
  <quorumd device="/dev/sda1" interval="2" min_score="1" tko="10" votes="1">
    <heuristic interval="2" program="ping -c1 -t1 172.19.52.119" score="1"/>
  </quorumd>
</cluster>

Best Regards,
Carlos Vermejo Ruiz
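
As a cross-check of the numbers in the configuration above, here is a minimal Python sketch (not part of the cluster tooling) that reads a trimmed copy of the cluster.conf and derives the vote and timer values. The `summarize` helper and the trimmed `CLUSTER_CONF` string are hypothetical names used only for illustration. The sketch assumes the usual cman majority rule, quorum = expected_votes // 2 + 1, which is consistent with the "Quorum: 2" reported by cman_tool status, and it takes the quorum-disk timeout to be interval * tko seconds and the totem token value to be in milliseconds.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf above: only the elements that carry
# votes or timers are kept; attribute values are unchanged.
CLUSTER_CONF = """
<cluster alias="clusterapache01" config_version="60" name="clusterapache01">
  <clusternodes>
    <clusternode name="vmapache1.foo.com" nodeid="1" votes="1"/>
    <clusternode name="vmapache2.foo.com" nodeid="2" votes="1"/>
  </clusternodes>
  <cman expected_votes="3"/>
  <totem consensus="4800" join="60" token="10000" token_retransmits_before_loss_const="20"/>
  <quorumd device="/dev/sda1" interval="2" min_score="1" tko="10" votes="1"/>
</cluster>
"""


def summarize(conf_xml):
    """Return the vote and timer figures derived from a cluster.conf string."""
    root = ET.fromstring(conf_xml)

    # Sum the votes of every <clusternode>, then add the quorum disk's votes.
    node_votes = sum(int(n.get("votes", "1")) for n in root.iter("clusternode"))
    quorumd = root.find("quorumd")
    qdisk_votes = int(quorumd.get("votes", "0"))

    expected = int(root.find("cman").get("expected_votes"))
    token_ms = int(root.find("totem").get("token"))

    return {
        "total_votes": node_votes + qdisk_votes,
        "expected_votes": expected,
        # Assumed majority rule: more than half of the expected votes.
        "quorum": expected // 2 + 1,
        # Assumed qdisk timeout: missed-cycle limit times the cycle length.
        "qdisk_timeout_s": int(quorumd.get("interval")) * int(quorumd.get("tko")),
        "token_timeout_s": token_ms / 1000,
    }


if __name__ == "__main__":
    for key, value in summarize(CLUSTER_CONF).items():
        print(f"{key}: {value}")
```

With the values above this gives 3 total votes, a quorum of 2, a quorum-disk timeout of 20 seconds and a totem token timeout of 10 seconds. Whether the relationship between those two timers contributes to the quorum loss logged at 20:24:49 is not established by the log alone.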
--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster
