Re: [Linux-cluster] Problem with service migration with xen domU on diferent dom0 with redhat 5.4

Carlos VERMEJO RUIZ Sun, 25 Apr 2010 08:58:50 -0700

Some corrections: I changed the domain names to bogusdomain.com.

----- Original message -----
From: "Carlos VERMEJO RUIZ" <[email protected]>
To: [email protected]
Sent: Sunday, 25 April 2010 10:41:02
Subject: Re: Problem with service migration with xen domU on diferent dom0 with
redhat 5.4



Almost solved:

I double-checked my multicast traffic and found that no multicast traffic could
pass from server1 to server2. I corrected this by changing my hosts table so
that the node names point to eth3 (the interface that connects both machines
with a crossover cable) and distributing it to the domUs and dom0s. I checked
multicast communication with "nc -u -vvn -z <multicast_IP> 5405". Now both
nodes can see each other's status properly, and node2 can see the services that
are running on node1, which it could not before. Another thing I did was to
change the keys: I am now using the same key for the domUs and dom0s because
fencing was not working.
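
A send-only UDP probe may not show whether multicast packets actually arrive on
the other node. A minimal Python 3 sketch of a two-sided check follows. The
group 225.0.0.1 and port 5405 come from this thread; the function names and the
iface_ip parameter (the local IP address of eth1 or eth3) are hypothetical, and
if the cluster stack already holds port 5405 a spare port may be a safer
choice.

```python
import socket

GROUP = "225.0.0.1"  # multicast address used in this thread
PORT = 5405          # port used in the nc check (assumption: free for testing)


def build_mreq(group, iface_ip):
    """Pack an ip_mreq value: group address followed by the local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(iface_ip)


def open_receiver(group, port, iface_ip):
    """Return a UDP socket that has joined `group` on the interface owning `iface_ip`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    build_mreq(group, iface_ip))
    return sock


def send_probe(group, port, iface_ip, payload=b"mcast-probe"):
    """Send one multicast datagram out of the interface owning `iface_ip`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                    socket.inet_aton(iface_ip))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    try:
        sock.sendto(payload, (group, port))
    finally:
        sock.close()
```

To use it, call open_receiver(GROUP, PORT, "<local IP>") on one node, set a
timeout with settimeout(), and wait in recvfrom(1024); then call
send_probe(GROUP, PORT, "<local IP>") on the other node. If nothing arrives
before the timeout, multicast is probably still blocked somewhere on the path.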

In operation, when I turn off vmapache1 (node1), vmapache2 (node2) detects that
it is offline and starts the service on its own machine. When node1 comes back
up the service does not fail back; in my case this is desirable, but I tested
with failback enabled and it did not work. Migration does not work either. As
for fencing, when I send the reboot or off action it gives me a successful
answer, but it does not turn off or reboot the virtual machine vmapache1.

But on dom0 I found these messages:

Domain-0 00000000-0000-0000-0000-000000000000 00001 00001 
vmapache1 a41132aa-6dc3-137c-e4a7-57c31ba5208a 00001 00002 
vmjboss1 11f0d6c1-8c51-a792-67a8-807ceaa7157b 00001 00002 
Request to fence: vmapache1 
vmapache1 is running locally 
Plain TCP request 
Failed to call back 
Could call back for fence request: Bad file descriptor 
Domain UUID Owner State 
------ ---- ----- ----- 
Domain-0 00000000-0000-0000-0000-000000000000 00001 00001 
vmapache1 a41132aa-6dc3-137c-e4a7-57c31ba5208a 00001 00002 
vmjboss1 11f0d6c1-8c51-a792-67a8-807ceaa7157b 00001 00002 
Domain UUID Owner State 
------ ---- ----- ----- 
Domain-0 00000000-0000-0000-0000-000000000000 00001 00001 
vmapache1 a41132aa-6dc3-137c-e4a7-57c31ba5208a 00001 00002 
vmjboss1 11f0d6c1-8c51-a792-67a8-807ceaa7157b 00001 00002 
Request to fence: vmapache1 
vmapache1 is running locally 
Plain TCP request 
Failed to call back 
Could call back for fence request: Bad file descriptor 
Domain UUID Owner State 
------ ---- ----- ----- 
Domain-0 00000000-0000-0000-0000-000000000000 00001 00001 
vmapache1 a41132aa-6dc3-137c-e4a7-57c31ba5208a 00001 00002 
vmjboss1 11f0d6c1-8c51-a792-67a8-807ceaa7157b 00001 00002 

Q. Could this "Bad file descriptor" be caused by some error in the node name of
the service? How can I check this name?

Also, in the logs I found this:

Apr 24 22:15:38 vmapache01 clurgmgrd: [1842]: <info> Stopping Service 
apache:web1 
Apr 24 22:15:38 vmapache01 clurgmgrd: [1842]: <err> Checking Existence Of File
/var/run/cluster/apache/apache:web1.pid [apache:web1] > Failed - File Doesn't
Exist
Apr 24 22:15:38 vmapache01 clurgmgrd: [1842]: <info> Stopping Service 
apache:web1 > Succeed 
Apr 24 22:15:38 vmapache01 clurgmgrd[1842]: <info> Services Initialized 
Apr 24 22:15:38 vmapache01 clurgmgrd[1842]: <info> State change: Local UP 
Apr 24 22:15:38 vmapache01 clurgmgrd[1842]: <info> State change: 
vmapache2.bogusdomain.com UP 
Apr 24 22:15:43 vmapache01 clurgmgrd[1842]: <notice> Starting stopped service 
service:web-scs 
Apr 24 22:15:43 vmapache01 clurgmgrd: [1842]: <info> Adding IPv4 address 
172.19.52.120/24 to eth0 
Apr 24 22:15:45 vmapache01 clurgmgrd: [1842]: <info> Starting Service 
apache:web1 
Apr 24 22:15:45 vmapache01 clurgmgrd[1842]: <notice> Service service:web-scs 
started 
Apr 24 22:17:56 vmapache01 clurgmgrd[1842]: <notice> Stopping service 
service:web-scs 
Apr 24 22:17:56 vmapache01 clurgmgrd: [1842]: <info> Stopping Service 
apache:web1 
Apr 24 22:17:57 vmapache01 clurgmgrd: [1842]: <err> Stopping Service
apache:web1 > Failed - Application Is Still Running
Apr 24 22:17:57 vmapache01 clurgmgrd: [1842]: <err> Stopping Service 
apache:web1 > Failed 
Apr 24 22:17:57 vmapache01 clurgmgrd[1842]: <notice> stop on apache "web1" 
returned 1 (generic error) 
Apr 24 22:17:57 vmapache01 clurgmgrd: [1842]: <info> Removing IPv4 address 
172.19.52.120/24 from eth0 
Apr 24 22:18:07 vmapache01 clurgmgrd[1842]: <crit> #12: RG service:web-scs 
failed to stop; intervention required 
Apr 24 22:18:07 vmapache01 clurgmgrd[1842]: <notice> Service service:web-scs is 
failed 
Apr 24 22:18:08 vmapache01 clurgmgrd[1842]: <warning> #70: Failed to relocate 
service:web-scs; restarting locally 
Apr 24 22:18:08 vmapache01 clurgmgrd[1842]: <err> #43: Service service:web-scs 
has failed; can not start. 
Apr 24 22:18:08 vmapache01 clurgmgrd[1842]: <alert> #2: Service service:web-scs 
returned failure code. Last Owner: 
vmapache1.bogusdomain.com 
Apr 24 22:18:08 vmapache01 clurgmgrd[1842]: <alert> #4: Administrator 
intervention required. 

I must add that my httpd service uses port 443 with SSL, configured with valid
SSL certificates that point to the floating IP and a registered DNS domain, and
also the Apache module mod_jk, configured to load balance two JBoss virtual
machines. The configuration file for the SSL module has also been hardened.

Q. Do I have to change the cluster script for the apache service in order to
shut the service down properly? Any ideas?
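
The log above shows the resource agent looking for a PID file at
/var/run/cluster/apache/apache:web1.pid before it stops the service. One way to
investigate, sketched here in Python 3, is to check whether that file exists
and whether the PID stored in it is alive. The path pattern is taken from the
log; the function names are hypothetical, and this is only a diagnostic aid,
not part of rgmanager.

```python
import os

PID_DIR = "/var/run/cluster/apache"  # directory seen in the clurgmgrd log


def expected_pid_file(resource_name, pid_dir=PID_DIR):
    """Build the PID file path the log shows for an apache resource."""
    return os.path.join(pid_dir, "apache:%s.pid" % resource_name)


def read_pid(path):
    """Return the PID stored in `path`, or None if it is missing or malformed."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


def pid_is_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

For example, read_pid(expected_pid_file("web1")) returning None while httpd is
running would match the "File Doesn't Exist" message. One possible cause is
that httpd was started outside the cluster (for instance by its init script),
so the agent has no PID file of its own to stop.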



----------------------------------------- 
Carlos Vermejo Ruiz 
------------------------------------------- 

----- Original message -----
From: "Carlos VERMEJO RUIZ" <[email protected]>
To: [email protected]
Sent: Friday, 23 April 2010 22:41:35
Subject: Re: Problem with service migration with xen domU on diferent dom0 with
redhat 5.4


There are three things that I would try. The first is that multicast traffic
does not seem to be propagating well between the nodes. One point that I did
not mention is that all traffic goes through firewalls and switches; although I
opened TCP and UDP traffic, I am not so sure about multicast traffic. I ran the
test with fence_xvm -a 225.0.0.1 -I eth1 -H vmapache1 -ddd -o null, but when I
tried through the luci interface it did not work. Also, the multicast
interfaces are eth1 on the domUs and eth3 on the dom0s; perhaps something in my
config files is wrong, or I have to configure multicast traffic on the Linux
interface.

The second point concerns the keys. I am using two different keys: one for domU
vmapache1 and dom0 node1 on the host server1, and another for domU vmapache2
and dom0 node2 on the host server2. Is it necessary to share a key between the
domUs? Could I use one key for the domUs and dom0s?

The third point is to check the network configuration. Do I have to configure
something on the switches? What about the firewall and routers? My domUs have
two physical networks: one connected to a switch that serves the public, and
the other connected through a crossover cable between the domUs. Also, on the
dom0s I have two active interfaces: one connected to a switch serving the
internal network, and the other, eth3, which uses the same physical interface
as the crossover cable and is on the same network number as the domUs.

Any comments will be appreciated. 

Best regards, 



Carlos Vermejo Ruiz 

Dear Sir / Madam:


I am implementing a two-node cluster on domUs providing the Apache service
"webby"; the domUs are on different dom0s. This Apache service also load
balances JBoss virtual machines, but they are not part of the cluster. I have
also configured a virtual machine with an iSCSI target to provide a shared
quorum disk, so our quorum is 2 votes out of 3.
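
As a worked check of the "2 votes from 3" figure, here is a small Python
sketch. It assumes the usual majority rule (more than half of the expected
votes); the function names are hypothetical and are not part of cman.

```python
def quorum_threshold(expected_votes):
    """Smallest vote count that is a strict majority of `expected_votes`."""
    return expected_votes // 2 + 1


def is_quorate(votes_present, expected_votes):
    """True when the votes currently present reach the majority threshold."""
    return votes_present >= quorum_threshold(expected_votes)
```

With two nodes at one vote each plus one vote for the quorum disk,
quorum_threshold(3) gives 2, so a single surviving node that still holds the
quorum disk would remain quorate under this rule.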

The first thing that I noticed is that when I finished configuring the cluster
with luci, the service webby did not start automatically. I had to enable the
service, and then it started.

Initially I had a problem with fence_xvm. When I configured an individual
cluster on dom0 and started cman there, it used to start fence_xvmd, but I read
in one place that the dom0s had to be in another cluster, so I created another
cluster with both dom0s, and now they do not start fence_xvmd. That is why I am
running fence_xvmd standalone with this command: fence_xvmd -LX -a
225.0.0.1 -I eth3

When I tried to fence from the command line on the domU, it worked. I used the
command:

fence_xvm -a 225.0.0.1 -I eth1 -H frederick -ddd -o null 

and produced: 

Waiting for connection from XVM host daemon. 
Issuing TCP challenge 
Responding to TCP challenge 
TCP Exchange + Authentication done... 
Waiting for return value from XVM host 
Remote: Operation failed 

In luci I configured the multicast address 225.0.0.1 and interface eth1 for the
cluster on the domUs, and on the dom0s I configured the multicast address
225.0.0.1 and interface eth3 from the CLI.

Perhaps the problem I have is with the keys. I use one key that is shared
between dom0 and domU on server1 and another key that is shared between dom0
and domU on server2. On server1 I copied the key fence_xvm.key as
fence_xvm-host1.key and distributed it to the other domU and both dom0s. On
server2 I copied the key fence_xvm.key as fence_xvm-host2.key and distributed
it to the other domU and both dom0s.
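
Since several copies of each key are involved, it may help to confirm that
every copy is byte-identical. A minimal Python 3 sketch follows. The function
names are hypothetical; the idea is to run file_digest on each host and compare
the printed values, or to call keys_match on copies gathered in one place.

```python
import hashlib


def file_digest(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def keys_match(paths):
    """True when all files in `paths` (at least one) have identical content."""
    return len({file_digest(p) for p in paths}) == 1
```

For example, file_digest("/etc/cluster/fence_xvm-host1.key") should print the
same value on both domUs and both dom0s if the key was copied correctly.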

My cluster config is the following: 

<?xml version="1.0"?> 
<cluster alias="clusterapache01" config_version="52" name="clusterapache01"> 
<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="60"/> 
<clusternodes> 
<clusternode name="172.19.52.121" nodeid="1" votes="1"> 
<fence> 
<method name="1"> 
<device domain="vmapache1" name="xenfence1"/> 
</method> 
</fence> 
<multicast addr="225.0.0.1" interface="eth1"/> 
</clusternode> 
<clusternode name="172.19.52.122" nodeid="2" votes="1"> 
<fence> 
<method name="1"> 
<device domain="vmapache2" name="xenfence2"/> 
</method> 
</fence> 
<multicast addr="225.0.0.1" interface="eth1"/> 
</clusternode> 
</clusternodes> 
<cman expected_votes="3"> 
<multicast addr="225.0.0.1"/> 
</cman> 
<fencedevices> 
<fencedevice agent="fence_xvm" key_file="/etc/cluster/fence_xvm-host1.key" 
name="xenfence1"/> 
<fencedevice agent="fence_xvm" key_file="/etc/cluster/fence_xvm-host2.key" 
name="xenfence2"/> 
</fencedevices> 
<rm log_level="7"> 
<failoverdomains> 
<failoverdomain name="prefer_node1" nofailback="1" ordered="1" restricted="1"> 
<failoverdomainnode name="172.19.52.121" priority="1"/> 
<failoverdomainnode name="172.19.52.122" priority="2"/> 
</failoverdomain> 
</failoverdomains> 
<resources> 
<ip address="172.19.52.120" monitor_link="1"/> 
<apache config_file="conf/httpd.conf" name="httpd" server_root="/etc/httpd" 
shutdown_wait="0"/> 
<netfs export="/data" force_unmount="0" fstype="nfs4" host="172.19.50.114" 
mountpoint="/var/www/html" name="htdoc" options="rw,no_root_squash"/> 
</resources> 
<service autostart="1" domain="prefer_node1" exclusive="0" name="webby" 
recovery="relocate"> 
<ip ref="172.19.52.120"/> 
<apache ref="httpd"/> 
</service> 
</rm> 
<fence_xvmd/> 
<totem consensus="4800" join="60" token="10000" 
token_retransmits_before_loss_const="20"/> 
<quorumd device="/dev/sda1" interval="2" min_score="1" tko="10" votes="1"> 
<heuristic interval="2" program="ping -c1 -t1 172.19.52.119" score="1"/> 
</quorumd> 
</cluster> 
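
A configuration like the one above can be cross-checked mechanically. The
Python 3 sketch below uses only the standard library to list fence devices that
a node references but <fencedevices> does not declare, and to summarize the
votes. The function names are hypothetical. Treating a missing node votes
attribute as 1 and a missing quorumd votes attribute as 0 is an assumption of
this sketch.

```python
import xml.etree.ElementTree as ET


def undefined_fence_devices(conf_xml):
    """Return fence device names used by nodes but not declared in <fencedevices>."""
    root = ET.fromstring(conf_xml)
    declared = {d.get("name")
                for d in root.findall("./fencedevices/fencedevice")}
    used = {d.get("name")
            for d in root.findall("./clusternodes/clusternode/fence/method/device")}
    return sorted(used - declared)


def vote_summary(conf_xml):
    """Return node votes, quorum-disk votes and expected_votes from cluster.conf."""
    root = ET.fromstring(conf_xml)
    node_votes = sum(int(n.get("votes", "1"))
                     for n in root.findall("./clusternodes/clusternode"))
    qdisk = root.find("./quorumd")
    qdisk_votes = int(qdisk.get("votes", "0")) if qdisk is not None else 0
    cman = root.find("./cman")
    expected = cman.get("expected_votes") if cman is not None else None
    return {
        "node_votes": node_votes,
        "qdisk_votes": qdisk_votes,
        "expected_votes": int(expected) if expected is not None else None,
    }
```

Reading the file in binary mode, as in
undefined_fence_devices(open("/etc/cluster/cluster.conf", "rb").read()), lets
the parser handle the XML declaration itself.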

Another strange thing is that when I run clustat on vmapache1 it shows the
webby service as started on vmapache1, with both nodes and the quorum disk
online, but on vmapache2 clustat only shows both nodes and the quorum disk
online, and nothing about any service.

This is the log from when I tried to do a migration:
Apr 22 21:39:14 vmapache01 ccsd[2183]: Update of cluster.conf complete (version 
51 -> 52). 
Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: <notice> Reconfiguring 
Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: <info> Loading Service Data 
Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: <info> Applying new configuration 
#52 
Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: <info> Stopping changed resources. 
Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: <info> Restarting changed 
resources. 
Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: <info> Starting changed resources. 
Apr 22 21:40:07 vmapache01 clurgmgrd[2331]: <notice> Stopping service 
service:webby 
Apr 22 21:40:07 vmapache01 clurgmgrd: [2331]: <info> Stopping Service 
apache:httpd 
Apr 22 21:40:07 vmapache01 clurgmgrd: [2331]: <err> Checking Existence Of File 
/var/run/cluster/apache/apache:httpd.pid [apache:httpd] > Failed - File Doesn't 
Exist 
Apr 22 21:40:07 vmapache01 clurgmgrd: [2331]: <info> Stopping Service 
apache:httpd > Succeed 
Apr 22 21:40:07 vmapache01 clurgmgrd[2331]: <notice> Service service:webby is 
disabled 
Apr 22 21:40:07 vmapache01 clurgmgrd[2331]: <notice> Starting disabled service 
service:webby 
Apr 22 21:40:08 vmapache01 clurgmgrd: [2331]: <info> Adding IPv4 address 
172.19.52.120/24 to eth0 
Apr 22 21:40:09 vmapache01 clurgmgrd: [2331]: <info> Starting Service 
apache:httpd 
Apr 22 21:40:09 vmapache01 clurgmgrd[2331]: <notice> Service service:webby 
started 
Apr 22 21:43:29 vmapache01 qdiskd[5855]: <info> Quorum Daemon Initializing 
Apr 22 21:43:30 vmapache01 qdiskd[5855]: <info> Heuristic: 'ping -c1 -t1 
172.19.52.119' UP 
Apr 22 21:43:49 vmapache01 qdiskd[5855]: <info> Initial score 1/1 
Apr 22 21:43:49 vmapache01 qdiskd[5855]: <info> Initialization complete 
Apr 22 21:43:49 vmapache01 openais[2189]: [CMAN ] quorum device registered 
Apr 22 21:43:49 vmapache01 qdiskd[5855]: <notice> Score sufficient for master 
operation (1/1; required=1); upgrading 
Apr 22 21:44:13 vmapache01 qdiskd[5855]: <info> Assuming master role 
Apr 22 21:47:31 vmapache01 clurgmgrd[2331]: <notice> Stopping service 
service:webby 
Apr 22 21:47:31 vmapache01 clurgmgrd: [2331]: <info> Stopping Service 
apache:httpd 
Apr 22 21:47:33 vmapache01 clurgmgrd: [2331]: <err> Stopping Service 
apache:httpd > Failed - Application Is Still Running 
Apr 22 21:47:33 vmapache01 clurgmgrd: [2331]: <err> Stopping Service 
apache:httpd > Failed 
Apr 22 21:47:33 vmapache01 clurgmgrd[2331]: <notice> stop on apache "httpd" 
returned 1 (generic error) 
Apr 22 21:47:33 vmapache01 clurgmgrd: [2331]: <info> Removing IPv4 address 
172.19.52.120/24 from eth0 
Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: <crit> #12: RG service:webby failed 
to stop; intervention required 
Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: <notice> Service service:webby is 
failed 
Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: <warning> #70: Failed to relocate 
service:webby; restarting locally 
Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: <err> #43: Service service:webby 
has failed; can not start. 
Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: <alert> #2: Service service:webby 
returned failure code. Last Owner: 172.19.52.121 
Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: <alert> #4: Administrator 
intervention required. 
Apr 22 21:50:31 vmapache01 clurgmgrd[2331]: <notice> Stopping service 
service:webby 
Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: <info> Stopping Service 
apache:httpd 
Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: <err> Checking Existence Of File 
/var/run/cluster/apache/apache:httpd.pid [apache:httpd] > Failed - File Doesn't 
Exist 
Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: <info> Stopping Service 
apache:httpd > Succeed 
Apr 22 21:50:31 vmapache01 clurgmgrd[2331]: <notice> Service service:webby is 
disabled 
Apr 22 21:50:31 vmapache01 clurgmgrd[2331]: <notice> Starting disabled service 
service:webby 
Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: <info> Adding IPv4 address 
172.19.52.120/24 to eth0 
Apr 22 21:50:32 vmapache01 clurgmgrd: [2331]: <info> Starting Service 
apache:httpd 
Apr 22 21:50:33 vmapache01 clurgmgrd[2331]: <notice> Service service:webby 
started 
Apr 22 21:50:50 vmapache01 clurgmgrd[2331]: <notice> Stopping service 
service:webby 
Apr 22 21:50:51 vmapache01 clurgmgrd: [2331]: <info> Stopping Service 
apache:httpd 
Apr 22 21:50:52 vmapache01 clurgmgrd: [2331]: <err> Stopping Service 
apache:httpd > Failed - Application Is Still Running 
Apr 22 21:50:52 vmapache01 clurgmgrd: [2331]: <err> Stopping Service 
apache:httpd > Failed 
Apr 22 21:50:52 vmapache01 clurgmgrd[2331]: <notice> stop on apache "httpd" 
returned 1 (generic error) 
Apr 22 21:50:52 vmapache01 clurgmgrd: [2331]: <info> Removing IPv4 address 
172.19.52.120/24 from eth0 
Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: <crit> #12: RG service:webby failed 
to stop; intervention required 
Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: <notice> Service service:webby is 
failed 
Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: <warning> #70: Failed to relocate 
service:webby; restarting locally 
Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: <err> #43: Service service:webby 
has failed; can not start. 
Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: <alert> #2: Service service:webby 
returned failure code. Last Owner: 172.19.52.121 
Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: <alert> #4: Administrator 
intervention required. 
Apr 22 21:52:41 vmapache01 clurgmgrd[2331]: <notice> Stopping service 
service:webby 
Apr 22 21:52:41 vmapache01 clurgmgrd: [2331]: <info> Stopping Service 
apache:httpd 
Apr 22 21:52:41 vmapache01 clurgmgrd: [2331]: <err> Checking Existence Of File 
/var/run/cluster/apache/apache:httpd.pid [apache:httpd] > Failed - File Doesn't 
Exist 
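
The same stop failure repeats several times in the log above. A short Python 3
sketch for counting such events per service is shown below. It matches the
"#12: RG service:... failed" text exactly as it appears in this log; the
function name is hypothetical, and other rgmanager versions may word the
message differently.

```python
import re

# Pattern taken from the "<crit> #12" lines in the log above.
STOP_FAILED = re.compile(r"#12: RG (service:\S+) failed")


def services_failing_to_stop(log_text):
    """Count '#12 ... failed to stop' events per service name in `log_text`."""
    counts = {}
    for match in STOP_FAILED.finditer(log_text):
        name = match.group(1)
        counts[name] = counts.get(name, 0) + 1
    return counts
```

Feeding it the contents of /var/log/messages gives a quick view of which
services needed manual intervention and how often.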

If you see something wrong, let me know. Any help or ideas will be appreciated.

Best regards, 






----------------------------------------- 
Carlos Vermejo Ruiz 
-------------------------------------------

--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster
