[Linux-HA] crmd (?) becomes unresponsive
Hi all, I'm experiencing difficulties with my 2-node cluster and I'm running out of ideas about how to fix this. I'd be glad if someone here could point me in the right direction.

As said, it's a 2-node cluster, running openSUSE 13.1 with the HA-Factory packages:

cluster-glue: 1.0.12-rc1 (b5f1605097857b8b96bd517282ab300e2ad7af99)
resource-agents: # Build version: f725724964882a407f7f33a97124da07a2b28d5d
CRM Version: 1.1.10+git20140117.a3cda76-102.1 (1.1.10+git20140117.a3cda76)

pacemaker        1.1.10+git20140117.a3cda76-102.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
libpacemaker3    1.1.10+git20140117.a3cda76-102.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
corosync         2.3.2-48.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
libcorosync4     2.3.2-48.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
resource-agents  3.9.5-63.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
cluster-glue     1.0.12-0.rc1.69.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
libglue2         1.0.12-0.rc1.69.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64
ldirectord       3.9.5-63.1 - network:ha-clustering:Factory / openSUSE_13.1 x86_64

Both nodes have two NICs: one is connected to the world and the other connects both nodes with a crossover cable. An internal subnet is used here, and my /etc/hosts files are fine:

127.0.0.1   localhost
10.0.0.1    s00201.ser4.de   s00201
10.0.0.2    s00202.ser4.de   s00202

Corosync is configured with udpu, and the firewall does not block any traffic between the internal NICs.

My cluster is up and running, both nodes are providing some services, filesystems are mirrored by drbd, and the world is a happy place. :-) The cluster uses a valid and available DC. Editing resources and executing actions on them usually works fine. Sometimes, though, when I run a command like "crm resource migrate grp_nginx" or just a "crm resource cleanup pri_svc_varnish", those commands don't return and time out after a while.
At that point even a "crmadmin -D" does not return. This has happened a lot of times in the last days (I migrated to openSUSE 13.1 last week), so I tried different things to clear the problem, but nothing seems to work. It may happen that the STONITH mechanism is executed for one of the nodes. Interestingly, the other node does not seem to recognize that it's alone then; "crm status" sometimes still shows both nodes as online. In other cases it may occur that the second node comes up again after rebooting, but it doesn't get found by the first node and appears offline.

The network connection does not seem to have any problems. Communication is still possible between the nodes and I can see a lot of UDP traffic between both of them. Most of the time I solve this by rebooting the unresponsive cluster node, too. This leads to other problems because my drbd devices become out of sync, services get stopped, and so on. On the other hand, the node does not seem to heal itself, so no crm actions can successfully be executed.

The last time I dared to run a crm action was yesterday between 18:00 and 19:00. I created a full hb_report that should contain all relevant information, including the pe-input files. I also enabled debug logging for corosync, so extended logs are available, too. I used strace to find out what a simple "crmadmin -D" does.
It ends with:

---
uname({sys=Linux, node=s00201, ...}) = 0
uname({sys=Linux, node=s00201, ...}) = 0
uname({sys=Linux, node=s00201, ...}) = 0
uname({sys=Linux, node=s00201, ...}) = 0
uname({sys=Linux, node=s00201, ...}) = 0
futex(0x7fdccdcf2d48, FUTEX_WAKE_PRIVATE, 2147483647) = 0
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fdccf39f000
socket(PF_LOCAL, SOCK_STREAM, 0) = 3
fcntl(3, F_GETFD) = 0
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
fcntl(3, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
connect(3, {sa_family=AF_LOCAL, sun_path=@crmd}, 110) = 0
setsockopt(3, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
sendto(3, \377\377\377\377\0\0\0\0\30\0\0\0\0\0\0\0\0\0\2\0\0\0\0\0, 24, MSG_NOSIGNAL, NULL, 0) = 24
setsockopt(3, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0
recvfrom(3, 0x7fff4bc37d10, 12328, 16640, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}], 1, 4294967295
Process 3781 detached
...
---

(The full log is available.)

crmadmin tries to reach crmd, so I also straced the running crmd process. There's not much happening here:

---
Process 8669 attached
read(22,
Process 8669 detached
...
---

I killed the crmd process and it got restarted automatically (by pacemakerd?). After that, strace just shows countless messages like these:

---
Process 7856 attached
poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=21, events=POLLIN}], 3, 3000) = 2 ([{fd=20, revents=POLLHUP}, {fd=22, revents=POLLHUP}])
---
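As a stopgap while debugging, hanging CLI calls like these can at least be bounded with timeout(1) from coreutils so a stuck crmd doesn't also wedge your shell sessions. A minimal sketch (hypothetical wrapper; "sleep 5" stands in for a hung "crmadmin -D", which obviously needs a live cluster):

```shell
#!/bin/sh
# Bound a potentially hanging cluster command with timeout(1).
# "sleep 5" stands in for a hung "crmadmin -D"; on a real node you
# would wrap the actual command. timeout(1) exits 124 on expiry.
if timeout 1 sleep 5; then
    echo "responsive"
else
    echo "timed out"
fi
```

The same wrapper works for "crm resource migrate" and friends; pick a deadline comfortably above the cluster's normal response time so you only trip it when crmd is genuinely stuck.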
Re: [Linux-HA] crmd (?) becomes unresponsive
On 2014-01-22T09:55:10, Thomas Schulte <tho...@cupracer.de> wrote:

Hi Thomas,

since those are very recent upstream versions, I think you'll have a better chance if you ask directly on the pacemaker mailing list, or report directly via bugs.clusterlabs.org - at least for providing the attachments, that's the best option.

Best,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
[Linux-HA] Antw: crmd (?) becomes unresponsive
Hi!

I cannot really help you, but it has proved helpful to open a wide "tail -f /var/log/messages" window for every cluster node while issuing the actual commands in another window. Maybe you could also watch the cluster with hawk or crm_mon. My favourite option set is -1Arf...

Regards,
Ulrich

>>> Thomas Schulte <tho...@cupracer.de> wrote on 22.01.2014 at 09:55 in message <09d9d36ad571203a5b9b048da373d...@ser4.de>:
[...]
[Linux-HA] Antw: Re: crmd (?) becomes unresponsive
Hi!

We are living in a very distributed world, even when using one Linux distribution. Maybe those who know could post periodic reminders about which problems to post where... Unfortunately my experience was a ping-pong one: one list sends you to another, because nobody really wants to hear about bugs ;-)

Regards,
Ulrich

>>> Lars Marowsky-Bree <l...@suse.com> wrote on 22.01.2014 at 11:00 in message <20140122100050.gs8...@suse.de>:
[...]
Re: [Linux-HA] Antw: Re: crmd (?) becomes unresponsive
On 2014-01-22T11:18:06, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:

> We are living in a very distributed world, even when using one Linux distribution. Maybe those who know could post periodic reminders which problems to post where...

I thought that was what I just did? This is likely a pacemaker problem, and more pacemaker experts are subscribed to the clusterlabs list than here. It's also the right bugzilla for the attachments/logs.

Regards,
Lars
Re: [Linux-HA] Antw: Re: crmd (?) becomes unresponsive
Hi Lars,

> I thought that was what I just did? This is likely a pacemaker problem, and more pacemaker experts are subscribed to the clusterlabs list than here. It's also the right bugzilla for the attachments/logs.

Thank you very much for your suggestion, I'm fine with that. I created a bug report at ClusterLabs and attached my files there:

http://bugs.clusterlabs.org/show_bug.cgi?id=5192

Cheers,
Thomas
[Linux-HA] heartbeat failover
Hello,

I've got a drbd+nfs+heartbeat setup and in general it's working, but it takes too long to fail over and I'm trying to tune this. When node 1 is active and I shut down node 2, node 1 tries to activate the cluster. The problem is that node 1 already holds the primary role, and re-activating takes time again; during this the nfs share isn't available. Is it possible to disable this? Node 1 shouldn't have to do anything if it's already in the primary role and the second node is not available.

Best regards
Björn
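One possible direction (a sketch only, untested against a real heartbeat setup): make the takeover script idempotent by skipping the promote step when drbd already reports the local role as Primary. Here "drbd_role" is a stub standing in for "drbdadm role r0", which on DRBD 8.x prints the roles as local/peer, e.g. "Primary/Secondary":

```shell
#!/bin/sh
# Stub standing in for "drbdadm role r0" (DRBD 8.x prints
# "<local>/<peer>", e.g. "Primary/Secondary"). Replace with the
# real call on a cluster node.
drbd_role() { echo "Primary/Secondary"; }

# Take the local role (the part before the slash) and only promote
# when we are not already Primary.
local_role=$(drbd_role | cut -d/ -f1)
if [ "$local_role" = "Primary" ]; then
    echo "already Primary - nothing to do"
else
    echo "promoting"
    # "drbdadm primary r0" (and the nfs restart) would go here
fi
```

Whether heartbeat lets you hook this in cleanly depends on your resource scripts; the corosync/pacemaker stack handles this case natively via the DRBD master/slave resource agent.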
Re: [Linux-HA] heartbeat failover
On 22/01/14 10:44 AM, bjoern.bec...@easycash.de wrote:
[...]

If this is a new project, I strongly recommend switching out heartbeat for corosync/pacemaker. Heartbeat is deprecated, hasn't been developed in a long time, and there are no plans to restart development in the future. Everything (even RH) is standardizing on the corosync+pacemaker stack, so it has the most vibrant community as well.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
"What if the cure for cancer is trapped in the mind of a person without access to education?"