[Linux-HA] crmd (?) becomes unresponsive

2014-01-22 Thread Thomas Schulte

Hi all,

I'm experiencing difficulties with my 2-node cluster and I'm running
out of ideas about how to fix this. I'd be glad if someone here
could point me in the right direction.

As said, it's a 2-node cluster, running with openSUSE 13.1 and the 
HA-Factory packages:


cluster-glue: 1.0.12-rc1 (b5f1605097857b8b96bd517282ab300e2ad7af99)
resource-agents: # Build version: 
f725724964882a407f7f33a97124da07a2b28d5d
CRM Version: 1.1.10+git20140117.a3cda76-102.1 
(1.1.10+git20140117.a3cda76)
pacemaker 1.1.10+git20140117.a3cda76-102.1 - 
network:ha-clustering:Factory / openSUSE_13.1 x86_64
libpacemaker3 1.1.10+git20140117.a3cda76-102.1 - 
network:ha-clustering:Factory / openSUSE_13.1 x86_64
corosync 2.3.2-48.1 - network:ha-clustering:Factory / openSUSE_13.1 
x86_64
libcorosync4 2.3.2-48.1 - network:ha-clustering:Factory / openSUSE_13.1 
x86_64
resource-agents 3.9.5-63.1 - network:ha-clustering:Factory / 
openSUSE_13.1 x86_64
cluster-glue 1.0.12-0.rc1.69.1 - network:ha-clustering:Factory / 
openSUSE_13.1 x86_64
libglue2 1.0.12-0.rc1.69.1 - network:ha-clustering:Factory / 
openSUSE_13.1 x86_64
ldirectord 3.9.5-63.1 - network:ha-clustering:Factory / openSUSE_13.1 
x86_64


Both nodes have two NICs: one is connected to the world and the other
connects both nodes with a crossover cable. An internal subnet is used
here, and my /etc/hosts files are fine:

127.0.0.1   localhost
10.0.0.1   s00201.ser4.de s00201
10.0.0.2   s00202.ser4.de s00202

Corosync is configured with udpu and the firewall does not block any
traffic between the internal NICs.
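
For reference, a udpu setup over the internal subnet looks roughly like
the following corosync.conf sketch (placeholder values, not a copy of my
actual file):

```
totem {
    version: 2
    transport: udpu              # unicast UDP instead of multicast
    interface {
        ringnumber: 0
        bindnetaddr: 10.0.0.0    # internal crossover subnet
    }
}

nodelist {
    node {
        ring0_addr: 10.0.0.1     # s00201
        nodeid: 1
    }
    node {
        ring0_addr: 10.0.0.2     # s00202
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1                  # special-case quorum for 2-node clusters
}
```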



My cluster is up and running, both nodes are providing some services,
filesystems are mirrored by drbd and the world is a happy place. :-)
The cluster uses a valid and available DC. Editing and executing actions
on resources is usually working fine.

Sometimes, when I run a command like crm resource migrate grp_nginx
or just a crm resource cleanup pri_svc_varnish, it may happen that those
commands don't return but time out after a while. In this state even
a crmadmin -D does not return.

This happened a lot of times in the last few days (I migrated to openSUSE
13.1 last week), so I tried different things to clear the problem, but
nothing seems to work.
It may happen that the STONITH mechanism is executed for one of the nodes.
Interestingly, the other node does not seem to recognize that it's alone then.
crm status sometimes still shows both nodes as online. In other cases
it may occur that the second node comes up again after rebooting, but it
doesn't get found by the first node and appears offline.
The network connection does not seem to have any problems.
Communication is still possible between the nodes and I can see a lot of
UDP traffic between both nodes.

Most of the time I solve this by rebooting the unresponsive cluster node, too.
This leads to other problems because my drbd devices become out of sync,
services get stopped and so on. On the other hand, the node does not seem
to heal itself, so no crm actions can successfully be executed.

The last time that I dared to run a crm action was yesterday between
18:00 and 19:00.
I created a full hb_report that should contain all relevant information,
including the pe-input files. I also enabled the debug logging mode for
corosync, so extended logs are available, too.

I used strace to find out what a simple crmadmin -D does. It ends 
with:


---
uname({sys=Linux, node=s00201, ...}) = 0
uname({sys=Linux, node=s00201, ...}) = 0
uname({sys=Linux, node=s00201, ...}) = 0
uname({sys=Linux, node=s00201, ...}) = 0
uname({sys=Linux, node=s00201, ...}) = 0
futex(0x7fdccdcf2d48, FUTEX_WAKE_PRIVATE, 2147483647) = 0
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fdccf39f000
socket(PF_LOCAL, SOCK_STREAM, 0)        = 3
fcntl(3, F_GETFD)                       = 0
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
fcntl(3, F_SETFL, O_RDONLY|O_NONBLOCK)  = 0
connect(3, {sa_family=AF_LOCAL, sun_path=@crmd}, 110) = 0
setsockopt(3, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
sendto(3, "\377\377\377\377\0\0\0\0\30\0\0\0\0\0\0\0\0\0\2\0\0\0\0\0", 24, MSG_NOSIGNAL, NULL, 0) = 24
setsockopt(3, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0
recvfrom(3, 0x7fff4bc37d10, 12328, 16640, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}], 1, 4294967295 <detached ...>
Process 3781 detached
---

(The full log is available)

crmadmin tries to reach crmd, so I also straced the running crmd process.
There's not much happening here:

---
Process 8669 attached
read(22,  <detached ...>
Process 8669 detached
---

I killed the crmd process and it got restarted automatically (by 
pacemakerd?).

After that, strace just shows countless messages like these:

---
Process 7856 attached
poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=21, events=POLLIN}], 3, 3000) = 2 ([{fd=20, revents=POLLHUP}, {fd=22, revents=POLLHUP}])
---
Re: [Linux-HA] crmd (?) becomes unresponsive

2014-01-22 Thread Lars Marowsky-Bree
On 2014-01-22T09:55:10, Thomas Schulte tho...@cupracer.de wrote:

Hi Thomas,

since those are very recent upstream versions, I think you'll have a
better chance to ask directly on the pacemaker mailing list, or directly
report via bugs.clusterlabs.org - at least for providing the
attachments, that's the best option.


Best,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Antw: crmd (?) becomes unresponsive

2014-01-22 Thread Ulrich Windl
Hi!

I cannot really help you, but it proved to be helpful to open a wide tail -f
/var/log/messages window for every cluster node while issuing the actual
commands in another window. Maybe you could also watch the cluster with hawk or
crm_mon. My favourite option set is -1Arf...

Regards,
Ulrich
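
That is, something along these lines on each node (a sketch of the setup
I mean, not a script):

```shell
# Per cluster node, in its own wide terminal: follow the syslog
# while issuing crm commands elsewhere.
tail -f /var/log/messages

# One-shot cluster overview: node attributes (-A), inactive
# resources (-r) and resource fail counts (-f).
crm_mon -1Arf
```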

 Thomas Schulte tho...@cupracer.de wrote on 22.01.2014 at 09:55 in message
 09d9d36ad571203a5b9b048da373d...@ser4.de:
 [...]

[Linux-HA] Antw: Re: crmd (?) becomes unresponsive

2014-01-22 Thread Ulrich Windl
Hi!

We are living in a very distributed world, even when using one Linux
distribution. Maybe those who know could post periodic reminders which problems
to post where...

Unfortunately my experience was a ping-pong one: One list sends you to
another, because nobody really wants to hear about bugs ;-)

Regards,
Ulrich

 Lars Marowsky-Bree l...@suse.com wrote on 22.01.2014 at 11:00 in message
20140122100050.gs8...@suse.de:
 On 2014-01-22T09:55:10, Thomas Schulte tho...@cupracer.de wrote:
 
 Hi Thomas,
 
 since those are very recent upstream versions, I think you'll have a
 better chance to ask directly on the pacemaker mailing list, or directly
 report via bugs.clusterlabs.org - at least for providing the
 attachments, that's the best option.
 
 



Re: [Linux-HA] Antw: Re: crmd (?) becomes unresponsive

2014-01-22 Thread Lars Marowsky-Bree
On 2014-01-22T11:18:06, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 We are living in a very distributed world, even when using one Linux
 distribution. Maybe those who know could post periodic reminders which
 problems to post where...

I thought that was what I just did? This is likely a pacemaker problem,
and more pacemaker experts are subscribed to the clusterlabs list than
here. Also, the right bugzilla for the attachments/logs.


Regards,
Lars



Re: [Linux-HA] Antw: Re: crmd (?) becomes unresponsive

2014-01-22 Thread Thomas Schulte

Hi Lars,


I thought that was what I just did? This is likely a pacemaker problem,
and more pacemaker experts are subscribed to the clusterlabs list than
here. Also, the right bugzilla for the attachments/logs.


Regards,
Lars



Thank you very much for your suggestion; I'm fine with that.

I created a bug report at ClusterLabs and attached my files there:

http://bugs.clusterlabs.org/show_bug.cgi?id=5192



Cheers,
Thomas


[Linux-HA] heartbeat failover

2014-01-22 Thread Bjoern.Becker
Hello,

I've got a drbd+nfs+heartbeat setup and in general it's working. But it takes
too long to fail over, and I'm trying to tune this.

When node 1 is active and I shut down node 2, node 1 tries to activate the
cluster.
The problem is that node 1 already holds the primary role, and re-activating
it takes time again; during this the nfs share isn't available.

Is it possible to disable this? Node 1 shouldn't have to do anything if it's
already in the primary role and the second node is not available.

Mit freundlichen Grüßen / Best regards
Björn





Re: [Linux-HA] heartbeat failover

2014-01-22 Thread Digimer

On 22/01/14 10:44 AM, bjoern.bec...@easycash.de wrote:

 [...]


If this is a new project, I strongly recommend switching out heartbeat 
for corosync/pacemaker. Heartbeat is deprecated, hasn't been developed 
in a long time and there are no plans to restart development in the 
future. Everything (even RH) is standardizing on the corosync+pacemaker 
stack, so it has the most vibrant community as well.
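
That said, if you stay on heartbeat for now, the behaviour you describe is
often governed by the auto_failback directive in ha.cf. A minimal v1-style
sketch (directive names are real, values are illustrative, and this may or
may not match your exact symptom):

```
# /etc/ha.d/ha.cf -- minimal v1-style sketch
keepalive 2            # heartbeat interval in seconds
deadtime 15            # declare a peer dead after this many seconds
auto_failback off      # don't move resources back when a node returns
node s00201
node s00202
```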


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?
