Dejan Muhamedagic wrote:
On Fri, Apr 27, 2007 at 10:13:31AM +0200, Benjamin Watine wrote:
Dejan Muhamedagic wrote:
On Thu, Apr 26, 2007 at 04:02:47PM +0200, Benjamin Watine wrote:
OK. I've launched IPv6addr again (standalone, and managed by HB). It
crashes, but no core dump seems to have been generated today: there is
no file from today in the core dir. I don't know why.
The core file might be in the directory where you ran it from.
The file command shows me some old IPv6addr core dumps; you can find
their backtraces in the tar.gz generated by your script.
Thanks. Unfortunately, there is almost no debugging info. Most
cores are from stonithd (I'll investigate that, but it's
definitely in connection with the suicide agent), but there's also
one from cib on fclose() call, though we don't see when it
happened.
I'm about to apply the patch for suicide stonith proposed by Dave Blaschke:
http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1550
Good.
It seems better, even though some warnings appear, but I have not taken
the time to fully test suicide yet.
I attach the "file root/*" output so that you can find the files easily.
All these core dumps were generated by stonithd, IPv6addr, and pidof only.
You should probably report the pidof problem too, to your distro's
bugzilla.
I'll replace this RedHat with Debian, so that will fix the problem :)
Is Debian a good choice for Heartbeat, or are there known problems?
I believe that Debian's heartbeat should be in good shape. Horms
is taking care of that :)
Well, I'll be back with Debian then :)
Thanks for everything!
Ben
Regards
Ben
Dejan Muhamedagic wrote:
On Thu, Apr 26, 2007 at 10:54:36AM +0200, Benjamin Watine wrote:
Dejan Muhamedagic wrote:
On Wed, Apr 25, 2007 at 05:59:12PM +0200, Benjamin Watine wrote:
Dejan Muhamedagic wrote:
On Wed, Apr 25, 2007 at 11:59:02AM +0200, Benjamin Watine wrote:
You were right: it wasn't a score problem, but my IPv6 resource, which
causes an error and leaves the resource group unstarted.
Without IPv6 all is OK, and Heartbeat's behaviour fits my needs (start
on the preferred node (castor), and fail over after 3 failures). So my
problem is IPv6 now.
The script seems to have a problem:
# /etc/ha.d/resource.d/IPv6addr 2001:660:6301:301::47:1 start
*** glibc detected *** free(): invalid next size (fast):
0x000000000050d340 ***
/etc/ha.d/resource.d//hto-mapfuncs: line 51: 4764 Aborted
$__SCRIPT_NAME start
2007/04/25_11:43:29 ERROR: Unknown error: 134
ERROR: Unknown error: 134
Yet ifconfig now shows that the IPv6 address is correctly configured,
even though the script exits with an error code.
IPv6addr aborts, hence the exit code 134 (128+signo). Somebody
recently posted a set of patches for IPv6addr... Right, I'm cc-ing
this to Horms.
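For reference, the 128+signo convention mentioned above can be checked
with a few lines of Python. This is only a sketch: the helper name is
made up for illustration, and the signal names are whatever the local
platform's signal module reports.

```python
import signal

def signal_from_exit_status(status):
    """Return the signal name behind a shell exit status, or None.

    Shells report a command killed by signal N as exit status 128 + N.
    The function name is hypothetical, chosen for this illustration.
    """
    if status <= 128:
        return None  # ordinary exit code, not a death by signal
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        return None  # no signal with that number on this platform

print(signal_from_exit_status(134))
```

On Linux, where SIGABRT is signal 6, this should name SIGABRT for 134,
which matches the "Aborted" message printed by the shell.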
Thank you so much. I'll wait for Horms then. I'll also take a look at
the list archive.
BTW, wasn't there a core dump for this case too? Could you do
an ls -R /var/lib/heartbeat/cores and check?
I don't know how to find the core dump :/ In this case, should it be
core.22560?
Some newer releases of file(1) show the program name which dumped
the core:
$ file core.6468
core.6468: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV),
SVR4-style, from 'gaim'
Also, you can match the timestamps of the core files against those in
the logs.
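With many core files, the file(1) output can be grouped by program
with a short script. The following is a rough sketch, not part of
heartbeat: the function and pattern names are invented for the
example, and it assumes every entry carries the "from '...'" clause
that newer file(1) releases print. The sample lines are taken from
the listing posted later in this thread.

```python
import re
from collections import Counter

# Matches "root/core.3668: ELF ... from 'stonithd'" even when the mail
# client wrapped the entry over two lines.
CORE_ENTRY = re.compile(r"^(\S+):\s+ELF\b[^']*?from '([^']+)'", re.MULTILINE)

def cores_by_program(file_output):
    """Map each core file path to the program that dumped it."""
    return dict(CORE_ENTRY.findall(file_output))

sample = """\
root/core.12407: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'IPv6addr'
root/core.13407: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'pidof'
root/core.14899: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
"""

owners = cores_by_program(sample)
# Tally how many cores each program produced.
print(Counter(owners.values()))
```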
# /etc/ha.d/resource.d/IPv6addr 2001:660:6301:301::47:1 start
*** glibc detected *** free(): invalid next size (fast):
0x000000000050d340 ***
/etc/ha.d/resource.d//hto-mapfuncs: line 51: 22560 Aborted
$__SCRIPT_NAME start
2007/04/26_10:46:38 ERROR: Unknown error: 134
ERROR: Unknown error: 134
# ls -R /var/lib/heartbeat/cores
/var/lib/heartbeat/cores:
hacluster nobody root
/var/lib/heartbeat/cores/hacluster:
core.3620 core.4116 core.4119 core.4123 core.5262 core.5265
core.5269 core.5272
core.3626 core.4117 core.4121 core.4124 core.5263 core.5266
core.5270
core.3829 core.4118 core.4122 core.5256 core.5264 core.5268
core.5271
/var/lib/heartbeat/cores/nobody:
/var/lib/heartbeat/cores/root:
core.10766 core.21816 core.29951 core.3642 core.3650 core.3658
core.3667 core.4471
core.11379 core.23505 core.30813 core.3643 core.3651 core.3661
core.3668 core.4550
core.11592 core.24403 core.31033 core.3645 core.3652 core.3663
core.4234 core.5104
core.12928 core.24863 core.3489 core.3647 core.3653 core.3664
core.4371 core.5761
core.15849 core.25786 core.3591 core.3648 core.3654 core.3665
core.4394 core.6130
core.21501 core.28286 core.3610 core.3649 core.3657 core.3666
core.4470
Well, you have quite a few. Let's hope that they stem from only
those two errors.
I'll attach a script which should generate all backtraces from your
core files. It's been lightly tested but should work.
# ifconfig
eth0 Lien encap:Ethernet HWaddr 00:13:72:58:74:5F
inet adr:193.48.169.46 Bcast:193.48.169.63
Masque:255.255.255.224
adr inet6: 2001:660:6301:301:213:72ff:fe58:745f/64
Scope:Global
adr inet6: fe80::213:72ff:fe58:745f/64 Scope:Lien
adr inet6: 2001:660:6301:301::47:1/64 Scope:Global
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:3788 errors:0 dropped:0 overruns:0 frame:0
TX packets:3992 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 lg file transmission:1000
RX bytes:450820 (440.2 KiB) TX bytes:844188 (824.4 KiB)
Adresse de base:0xecc0 Mémoire:fe6e0000-fe700000
And if I launch the script again, no error is returned:
# /etc/ha.d/resource.d/IPv6addr 2001:660:6301:301::47:1 start
2007/04/25_11:45:23 INFO: Success
INFO: Success
So, you're saying that once the resource is running, starting it
again doesn't produce an error? Did you also try to stop it and
start it from the stopped state?
Yes, but probably because the script just checks that the IPv6 address
is set, and so doesn't try to set it again. If I stop it and start it
again, the error occurs.
As for the other errors: I have disabled stonith for the moment, and
DRBD is built into the kernel, so the drbd module is not needed. I've
seen this message, but it's not a problem.
There's a small problem with the stonith suicide agent, which
renders it unusable, but it is soon to be fixed.
OK, that's what I had read on this list, but I wasn't sure. Is there
a patch now?
I have attached the log and config, and the stonith core file
(/var/lib/heartbeat/cores/root/core.3668). Is this what you asked for
(a backtrace from the stonith core dump)?
You shouldn't be sending core dumps to a public list: they may
contain sensitive information. What I asked for, a backtrace, you
get like this:
$ gdb /usr/lib64/heartbeat/stonithd core.3668
(gdb) bt
... < here comes the backtrace
(gdb) quit
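The same gdb session can also be run non-interactively, which helps
when there are dozens of core files. The sketch below is not the
script attached to this thread; the function names are hypothetical,
and it assumes gdb is on the PATH and that the binary passed in is the
one that produced the core.

```python
import subprocess

def gdb_backtrace_command(binary, core):
    """Build a non-interactive gdb command line that prints a backtrace."""
    return ["gdb", "-batch", "-ex", "bt", binary, core]

def backtrace(binary, core):
    """Run gdb on one core file and return whatever it printed."""
    result = subprocess.run(
        gdb_backtrace_command(binary, core),
        capture_output=True, text=True, check=False,
    )
    return result.stdout

# Only the command line is built here; call backtrace() to actually run gdb.
cmd = gdb_backtrace_command("/usr/lib64/heartbeat/stonithd", "core.3668")
print(" ".join(cmd))
```

Only the text of the backtrace should be posted to the list, never the
core file itself.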
Oops! Here it is:
#0 0x00000039b9d03507 in stonith_free_hostlist () from
/usr/lib64/libstonith.so.1
#1 0x0000000000408a95 in ?? ()
#2 0x0000000000407fee in ?? ()
#3 0x00000000004073c3 in ?? ()
#4 0x000000000040539d in ?? ()
#5 0x0000000000405015 in ?? ()
#6 0x00000039b950abd4 in G_CH_dispatch_int () from
/usr/lib64/libplumb.so.1
#7 0x0000003a12a266bd in g_main_context_dispatch () from
/usr/lib64/libglib-2.0.so.0
#8 0x0000003a12a28397 in g_main_context_acquire () from
/usr/lib64/libglib-2.0.so.0
#9 0x0000003a12a28735 in g_main_loop_run () from
/usr/lib64/libglib-2.0.so.0
#10 0x000000000040341a in ?? ()
#11 0x0000003a0fd1c4ca in __libc_start_main () from
/lib64/tls/libc.so.6
#12 0x000000000040303a in ?? ()
#13 0x00007fff0f04b8d8 in ?? ()
#14 0x000000000000001c in ?? ()
#15 0x0000000000000001 in ?? ()
#16 0x00007fff0f04cb73 in ?? ()
#17 0x0000000000000000 in ?? ()
Thanks for taking the time to explain some basics to me...
You're welcome. As Andrew suggested, you should file bugs for both
of these. Interestingly, all those question marks mean that some
debugging info is missing, but then there is some in libplumb.so.
Odd. Where did you say your heartbeat package comes from?
heartbeat-2.0.8-2.el4.centos.x86_64.rpm
heartbeat-gui-2.0.8-2.el4.centos.x86_64.rpm
heartbeat-pils-2.0.8-2.el4.centos.x86_64.rpm
heartbeat-stonith-2.0.8-2.el4.centos.x86_64.rpm
From here IIRC:
http://dev.centos.org/centos/4/testing/x86_64/RPMS/
Ben
Thanks.
Thanks a lot for helping.
Ben
Dejan Muhamedagic wrote:
On Tue, Apr 24, 2007 at 06:36:04PM +0200, Benjamin Watine wrote:
Thank you, Dejan, for replying (I feel less alone now!)
That's good.
I've applied a location constraint to only one resource (slapd), as
you suggested, but it still doesn't work as expected.
I've read in the list archive that I have to sum the resource
stickiness of every resource in the group to get the effective
resource stickiness. It makes sense, so let's do it:
resource-stickiness at 100,
resource-failure-stickiness at -400
score on slapd on node1 (castor) at 1600
score on slapd on node2 (pollux) at 1000
If I apply the given relation (I have 6 resources in my group):
((1600 - 1000) + (6 * 100)) / 400 = 3
slapd should start on castor, and fail over to pollux if it fails
3 times, right?
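The arithmetic above can be written out as a small function. This is
a sketch of the relation as quoted in this thread only; it has not
been checked against what the policy engine actually computes, and
the function and parameter names are invented for the example.

```python
def failures_before_failover(preferred_score, fallback_score,
                             resource_stickiness, failure_stickiness,
                             resources_in_group=1):
    """Failures after which the fallback node wins, per the relation
    quoted in this thread (not verified against the policy engine)."""
    # Advantage of the preferred node: the score difference plus the
    # stickiness contributed by every resource in the group.
    advantage = (preferred_score - fallback_score
                 + resources_in_group * resource_stickiness)
    # Each failure costs |failure_stickiness| points.
    return advantage / abs(failure_stickiness)

# Values from this message: 6 resources in the group.
print(failures_before_failover(1600, 1000, 100, -400, resources_in_group=6))
# Values from the first message of the thread: a single stickiness term.
print(failures_before_failover(1200, 1000, 100, -100))
```

Both calls give 3 under this relation, which matches the intended
"fail over after 3 failures".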
But now my resource doesn't start at all. In the log, I can see:
Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color:
Resource ldap_drbddisk cannot run anywhere
Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color:
Resource ldap_Filesystem cannot run anywhere
Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color:
Resource ldap_IPaddr_193_48_169_47 cannot run anywhere
Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color:
Resource IPv6addr_ldap cannot run anywhere
Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color:
Resource ldap_slapd cannot run anywhere
Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color:
Resource ldap_MailTo cannot run anywhere
What am I doing wrong? Is there exhaustive documentation
about scores and location? It's complex to do, but what I want
is really simple and common: "start on this node first and
fail over to the other node if you fail 3 times".
Log and config attached.
Log is OK. Several things in there:
Apr 24 17:54:53 pollux tengine: [4168]: ERROR:
stonithd_op_result_ready: failed due to not on signon status.
Apr 24 17:54:53 pollux tengine: [4168]: ERROR:
tengine_stonith_connection_destroy: Fencing daemon has left us
Apr 24 17:54:53 pollux heartbeat: [3574]: ERROR: Exiting
/usr/lib64/heartbeat/stonithd process 3668 dumped core
Perhaps you could give us a backtrace from this core dump.
Apr 24 17:56:08 pollux drbd: ERROR: Module drbd does not exist in
/proc/modules
A drbd setup problem?
Apr 24 17:54:53 pollux pengine: [4169]: ERROR: can_run_resources:
No node supplied
This is an interesting message. Can't find it in the development
code.
Apr 24 18:00:26 pollux IPv6addr: [4365]: ERROR: no valid mecahnisms
Apr 24 18:00:26 pollux crmd: [3647]: ERROR: process_lrm_event: LRM
operation IPv6addr_ldap_start_0 (call=22, rc=1) Error unknown error
Can I suggest to first have a regular working configuration which
includes all the resources and a sane and well behaving cluster?
Then we see if fiddling with constraints gives any results.
Thanks.
Thanks !
Ben
Dejan Muhamedagic wrote:
On Fri, Apr 20, 2007 at 03:04:56PM +0200, Benjamin Watine wrote:
Hi list,
I'm trying to set location constraints for 2 resource groups,
but I don't understand very well how they work.
I want to define a preferred node for each group, and tell
Heartbeat to move the group to the other node if 3 resource
failures (and restarts) occur.
So I defined default-resource-stickiness at 100,
default-resource-failure-stickiness at -100, and put a score of
1200 on the preferred node and 1000 on the "second" node
((1200-1000+100)/100 = 3).
I'm trying to do this for 2 groups. If 3 failures occur for a
resource of a group, the whole group has to be moved to the
other node. Can I configure a group location constraint as for a
resource? How can I get the group failcount (if that makes sense)?
I doubt that you can. The failcounts are only on a per primitive
basis. Groups are just shorthand for order and colocation
constraints. However, if you choose a resource which matters the
most to you (ldap/web service) and make location constraints on
them, the other resources in the group will follow should it move
to another node.
... and nothing works :p The resource group doesn't start on the
right node, and never fails over if I manually stop a resource
of the group 3 times.
Some light on these location constraints would be greatly
appreciated!
I'm afraid that I can't offer help on calculating the scores. There
was, however, extensive discussion of the matter on the list
several months or a year ago. Perhaps you could search this list's
archives.
cibadmin -Ql attached.
I can only see in the status that the failcount for httpd_web is 5.
Please include the logs, etc, see:
http://linux-ha.org/ReportingProblems
Thank you, in advance.
Ben
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
/var/lib/heartbeat/cores# file root/*
root/core.10917: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.11711: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.12305: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.12407: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'IPv6addr'
root/core.13407: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'pidof'
root/core.14899: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.17390: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'pidof'
root/core.18875: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'pidof'
root/core.19244: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'pidof'
root/core.20147: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.20441: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'IPv6addr'
root/core.21125: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'IPv6addr'
root/core.22022: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.24140: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'pidof'
root/core.24525: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.25783: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'IPv6addr'
root/core.26996: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'pidof'
root/core.29184: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'pidof'
root/core.31357: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'IPv6addr'
root/core.3898: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.3938: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.3957: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.3961: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.3963: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.3964: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.3966: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.3967: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.3968: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.3969: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.3974: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.3976: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.3977: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.3988: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.4002: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.4053: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.4256: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'IPv6addr'
root/core.4259: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'IPv6addr'
root/core.4290: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.4310: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'IPv6addr'
root/core.4355: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'IPv6addr'
root/core.4362: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'IPv6addr'
root/core.4363: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'IPv6addr'
root/core.4368: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'IPv6addr'
root/core.4639: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'pidof'
root/core.490: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'pidof'
root/core.4990: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'pidof'
root/core.574: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.6148: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.7763: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'pidof'
root/core.8308: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'stonithd'
root/core.979: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'IPv6addr'
root/core.983: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV),
SVR4-style, from 'IPv6addr'