On Thu, Apr 26, 2007 at 10:54:36AM +0200, Benjamin Watine wrote:
> Dejan Muhamedagic wrote:
> >On Wed, Apr 25, 2007 at 05:59:12PM +0200, Benjamin Watine wrote:
> >>Dejan Muhamedagic wrote:
> >>>On Wed, Apr 25, 2007 at 11:59:02AM +0200, Benjamin Watine wrote:
> >>>>You were right: it wasn't a score problem, but my IPv6 resource that
> >>>>caused an error and left the resource group unstarted.
> >>>>
> >>>>Without IPv6, all is OK; the behaviour of Heartbeat fits my needs (start
> >>>>on the preferred node (castor), and fail over after 3 failures). So my
> >>>>problem is IPv6 now.
> >>>>
> >>>>The script seems to have a problem:
> >>>>
> >>>># /etc/ha.d/resource.d/IPv6addr 2001:660:6301:301::47:1 start
> >>>>*** glibc detected *** free(): invalid next size (fast): 
> >>>>0x000000000050d340 ***
> >>>>/etc/ha.d/resource.d//hto-mapfuncs: line 51:  4764 Aborted 
> >>>>  $__SCRIPT_NAME start
> >>>>2007/04/25_11:43:29 ERROR:  Unknown error: 134
> >>>>ERROR:  Unknown error: 134
> >>>>
> >>>>However, ifconfig shows that IPv6 is correctly configured, even though
> >>>>the script exited with an error code.
> >>>IPv6addr aborts, hence the exit code 134 (128+signo). Somebody
> >>>recently posted a set of patches for IPv6addr... Right, I'm cc-ing
> >>>this to Horms.
> >>>
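(Side note: an exit status above 128 is 128 plus a signal number, so you
can recover the signal name from the shell:

$ kill -l $((134 - 128))
ABRT

i.e. the process died of SIGABRT, which is what glibc raises when it
detects heap corruption such as the free() error above.)
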
> >>Thank you so much; I'll wait for Horms, then. I'll also take a look at
> >>the list archives.
> >
> >BTW, wasn't there also a core dump in this case? Could you do an
> >ls -R /var/lib/heartbeat/cores and check?
> >
> 
> I don't know how to find the core dump :/ In this case, should it be
> core.22560?

Some newer releases of file(1) show the name of the program that dumped
the core:

$ file core.6468 
core.6468: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, 
from 'gaim'

Also, you can match the timestamps of the core files against the logs.
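
For example (assuming heartbeat logs to /var/log/ha-log; adjust if your
ha.cf sends the logs elsewhere):

$ ls -lrt /var/lib/heartbeat/cores/root    # core files, oldest first
$ grep 'dumped core' /var/log/ha-log       # the matching log entries

The "Exiting ... process NNN dumped core" messages carry the PID, which is
also the suffix of the core file name (e.g. core.3668 below).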

> # /etc/ha.d/resource.d/IPv6addr 2001:660:6301:301::47:1 start
> *** glibc detected *** free(): invalid next size (fast): 
> 0x000000000050d340 ***
> /etc/ha.d/resource.d//hto-mapfuncs: line 51: 22560 Aborted 
>    $__SCRIPT_NAME start
> 2007/04/26_10:46:38 ERROR:  Unknown error: 134
> ERROR:  Unknown error: 134
> [EMAIL PROTECTED] ls -R /var/lib/heartbeat/cores
> /var/lib/heartbeat/cores:
> hacluster  nobody  root
> 
> /var/lib/heartbeat/cores/hacluster:
> core.3620  core.4116  core.4119  core.4123  core.5262  core.5265 
> core.5269  core.5272
> core.3626  core.4117  core.4121  core.4124  core.5263  core.5266  core.5270
> core.3829  core.4118  core.4122  core.5256  core.5264  core.5268  core.5271
> 
> /var/lib/heartbeat/cores/nobody:
> 
> /var/lib/heartbeat/cores/root:
> core.10766  core.21816  core.29951  core.3642  core.3650  core.3658 
> core.3667  core.4471
> core.11379  core.23505  core.30813  core.3643  core.3651  core.3661 
> core.3668  core.4550
> core.11592  core.24403  core.31033  core.3645  core.3652  core.3663 
> core.4234  core.5104
> core.12928  core.24863  core.3489   core.3647  core.3653  core.3664 
> core.4371  core.5761
> core.15849  core.25786  core.3591   core.3648  core.3654  core.3665 
> core.4394  core.6130
> core.21501  core.28286  core.3610   core.3649  core.3657  core.3666 
> core.4470
> [EMAIL PROTECTED]

Well, you have quite a few. Let's hope that they stem from only
those two errors.

I'm attaching a script (gethbbt.sh) which should generate backtraces from
all of your core files. It's only been lightly tested, but it should work.
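
In essence it runs gdb in batch mode over every core file, along these
lines (a simplified sketch, not the attached gethbbt.sh itself):

#!/bin/sh
# Print a backtrace for every core file under /var/lib/heartbeat/cores.
echo bt > /tmp/bt.gdb
for core in `find /var/lib/heartbeat/cores -type f -name 'core.*'`; do
    # file(1) reports the program name, e.g. "from 'stonithd'"
    prog=`file $core | sed -n "s/.*from '\([^']*\)'.*/\1/p"`
    bin=`which $prog 2>/dev/null`
    # heartbeat daemons usually live here on x86_64 if not in $PATH
    [ -x "$bin" ] || bin=/usr/lib64/heartbeat/$prog
    echo "== $core ($prog) =="
    gdb -batch -x /tmp/bt.gdb "$bin" "$core" 2>/dev/null
done
rm -f /tmp/bt.gdb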

> >>>># ifconfig
> >>>>eth0      Link encap:Ethernet  HWaddr 00:13:72:58:74:5F
> >>>>         inet addr:193.48.169.46  Bcast:193.48.169.63  Mask:255.255.255.224
> >>>>         inet6 addr: 2001:660:6301:301:213:72ff:fe58:745f/64 Scope:Global
> >>>>         inet6 addr: fe80::213:72ff:fe58:745f/64 Scope:Link
> >>>>         inet6 addr: 2001:660:6301:301::47:1/64 Scope:Global
> >>>>         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >>>>         RX packets:3788 errors:0 dropped:0 overruns:0 frame:0
> >>>>         TX packets:3992 errors:0 dropped:0 overruns:0 carrier:0
> >>>>         collisions:0 txqueuelen:1000
> >>>>         RX bytes:450820 (440.2 KiB)  TX bytes:844188 (824.4 KiB)
> >>>>         Base address:0xecc0 Memory:fe6e0000-fe700000
> >>>>
> >>>>And if I run the script again, no error is returned:
> >>>>
> >>>># /etc/ha.d/resource.d/IPv6addr 2001:660:6301:301::47:1 start
> >>>>2007/04/25_11:45:23 INFO:  Success
> >>>>INFO:  Success
> >
> >So, you're saying that once the resource is running, starting it
> >again doesn't produce an error? Did you also try to stop it and
> >start it from the stopped state?
> >
> 
> Yes, but probably because the script just checks that the IPv6 address is
> set, and so doesn't try to set it again. If I stop it and start it again,
> the error occurs.
> 
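That would explain it: if the address is already configured, the start
action presumably takes an early "already running" exit and never reaches
the buggy code. Conceptually something like this (a sketch, not the actual
IPv6addr source):

# report success and do nothing if the address is already present
if ip -6 addr show | grep -q '2001:660:6301:301::47:1'; then
    exit 0
fi

Stopping removes the address, so the next start goes down the code path
that corrupts the heap.
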
> >>>>As for the other errors: I've disabled stonith for the moment, and DRBD
> >>>>is built into the kernel, so the drbd module is not needed. I've seen
> >>>>this message, but it's not a problem.
> >>>There's a small problem with the stonith suicide agent, which
> >>>renders it unusable, but it is soon to be fixed.
> >>>
> >>OK, that's what I had read on this list, but I wasn't sure. Is there
> >>any patch now?
> >>
> >>>>I attached the log and config, and the core file from stonith
> >>>>(/var/lib/heartbeat/cores/root/core.3668). Is that what you asked for
> >>>>(the backtrace from the stonith core dump)?
> >>>You shouldn't send core dumps to a public list: they may contain
> >>>sensitive information. The backtrace I asked for can be obtained like
> >>>this:
> >>>
> >>>$ gdb /usr/lib64/heartbeat/stonithd core.3668
> >>>(gdb) bt
> >>>...  < here comes the backtrace
> >>>(gdb) quit
> >>>
> >>Oops! Here it is:
> >>
> >>#0  0x00000039b9d03507 in stonith_free_hostlist () from 
> >>/usr/lib64/libstonith.so.1
> >>#1  0x0000000000408a95 in ?? ()
> >>#2  0x0000000000407fee in ?? ()
> >>#3  0x00000000004073c3 in ?? ()
> >>#4  0x000000000040539d in ?? ()
> >>#5  0x0000000000405015 in ?? ()
> >>#6  0x00000039b950abd4 in G_CH_dispatch_int () from 
> >>/usr/lib64/libplumb.so.1
> >>#7  0x0000003a12a266bd in g_main_context_dispatch () from 
> >>/usr/lib64/libglib-2.0.so.0
> >>#8  0x0000003a12a28397 in g_main_context_acquire () from 
> >>/usr/lib64/libglib-2.0.so.0
> >>#9  0x0000003a12a28735 in g_main_loop_run () from 
> >>/usr/lib64/libglib-2.0.so.0
> >>#10 0x000000000040341a in ?? ()
> >>#11 0x0000003a0fd1c4ca in __libc_start_main () from /lib64/tls/libc.so.6
> >>#12 0x000000000040303a in ?? ()
> >>#13 0x00007fff0f04b8d8 in ?? ()
> >>#14 0x000000000000001c in ?? ()
> >>#15 0x0000000000000001 in ?? ()
> >>#16 0x00007fff0f04cb73 in ?? ()
> >>#17 0x0000000000000000 in ?? ()
> >>
> >>Thanks for taking the time to explain some basics to me...
> >
> >You're welcome. As Andrew suggested, you should file bugs for both of
> >these. Interestingly, all those question marks mean that some debugging
> >info is missing, yet there is some for libplumb.so. Odd. Where did you
> >say your heartbeat package came from?
> >
> heartbeat-2.0.8-2.el4.centos.x86_64.rpm
> heartbeat-gui-2.0.8-2.el4.centos.x86_64.rpm
> heartbeat-pils-2.0.8-2.el4.centos.x86_64.rpm
> heartbeat-stonith-2.0.8-2.el4.centos.x86_64.rpm
> 
> From here, IIRC:
> http://dev.centos.org/centos/4/testing/x86_64/RPMS/
> 
> 
> >>Ben
> >>
> >>
> >>>Thanks.
> >>>
> >>>>Thanks a lot for helping.
> >>>>
> >>>>Ben
> >>>>
> >>>>Dejan Muhamedagic wrote:
> >>>>>On Tue, Apr 24, 2007 at 06:36:04PM +0200, Benjamin Watine wrote:
> >>>>>>Thank you, Dejan, for replying (I feel less alone now!)
> >>>>>That's good.
> >>>>>
> >>>>>>I've applied the location constraint to only one resource (slapd) as
> >>>>>>you suggested, but it still doesn't work as expected.
> >>>>>>
> >>>>>>I've read in the list archives that I have to sum the stickiness of all
> >>>>>>the resources in the group to get the real calculated resource
> >>>>>>stickiness. That makes sense, so let's do it:
> >>>>>>
> >>>>>>resource-stickiness at 100,
> >>>>>>resource-failure-stickiness at -400
> >>>>>>score on slapd on node1 (castor) at 1600
> >>>>>>score on slapd on node2 (pollux) at 1000
> >>>>>>
> >>>>>>If I apply the given relation (I have 6 resources in my group):
> >>>>>>((1600 - 1000) + (6 * 100)) / 400 = 3
> >>>>>>
> >>>>>>slapd should start on castor, and fail over to pollux if it fails 3
> >>>>>>times, right?
> >>>>>>
> >>>>>>But now my resource doesn't start at all. In the log, I can see:
> >>>>>>
> >>>>>>Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource 
> >>>>>>ldap_drbddisk cannot run anywhere
> >>>>>>Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource 
> >>>>>>ldap_Filesystem cannot run anywhere
> >>>>>>Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource 
> >>>>>>ldap_IPaddr_193_48_169_47 cannot run anywhere
> >>>>>>Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource 
> >>>>>>IPv6addr_ldap cannot run anywhere
> >>>>>>Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource 
> >>>>>>ldap_slapd cannot run anywhere
> >>>>>>Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource 
> >>>>>>ldap_MailTo cannot run anywhere
> >>>>>>
> >>>>>>What am I doing wrong?? Is there any exhaustive documentation about
> >>>>>>scores and locations? It's complex to do, but what I want is really
> >>>>>>simple and common: "start on this node first, and fail over to the
> >>>>>>other node after 3 failures".
> >>>>>>
> >>>>>>Log and config attached.
> >>>>>Log is OK. Several things in there:
> >>>>>
> >>>>>Apr 24 17:54:53 pollux tengine: [4168]: ERROR: 
> >>>>>stonithd_op_result_ready: failed due to not on signon status.
> >>>>>Apr 24 17:54:53 pollux tengine: [4168]: ERROR: 
> >>>>>tengine_stonith_connection_destroy: Fencing daemon has left us
> >>>>>Apr 24 17:54:53 pollux heartbeat: [3574]: ERROR: Exiting 
> >>>>>/usr/lib64/heartbeat/stonithd process 3668 dumped core
> >>>>>
> >>>>>Perhaps you could give us a backtrace from this core dump.
> >>>>>
> >>>>>Apr 24 17:56:08 pollux drbd: ERROR: Module drbd does not exist in 
> >>>>>/proc/modules
> >>>>>
> >>>>>A drbd setup problem?
> >>>>>
> >>>>>Apr 24 17:54:53 pollux pengine: [4169]: ERROR: can_run_resources: No 
> >>>>>node supplied
> >>>>>
> >>>>>This is an interesting message. Can't find it in the development code.
> >>>>>
> >>>>>Apr 24 18:00:26 pollux IPv6addr: [4365]: ERROR: no valid mecahnisms
> >>>>>Apr 24 18:00:26 pollux crmd: [3647]: ERROR: process_lrm_event: LRM 
> >>>>>operation IPv6addr_ldap_start_0 (call=22, rc=1) Error unknown error
> >>>>>
> >>>>>Can I suggest first getting a regular working configuration which
> >>>>>includes all the resources and a sane, well-behaved cluster? Then we
> >>>>>can see whether fiddling with constraints gives any results.
> >>>>>
> >>>>>Thanks.
> >>>>>
> >>>>>>Thanks!
> >>>>>>
> >>>>>>Ben
> >>>>>>
> >>>>>>Dejan Muhamedagic wrote:
> >>>>>>>On Fri, Apr 20, 2007 at 03:04:56PM +0200, Benjamin Watine wrote:
> >>>>>>>>Hi list,
> >>>>>>>>
> >>>>>>>>I'm trying to set location constraints for 2 resource groups, but I
> >>>>>>>>don't understand very well how they work.
> >>>>>>>>I want to define a preferred node for each group, and tell Heartbeat
> >>>>>>>>to move the group to the other node if 3 resource failures (and
> >>>>>>>>restarts) occur.
> >>>>>>>>
> >>>>>>>>So, I defined default-resource-stickiness at 100,
> >>>>>>>>default-resource-failure-stickiness at -100, and put a score of 1200
> >>>>>>>>on the preferred node and 1000 on the "second" node
> >>>>>>>>((1200 - 1000 + 100) / 100 = 3).
> >>>>>>>>
> >>>>>>>>I'm trying to do this for 2 groups. If 3 failures occur for a
> >>>>>>>>resource of a group, the whole group has to be moved to the other
> >>>>>>>>node. Can I configure a group location constraint as for a resource?
> >>>>>>>>How can I get a group failcount (if that makes sense)?
> >>>>>>>I doubt that you can. Failcounts are kept only on a per-primitive
> >>>>>>>basis. Groups are just shorthand for order and colocation
> >>>>>>>constraints. However, if you choose the resource which matters most
> >>>>>>>to you (the ldap/web service) and put location constraints on it, the
> >>>>>>>other resources in the group will follow should it move to another
> >>>>>>>node.
> >>>>>>>
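(Per-primitive failcounts can be queried with crm_failcount; a hypothetical
example for the slapd resource on node castor, using the heartbeat 2.x
option letters as I remember them -- check the usage message of your
release:

# crm_failcount -G -U castor -r ldap_slapd
)
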
> >>>>>>>>... and nothing works :p The resource group doesn't start on the
> >>>>>>>>right node, and never fails over if I manually stop a resource of
> >>>>>>>>the group 3 times.
> >>>>>>>>
> >>>>>>>>Some light on these location constraints would be greatly
> >>>>>>>>appreciated!
> >>>>>>>I'm afraid that I can't offer help on calculating the scores. There
> >>>>>>>has been, however, extensive discussion on the list on the matter
> >>>>>>>several months or a year ago. Perhaps you could search this list's
> >>>>>>>archives.
> >>>>>>>
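(For reference, a node preference in the version 2 CIB is a rsc_location
constraint with a score rule; a sketch with made-up ids, loaded with
cibadmin:

# cibadmin -C -o constraints -X \
  '<rsc_location id="loc_slapd" rsc="ldap_slapd">
     <rule id="loc_slapd_rule" score="1600">
       <expression id="loc_slapd_expr" attribute="#uname" operation="eq" value="castor"/>
     </rule>
   </rsc_location>'
)
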
> >>>>>>>>cibadmin -Ql attached.
> >>>>>>>I can only see in the status that the failcount for httpd_web is 5.
> >>>>>>>
> >>>>>>>Please include the logs etc.; see:
> >>>>>>>
> >>>>>>> http://linux-ha.org/ReportingProblems
> >>>>>>>
> >>>>>>>>Thank you, in advance.
> >>>>>>>>
> >>>>>>>>Ben

-- 
Dejan

Attachment: gethbbt.sh
Description: Bourne shell script

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
