On Thu, Apr 26, 2007 at 10:54:36AM +0200, Benjamin Watine wrote:
> Dejan Muhamedagic wrote:
> >On Wed, Apr 25, 2007 at 05:59:12PM +0200, Benjamin Watine wrote:
> >>Dejan Muhamedagic wrote:
> >>>On Wed, Apr 25, 2007 at 11:59:02AM +0200, Benjamin Watine wrote:
> >>>>You were right, it wasn't a score problem, but my IPv6 resource that
> >>>>causes an error and leaves the resource group unstarted.
> >>>>
> >>>>Without IPv6, all is OK; the behaviour of Heartbeat fits my needs (start on
> >>>>the preferred node (castor), and fail over after 3 failures). So my problem
> >>>>is IPv6 now.
> >>>>
> >>>>The script seems to have a problem:
> >>>>
> >>>># /etc/ha.d/resource.d/IPv6addr 2001:660:6301:301::47:1 start
> >>>>*** glibc detected *** free(): invalid next size (fast): 0x000000000050d340 ***
> >>>>/etc/ha.d/resource.d//hto-mapfuncs: line 51:  4764 Aborted    $__SCRIPT_NAME start
> >>>>2007/04/25_11:43:29 ERROR: Unknown error: 134
> >>>>ERROR: Unknown error: 134
> >>>>
> >>>>But now ifconfig shows that IPv6 is configured correctly, although the
> >>>>script exits with an error code.
> >>>IPv6addr aborts, hence the exit code 134 (128+signo). Somebody
> >>>recently posted a set of patches for IPv6addr... Right, I'm cc-ing
> >>>this to Horms.
> >>>
> >>Thank you so much, I'll wait for Horms then. I'll also take a look at the
> >>list archives.
> >
> >BTW, wasn't there also a core dump for this case? Could you do
> >a ls -R /var/lib/heartbeat/cores and check.
> >
> 
> I don't know how to find the core dump :/ In this case, should it be
> core.22560?
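Exit code 134 decodes as 128 + 6, and signal 6 is SIGABRT, which matches glibc aborting IPv6addr after detecting the corrupted heap in free(). The signal name can be confirmed from any shell:

$ kill -l 6
ABRT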
Some newer releases of file(1) show the program name which dumped the core:

$ file core.6468
core.6468: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'gaim'

Also, you can match the timestamps of the core files against those in the logs.

> # /etc/ha.d/resource.d/IPv6addr 2001:660:6301:301::47:1 start
> *** glibc detected *** free(): invalid next size (fast): 0x000000000050d340 ***
> /etc/ha.d/resource.d//hto-mapfuncs: line 51: 22560 Aborted    $__SCRIPT_NAME start
> 2007/04/26_10:46:38 ERROR: Unknown error: 134
> ERROR: Unknown error: 134
> [EMAIL PROTECTED] ls -R /var/lib/heartbeat/cores
> /var/lib/heartbeat/cores:
> hacluster  nobody  root
>
> /var/lib/heartbeat/cores/hacluster:
> core.3620  core.4116  core.4119  core.4123  core.5262  core.5265  core.5269  core.5272
> core.3626  core.4117  core.4121  core.4124  core.5263  core.5266  core.5270
> core.3829  core.4118  core.4122  core.5256  core.5264  core.5268  core.5271
>
> /var/lib/heartbeat/cores/nobody:
>
> /var/lib/heartbeat/cores/root:
> core.10766  core.21816  core.29951  core.3642  core.3650  core.3658  core.3667  core.4471
> core.11379  core.23505  core.30813  core.3643  core.3651  core.3661  core.3668  core.4550
> core.11592  core.24403  core.31033  core.3645  core.3652  core.3663  core.4234  core.5104
> core.12928  core.24863  core.3489   core.3647  core.3653  core.3664  core.4371  core.5761
> core.15849  core.25786  core.3591   core.3648  core.3654  core.3665  core.4394  core.6130
> core.21501  core.28286  core.3610   core.3649  core.3657  core.3666  core.4470
> [EMAIL PROTECTED]

Well, you have quite a few. Let's hope that they all stem from only those two errors. I'll attach a script which should generate backtraces from all of your core files. It's been lightly tested but should work.

> >>>># ifconfig
> >>>>eth0      Lien encap:Ethernet  HWaddr 00:13:72:58:74:5F
> >>>>          inet adr:193.48.169.46  Bcast:193.48.169.63  Masque:255.255.255.224
> >>>>          adr inet6: 2001:660:6301:301:213:72ff:fe58:745f/64 Scope:Global
> >>>>          adr inet6: fe80::213:72ff:fe58:745f/64 Scope:Lien
> >>>>          adr inet6: 2001:660:6301:301::47:1/64 Scope:Global
> >>>>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >>>>          RX packets:3788 errors:0 dropped:0 overruns:0 frame:0
> >>>>          TX packets:3992 errors:0 dropped:0 overruns:0 carrier:0
> >>>>          collisions:0 lg file transmission:1000
> >>>>          RX bytes:450820 (440.2 KiB)  TX bytes:844188 (824.4 KiB)
> >>>>          Adresse de base:0xecc0 Mémoire:fe6e0000-fe700000
> >>>>
> >>>>And if I launch the script again, no error is returned:
> >>>>
> >>>># /etc/ha.d/resource.d/IPv6addr 2001:660:6301:301::47:1 start
> >>>>2007/04/25_11:45:23 INFO: Success
> >>>>INFO: Success
> >
> >So, you're saying that once the resource is running, starting it
> >again doesn't produce an error? Did you also try to stop it and
> >start it from the stopped state?
> >
>
> Yes, but probably only because the script just checks that the IPv6 address is
> set, and so doesn't try to set it again. If I stop and start it again, the
> error occurs.
>
> >>>>For the other errors, I disabled stonith for the moment, and DRBD is built
> >>>>into the kernel, so the drbd module is not needed. I've seen this message,
> >>>>but it's not a problem.
> >>>There's a small problem with the stonith suicide agent, which
> >>>renders it unusable, but it is soon to be fixed.
> >>>
> >>OK, that's what I had read on this list, but I wasn't sure. Is there
> >>any patch now?
> >>
> >>>>I attached the log and config, and the core file from stonith
> >>>>(/var/lib/heartbeat/cores/root/core.3668). Is it what you asked for
> >>>>(a backtrace from the stonith core dump)?
> >>>You shouldn't be sending core dumps to a public list: they may
> >>>contain sensitive information. What I asked for, a backtrace, you
> >>>get like this:
> >>>
> >>>$ gdb /usr/lib64/heartbeat/stonithd core.3668
> >>>(gdb) bt
> >>>... < here comes the backtrace >
> >>>(gdb) quit
> >>>
> >>Ooops! Here it is:
> >>
> >>#0  0x00000039b9d03507 in stonith_free_hostlist () from /usr/lib64/libstonith.so.1
> >>#1  0x0000000000408a95 in ?? ()
> >>#2  0x0000000000407fee in ?? ()
> >>#3  0x00000000004073c3 in ?? ()
> >>#4  0x000000000040539d in ?? ()
> >>#5  0x0000000000405015 in ?? ()
> >>#6  0x00000039b950abd4 in G_CH_dispatch_int () from /usr/lib64/libplumb.so.1
> >>#7  0x0000003a12a266bd in g_main_context_dispatch () from /usr/lib64/libglib-2.0.so.0
> >>#8  0x0000003a12a28397 in g_main_context_acquire () from /usr/lib64/libglib-2.0.so.0
> >>#9  0x0000003a12a28735 in g_main_loop_run () from /usr/lib64/libglib-2.0.so.0
> >>#10 0x000000000040341a in ?? ()
> >>#11 0x0000003a0fd1c4ca in __libc_start_main () from /lib64/tls/libc.so.6
> >>#12 0x000000000040303a in ?? ()
> >>#13 0x00007fff0f04b8d8 in ?? ()
> >>#14 0x000000000000001c in ?? ()
> >>#15 0x0000000000000001 in ?? ()
> >>#16 0x00007fff0f04cb73 in ?? ()
> >>#17 0x0000000000000000 in ?? ()
> >>
> >>Thanks for taking the time to explain some basics to me...
> >
> >You're welcome. As Andrew suggested, you should file bugs for both
> >of these. Interestingly, all those question marks mean that some
> >debugging info is missing, but then there is some in libplumb.so.
> >Odd. Where did you say your heartbeat package comes from?
> >
>
> heartbeat-2.0.8-2.el4.centos.x86_64.rpm
> heartbeat-gui-2.0.8-2.el4.centos.x86_64.rpm
> heartbeat-pils-2.0.8-2.el4.centos.x86_64.rpm
> heartbeat-stonith-2.0.8-2.el4.centos.x86_64.rpm
>
> From here IIRC:
> http://dev.centos.org/centos/4/testing/x86_64/RPMS/
>
>
> >>Ben
> >>
> >>
> >>>Thanks.
> >>>
> >>>>Thanks a lot for helping.
> >>>>
> >>>>Ben
> >>>>
> >>>>Dejan Muhamedagic wrote:
> >>>>>On Tue, Apr 24, 2007 at 06:36:04PM +0200, Benjamin Watine wrote:
> >>>>>>Thank you, Dejan, for replying (I feel less alone now!)
> >>>>>That's good.
> >>>>>
> >>>>>>I've applied the location constraint to only one resource (slapd) as you
> >>>>>>suggested, but it still doesn't work as expected.
> >>>>>>
> >>>>>>I've read in the list archive that I have to sum the stickiness of all
> >>>>>>resources in the group to get the real calculated resource
> >>>>>>stickiness. It makes sense, let's do it:
> >>>>>>
> >>>>>>resource-stickiness at 100,
> >>>>>>resource-failure-stickiness at -400
> >>>>>>score on slapd on node1 (castor) at 1600
> >>>>>>score on slapd on node2 (pollux) at 1000
> >>>>>>
> >>>>>>If I apply the given relation (I have 6 resources in my group):
> >>>>>>((1600 - 1000) + (6 * 100)) / 400 = 3
> >>>>>>
> >>>>>>slapd should start on castor, and fail over to pollux if it fails 3
> >>>>>>times, right?
> >>>>>>
> >>>>>>But now my resource doesn't start at all.
> >>>>>>In the log, I can see:
> >>>>>>
> >>>>>>Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource ldap_drbddisk cannot run anywhere
> >>>>>>Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource ldap_Filesystem cannot run anywhere
> >>>>>>Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource ldap_IPaddr_193_48_169_47 cannot run anywhere
> >>>>>>Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource IPv6addr_ldap cannot run anywhere
> >>>>>>Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource ldap_slapd cannot run anywhere
> >>>>>>Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource ldap_MailTo cannot run anywhere
> >>>>>>
> >>>>>>What am I doing wrong? Is there any exhaustive documentation about
> >>>>>>scores and locations? It's complex to do, but what I want is really
> >>>>>>simple and common: "start on this node first, and fail over to the other
> >>>>>>node after 3 failures".
> >>>>>>
> >>>>>>Log and config attached.
> >>>>>Log is OK. Several things in there:
> >>>>>
> >>>>>Apr 24 17:54:53 pollux tengine: [4168]: ERROR: stonithd_op_result_ready: failed due to not on signon status.
> >>>>>Apr 24 17:54:53 pollux tengine: [4168]: ERROR: tengine_stonith_connection_destroy: Fencing daemon has left us
> >>>>>Apr 24 17:54:53 pollux heartbeat: [3574]: ERROR: Exiting /usr/lib64/heartbeat/stonithd process 3668 dumped core
> >>>>>
> >>>>>Perhaps you could give us a backtrace from this core dump.
> >>>>>
> >>>>>Apr 24 17:56:08 pollux drbd: ERROR: Module drbd does not exist in /proc/modules
> >>>>>
> >>>>>A drbd setup problem?
> >>>>>
> >>>>>Apr 24 17:54:53 pollux pengine: [4169]: ERROR: can_run_resources: No node supplied
> >>>>>
> >>>>>This is an interesting message. I can't find it in the development code.
> >>>>>
> >>>>>Apr 24 18:00:26 pollux IPv6addr: [4365]: ERROR: no valid mecahnisms
> >>>>>Apr 24 18:00:26 pollux crmd: [3647]: ERROR: process_lrm_event: LRM operation IPv6addr_ldap_start_0 (call=22, rc=1) Error unknown error
> >>>>>
> >>>>>May I suggest first getting a regular working configuration which
> >>>>>includes all the resources, and a sane, well-behaved cluster?
> >>>>>Then we can see if fiddling with constraints gives any results.
> >>>>>
> >>>>>Thanks.
> >>>>>
> >>>>>>Thanks!
> >>>>>>
> >>>>>>Ben
> >>>>>>
> >>>>>>Dejan Muhamedagic wrote:
> >>>>>>>On Fri, Apr 20, 2007 at 03:04:56PM +0200, Benjamin Watine wrote:
> >>>>>>>>Hi list,
> >>>>>>>>
> >>>>>>>>I'm trying to set location constraints for 2 resource groups, but I
> >>>>>>>>don't understand very well how they work.
> >>>>>>>>I want to define a preferred node for each group, and tell Heartbeat
> >>>>>>>>to move the group to the other node if 3 resource failures (and
> >>>>>>>>restarts) occur.
> >>>>>>>>
> >>>>>>>>So I defined default-resource-stickiness at 100,
> >>>>>>>>default-resource-failure-stickiness at -100, and put a score of
> >>>>>>>>1200 on the preferred node, and 1000 on the "second" node.
> >>>>>>>>((1200-1000+100)/100 = 3).
> >>>>>>>>
> >>>>>>>>I'm trying to do this for 2 groups. If 3 failures occur for a
> >>>>>>>>resource of a group, the whole group has to be moved to the other
> >>>>>>>>node. Can I configure a group location constraint as for a resource?
> >>>>>>>>How can I get the group failcount (if that makes sense)?
> >>>>>>>I doubt that you can. The failcounts are only on a per primitive
> >>>>>>>basis.
> >>>>>>>Groups are just shorthand for order and colocation constraints.
> >>>>>>>However, if you choose the resource which matters the most to you
> >>>>>>>(the ldap/web service) and make location constraints on it, the
> >>>>>>>other resources in the group will follow should it move to
> >>>>>>>another node.
> >>>>>>>
> >>>>>>>>... and nothing works :p The resource group doesn't start on the
> >>>>>>>>right node, and never fails over if I manually stop a resource of
> >>>>>>>>the group 3 times.
> >>>>>>>>
> >>>>>>>>Some light on these location constraints would be greatly
> >>>>>>>>appreciated!
> >>>>>>>I'm afraid that I can't offer help on calculating the scores. There
> >>>>>>>has been, however, extensive discussion on the list on the matter
> >>>>>>>several months or a year ago. Perhaps you could search this list's
> >>>>>>>archives.
> >>>>>>>
> >>>>>>>>cibadmin -Ql attached.
> >>>>>>>I can only see in the status that the failcount for httpd_web is 5.
> >>>>>>>
> >>>>>>>Please include the logs, etc, see:
> >>>>>>>
> >>>>>>>  http://linux-ha.org/ReportingProblems
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>Thank you in advance.
> >>>>>>>>
> >>>>>>>>Ben
> >>_______________________________________________
> >>Linux-HA mailing list
> >>[email protected]
> >>http://lists.linux-ha.org/mailman/listinfo/linux-ha
> >>See also: http://linux-ha.org/ReportingProblems
> >
> 
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems

--
Dejan
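For anyone following the score arithmetic above: the rule of thumb worked out in the thread is that a group moves once failcount x |resource-failure-stickiness| reaches the gap between the two location scores plus the summed resource-stickiness of the whole group. With the numbers quoted for the slapd group (scores 1600 vs 1000, six resources at stickiness 100, failure-stickiness -400), a quick shell check of the expected failure count:

$ echo $(( (1600 - 1000 + 6 * 100) / 400 ))
3

How the policy engine actually combines these values varies between Heartbeat 2.0.x releases, so treat this as the thread's rule of thumb rather than the engine's exact formula.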
gethbbt.sh
Description: Bourne shell script
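The attached gethbbt.sh is not reproduced in this archive. As a rough sketch of what a backtrace-gathering script along these lines might look like (the use of file(1) to find the owning program, the search paths, and the gdb invocation are assumptions here, not the contents of the actual attachment):

#!/bin/sh
# Sketch: print a gdb backtrace for every core file under the
# heartbeat core directories. Relies on file(1) reporting the program
# name, as in: "... from 'stonithd'".
COREDIR=${1:-/var/lib/heartbeat/cores}

# gdb command file with the single command we want to run in batch mode.
CMDS=/tmp/gethbbt-cmds.$$
echo bt > "$CMDS"
trap 'rm -f "$CMDS"' 0

find "$COREDIR" -type f -name 'core.*' 2>/dev/null |
while read -r core; do
    # Extract the program name from file(1) output.
    prog=`file "$core" | sed -n "s/.*from '\([^']*\)'.*/\1/p"`
    # Look for the binary on PATH first, then in the heartbeat libexec dirs.
    bin=`command -v "$prog" 2>/dev/null`
    if [ -z "$bin" ]; then
        bin=`find /usr/lib64/heartbeat /usr/lib/heartbeat -name "$prog" -type f 2>/dev/null | head -n 1`
    fi
    echo "==== $core (program: ${prog:-unknown}) ===="
    if [ -n "$bin" ]; then
        gdb -batch -x "$CMDS" "$bin" "$core" 2>/dev/null
    else
        echo "could not locate a binary named '$prog', skipping"
    fi
done

Running it as root and redirecting the output to a file keeps everything in one place, e.g. ./gethbbt.sh > backtraces.txt.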
