It's probably worth creating a bug report for both of these: http://old.linux-foundation.org/developer_bugzilla/enter_bug.cgi
On 4/25/07, Benjamin Watine <[EMAIL PROTECTED]> wrote:
Dejan Muhamedagic wrote:
> On Wed, Apr 25, 2007 at 11:59:02AM +0200, Benjamin Watine wrote:
>> You were right, it wasn't a score problem but my IPv6 resource that
>> causes an error and leaves the resource group unstarted.
>>
>> Without IPv6, all is OK; the behaviour of Heartbeat fits my needs
>> (start on the preferred node (castor), and fail over after 3 failures).
>> So my problem is IPv6 now.
>>
>> The script seems to have a problem:
>>
>> # /etc/ha.d/resource.d/IPv6addr 2001:660:6301:301::47:1 start
>> *** glibc detected *** free(): invalid next size (fast): 0x000000000050d340 ***
>> /etc/ha.d/resource.d//hto-mapfuncs: line 51: 4764 Aborted $__SCRIPT_NAME start
>> 2007/04/25_11:43:29 ERROR: Unknown error: 134
>> ERROR: Unknown error: 134
>>
>> But now ifconfig shows that IPv6 is well configured, although the
>> script exits with an error code.
>
> IPv6addr aborts, hence the exit code 134 (128 + signal number). Somebody
> recently posted a set of patches for IPv6addr... Right, I'm cc-ing
> this to Horms.

Thank you so much, I'll wait for Horms then. I'll also take a look at the list archives.
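[Editor's note: the "128 + signal number" convention Dejan mentions can be checked with a couple of lines of shell — a minimal sketch, not part of the original thread; the variable names are illustrative:]

```shell
# Exit statuses of 128+N mean the process was killed by signal N,
# so 134 = 128 + 6 = SIGABRT, which matches the glibc abort above.
status=134
if [ "$status" -ge 128 ]; then
  signo=$((status - 128))
  # kill -l <n> prints the name of signal number <n>
  echo "killed by signal $signo ($(kill -l "$signo"))"
fi
```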
>> # ifconfig
>> eth0      Lien encap:Ethernet  HWaddr 00:13:72:58:74:5F
>>           inet adr:193.48.169.46  Bcast:193.48.169.63  Masque:255.255.255.224
>>           adr inet6: 2001:660:6301:301:213:72ff:fe58:745f/64 Scope:Global
>>           adr inet6: fe80::213:72ff:fe58:745f/64 Scope:Lien
>>           adr inet6: 2001:660:6301:301::47:1/64 Scope:Global
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:3788 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:3992 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 lg file transmission:1000
>>           RX bytes:450820 (440.2 KiB)  TX bytes:844188 (824.4 KiB)
>>           Adresse de base:0xecc0 Mémoire:fe6e0000-fe700000
>>
>> And if I launch the script again, no error is returned:
>>
>> # /etc/ha.d/resource.d/IPv6addr 2001:660:6301:301::47:1 start
>> 2007/04/25_11:45:23 INFO: Success
>> INFO: Success
>>
>> As for the other errors, I've disabled stonith for the moment, and DRBD
>> is built into the kernel, so the drbd module is not needed. I've seen
>> that message, but it's not a problem.
>
> There's a small problem with the stonith suicide agent, which
> renders it unusable, but it is soon to be fixed.

OK, that's what I had read on this list, but I wasn't sure. Is there any patch now?

>> I attached the log and config, and a core file from stonith
>> (/var/lib/heartbeat/cores/root/core.3668). Is that what you asked for
>> (a backtrace from the stonith core dump)?
>
> You shouldn't be sending core dumps to a public list: they may
> contain sensitive information. What I asked for, a backtrace, you
> get like this:
>
> $ gdb /usr/lib64/heartbeat/stonithd core.3668
> (gdb) bt
> ... < here comes the backtrace >
> (gdb) quit

Oops! Here it is:

#0  0x00000039b9d03507 in stonith_free_hostlist () from /usr/lib64/libstonith.so.1
#1  0x0000000000408a95 in ?? ()
#2  0x0000000000407fee in ?? ()
#3  0x00000000004073c3 in ?? ()
#4  0x000000000040539d in ?? ()
#5  0x0000000000405015 in ?? ()
#6  0x00000039b950abd4 in G_CH_dispatch_int () from /usr/lib64/libplumb.so.1
#7  0x0000003a12a266bd in g_main_context_dispatch () from /usr/lib64/libglib-2.0.so.0
#8  0x0000003a12a28397 in g_main_context_acquire () from /usr/lib64/libglib-2.0.so.0
#9  0x0000003a12a28735 in g_main_loop_run () from /usr/lib64/libglib-2.0.so.0
#10 0x000000000040341a in ?? ()
#11 0x0000003a0fd1c4ca in __libc_start_main () from /lib64/tls/libc.so.6
#12 0x000000000040303a in ?? ()
#13 0x00007fff0f04b8d8 in ?? ()
#14 0x000000000000001c in ?? ()
#15 0x0000000000000001 in ?? ()
#16 0x00007fff0f04cb73 in ?? ()
#17 0x0000000000000000 in ?? ()

Thanks for taking the time to explain some basics to me...

Ben

> Thanks.
>
>> Thanks a lot for helping.
>>
>> Ben
>>
>> Dejan Muhamedagic wrote:
>>> On Tue, Apr 24, 2007 at 06:36:04PM +0200, Benjamin Watine wrote:
>>>> Thank you, Dejan, for replying (I feel less alone now!)
>>> That's good.
>>>
>>>> I've applied the location constraint to only one resource (slapd) as
>>>> you suggested, but it still doesn't work as expected.
>>>>
>>>> I've read in the list archives that I have to sum the stickiness of
>>>> all resources in the group to get the real calculated resource
>>>> stickiness. That makes sense, so let's do it:
>>>>
>>>> resource-stickiness at 100,
>>>> resource-failure-stickiness at -400
>>>> score on slapd on node1 (castor) at 1600
>>>> score on slapd on node2 (pollux) at 1000
>>>>
>>>> If I apply the given relation (I have 6 resources in my group):
>>>> ((1600 - 1000) + (6 * 100)) / 400 = 3
>>>>
>>>> slapd should start on castor, and fail over to pollux if it fails 3
>>>> times, right?
>>>>
>>>> But now my resource doesn't start at all.
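[Editor's note: the threshold arithmetic quoted above can be checked directly in the shell — a minimal sketch of the estimate discussed in the thread (advantage of the preferred node divided by the per-failure penalty); the variable names are illustrative, not Heartbeat syntax:]

```shell
# Failures tolerated before the group moves: the preferred node's
# advantage (score difference plus summed group stickiness) divided
# by the per-failure penalty |resource-failure-stickiness|.
score_castor=1600
score_pollux=1000
n_resources=6
stickiness=100
failure_penalty=400

advantage=$(( (score_castor - score_pollux) + n_resources * stickiness ))
echo $(( advantage / failure_penalty ))   # prints 3
```

The same formula reproduces the (1200 - 1000 + 100) / 100 = 3 estimate from the earlier message in the thread with a single resource.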
>>>> In the log, I can see:
>>>>
>>>> Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource
>>>> ldap_drbddisk cannot run anywhere
>>>> Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource
>>>> ldap_Filesystem cannot run anywhere
>>>> Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource
>>>> ldap_IPaddr_193_48_169_47 cannot run anywhere
>>>> Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource
>>>> IPv6addr_ldap cannot run anywhere
>>>> Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource
>>>> ldap_slapd cannot run anywhere
>>>> Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource
>>>> ldap_MailTo cannot run anywhere
>>>>
>>>> What am I doing wrong?? Is there exhaustive documentation about
>>>> scores and locations? It's complex to do, but what I want is really
>>>> simple and common: "start on this node first and fail over to the
>>>> other node after 3 failures".
>>>>
>>>> Log and config attached.
>>> The log is OK. Several things in there:
>>>
>>> Apr 24 17:54:53 pollux tengine: [4168]: ERROR: stonithd_op_result_ready:
>>> failed due to not on signon status.
>>> Apr 24 17:54:53 pollux tengine: [4168]: ERROR:
>>> tengine_stonith_connection_destroy: Fencing daemon has left us
>>> Apr 24 17:54:53 pollux heartbeat: [3574]: ERROR: Exiting
>>> /usr/lib64/heartbeat/stonithd process 3668 dumped core
>>>
>>> Perhaps you could give us a backtrace from this core dump.
>>>
>>> Apr 24 17:56:08 pollux drbd: ERROR: Module drbd does not exist in
>>> /proc/modules
>>>
>>> A drbd setup problem?
>>>
>>> Apr 24 17:54:53 pollux pengine: [4169]: ERROR: can_run_resources: No node
>>> supplied
>>>
>>> This is an interesting message. I can't find it in the development code.
>>> Apr 24 18:00:26 pollux IPv6addr: [4365]: ERROR: no valid mecahnisms
>>> Apr 24 18:00:26 pollux crmd: [3647]: ERROR: process_lrm_event: LRM
>>> operation IPv6addr_ldap_start_0 (call=22, rc=1) Error unknown error
>>>
>>> Can I suggest first getting a regular working configuration which
>>> includes all the resources, and a sane, well-behaved cluster? Then
>>> we can see whether fiddling with constraints gives any results.
>>>
>>> Thanks.
>>>
>>>> Thanks!
>>>>
>>>> Ben
>>>>
>>>> Dejan Muhamedagic wrote:
>>>>> On Fri, Apr 20, 2007 at 03:04:56PM +0200, Benjamin Watine wrote:
>>>>>> Hi the list,
>>>>>>
>>>>>> I'm trying to set location constraints for 2 resource groups, but
>>>>>> I don't understand very well how they work.
>>>>>> I want to define a preferred node for each group, and tell
>>>>>> Heartbeat to move the group to the other node if 3 resource
>>>>>> failures (and restarts) occur.
>>>>>>
>>>>>> So, I defined default-resource-stickiness at 100,
>>>>>> default-resource-failure-stickiness at -100, and put a score of
>>>>>> 1200 on the preferred node and 1000 on the "second" node.
>>>>>> ((1200 - 1000 + 100) / 100 = 3).
>>>>>>
>>>>>> I'm trying to do this for 2 groups. If 3 failures occur for a
>>>>>> resource of a group, the whole group has to be moved to the other
>>>>>> node. Can I configure group location constraints as for a
>>>>>> resource? How can I get the group failcount (if that makes sense)?
>>>>> I doubt that you can. The failcounts are only on a per-primitive
>>>>> basis. Groups are just shorthand for order and colocation
>>>>> constraints. However, if you choose the resource which matters
>>>>> most to you (the ldap/web service) and make location constraints on
>>>>> it, the other resources in the group will follow should it move
>>>>> to another node.
>>>>>
>>>>>> ... and nothing works :p The resource group doesn't start on the
>>>>>> right node, and never fails over if I manually stop a resource of
>>>>>> the group 3 times.
>>>>>>
>>>>>> Some light on these location constraints would be greatly
>>>>>> appreciated!
>>>>> I'm afraid that I can't offer help on calculating the scores.
>>>>> There has been, however, extensive discussion on the list on the
>>>>> matter several months or a year ago. Perhaps you could search this
>>>>> list's archives.
>>>>>
>>>>>> cibadmin -Ql attached.
>>>>> I can only see in the status that the failcount for httpd_web is 5.
>>>>>
>>>>> Please include the logs, etc.; see:
>>>>>
>>>>> http://linux-ha.org/ReportingProblems
>>>>>
>>>>>> Thank you, in advance.
>>>>>>
>>>>>> Ben
_______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
