On Wed, Apr 25, 2007 at 05:59:12PM +0200, Benjamin Watine wrote:
> Dejan Muhamedagic wrote:
> >On Wed, Apr 25, 2007 at 11:59:02AM +0200, Benjamin Watine wrote:
> >>You were right: it wasn't a score problem. My IPv6 resource causes an
> >>error and leaves the resource group unstarted.
> >>
> >>Without IPv6, all is OK and Heartbeat's behaviour fits my needs (start on
> >>the preferred node (castor), and fail over after 3 failures). So my
> >>problem is IPv6 now.
> >>
> >>The script seems to have a problem:
> >>
> >># /etc/ha.d/resource.d/IPv6addr 2001:660:6301:301::47:1 start
> >>*** glibc detected *** free(): invalid next size (fast): 
> >>0x000000000050d340 ***
> >>/etc/ha.d/resource.d//hto-mapfuncs: line 51:  4764 Aborted 
> >>   $__SCRIPT_NAME start
> >>2007/04/25_11:43:29 ERROR:  Unknown error: 134
> >>ERROR:  Unknown error: 134
> >>
> >>ifconfig now shows that the IPv6 address is configured correctly, yet the
> >>script exits with an error code.
> >
> >IPv6addr aborts, hence the exit code 134 (128+signo). Somebody
> >recently posted a set of patches for IPv6addr... Right, I'm cc-ing
> >this to Horms.
> >
> Thank you so much. I'll wait for Horms, then, and I'll also take a look
> at the list archive.
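To spell out the exit code above: a shell reports 128 + N when a child
is killed by signal N, so 134 means signal 6, which is SIGABRT on Linux
(the signal abort() raises, here after the glibc free() check failed).
A minimal Python sketch of that decoding, nothing Heartbeat-specific;
the function name decode_exit_status is made up for illustration:

```python
import signal


def decode_exit_status(status):
    """Return the signal name for a shell-style exit status above 128, else None."""
    if status <= 128:
        return None  # ordinary exit code, not death by signal
    try:
        # Shell convention: exit status = 128 + signal number
        return signal.Signals(status - 128).name
    except ValueError:
        return None  # no signal with that number on this platform


print(decode_exit_status(134))
```

On a Linux box this should name SIGABRT for 134; the signal numbering is
platform-dependent, so treat the result as a hint rather than a diagnosis.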

BTW, wasn't there a core dump for this case too? Could you run
ls -R /var/lib/heartbeat/cores and check?

> >># ifconfig
> >>eth0      Lien encap:Ethernet  HWaddr 00:13:72:58:74:5F
> >>          inet adr:193.48.169.46  Bcast:193.48.169.63 
> >>Masque:255.255.255.224
> >>          adr inet6: 2001:660:6301:301:213:72ff:fe58:745f/64 Scope:Global
> >>          adr inet6: fe80::213:72ff:fe58:745f/64 Scope:Lien
> >>          adr inet6: 2001:660:6301:301::47:1/64 Scope:Global
> >>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >>          RX packets:3788 errors:0 dropped:0 overruns:0 frame:0
> >>          TX packets:3992 errors:0 dropped:0 overruns:0 carrier:0
> >>          collisions:0 lg file transmission:1000
> >>          RX bytes:450820 (440.2 KiB)  TX bytes:844188 (824.4 KiB)
> >>          Adresse de base:0xecc0 Mémoire:fe6e0000-fe700000
> >>
> >>And if I launch the script again, no error is returned:
> >>
> >># /etc/ha.d/resource.d/IPv6addr 2001:660:6301:301::47:1 start
> >>2007/04/25_11:45:23 INFO:  Success
> >>INFO:  Success

So, you're saying that once the resource is running, starting it
again doesn't produce an error? Did you also try to stop it and
start it from the stopped state?

> >>As for the other errors: I have disabled stonith for the moment, and DRBD
> >>is built into the kernel, so the drbd module is not needed. I've seen this
> >>message, but it's not a problem.
> >
> >There's a small problem with the stonith suicide agent, which
> >renders it unusable, but it is soon to be fixed.
> >
> OK, that's what I had read on this list, but I wasn't sure. Is there
> a patch yet?
> 
> >>I have attached the log and config, and the stonith core file
> >>(/var/lib/heartbeat/cores/root/core.3668). Is that what you asked for
> >>(a backtrace from the stonith core dump)?
> >
> >You shouldn't be sending core dumps to a public list: they may
> >contain sensitive information. What I asked for, a backtrace, you
> >get like this:
> >
> >$ gdb /usr/lib64/heartbeat/stonithd core.3668
> >(gdb) bt
> >...  < here comes the backtrace
> >(gdb) quit
> >
> 
> Oops! Here it is:
> 
> #0  0x00000039b9d03507 in stonith_free_hostlist () from 
> /usr/lib64/libstonith.so.1
> #1  0x0000000000408a95 in ?? ()
> #2  0x0000000000407fee in ?? ()
> #3  0x00000000004073c3 in ?? ()
> #4  0x000000000040539d in ?? ()
> #5  0x0000000000405015 in ?? ()
> #6  0x00000039b950abd4 in G_CH_dispatch_int () from /usr/lib64/libplumb.so.1
> #7  0x0000003a12a266bd in g_main_context_dispatch () from 
> /usr/lib64/libglib-2.0.so.0
> #8  0x0000003a12a28397 in g_main_context_acquire () from 
> /usr/lib64/libglib-2.0.so.0
> #9  0x0000003a12a28735 in g_main_loop_run () from 
> /usr/lib64/libglib-2.0.so.0
> #10 0x000000000040341a in ?? ()
> #11 0x0000003a0fd1c4ca in __libc_start_main () from /lib64/tls/libc.so.6
> #12 0x000000000040303a in ?? ()
> #13 0x00007fff0f04b8d8 in ?? ()
> #14 0x000000000000001c in ?? ()
> #15 0x0000000000000001 in ?? ()
> #16 0x00007fff0f04cb73 in ?? ()
> #17 0x0000000000000000 in ?? ()
> 
> Thanks for taking the time to explain some basics to me...

You're welcome. As Andrew suggested, you should file bugs for both
of these. Interestingly, all those question marks mean that some
debugging info is missing, but then there is some in libplumb.so.
Odd. Where did you say your heartbeat package comes from?

> 
> Ben
> 
> 
> >Thanks.
> >
> >>Thanks a lot for helping.
> >>
> >>Ben
> >>
> >>Dejan Muhamedagic wrote:
> >>>On Tue, Apr 24, 2007 at 06:36:04PM +0200, Benjamin Watine wrote:
> >>>>Thank you, Dejan, for replying (I feel less alone now !)
> >>>That's good.
> >>>
> >>>>I've applied the location constraint to only one resource (slapd), as
> >>>>you suggested, but it still doesn't work as expected.
> >>>>
> >>>>I've read in the list archive that I have to sum the resource stickiness
> >>>>of all resources in the group to get the effective resource stickiness.
> >>>>It makes sense, let's do it:
> >>>>
> >>>>resource-stickiness at 100,
> >>>>resource-failure-stickiness at -400
> >>>>score on slapd on node1 (castor) at 1600
> >>>>score on slapd on node2 (pollux) at 1000
> >>>>
> >>>>If I apply the given relation (I have 6 resources in my group):
> >>>>((1600 - 1000) + (6 * 100)) / 400 = 3
> >>>>
> >>>>slapd should start on castor, and fail over to pollux if it fails 3
> >>>>times, right?
> >>>>
> >>>>But now my resource doesn't start at all. In the log, I can see:
> >>>>
> >>>>Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource 
> >>>>ldap_drbddisk cannot run anywhere
> >>>>Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource 
> >>>>ldap_Filesystem cannot run anywhere
> >>>>Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource 
> >>>>ldap_IPaddr_193_48_169_47 cannot run anywhere
> >>>>Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource 
> >>>>IPv6addr_ldap cannot run anywhere
> >>>>Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource 
> >>>>ldap_slapd cannot run anywhere
> >>>>Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource 
> >>>>ldap_MailTo cannot run anywhere
> >>>>
> >>>>What am I doing wrong? Is there exhaustive documentation about
> >>>>scores and location? It's complex to do, but what I want is really
> >>>>simple and common: "start on this node first and fail over to the other
> >>>>node if you fail 3 times".
> >>>>
> >>>>Log and config attached.
> >>>Log is OK. Several things in there:
> >>>
> >>>Apr 24 17:54:53 pollux tengine: [4168]: ERROR: stonithd_op_result_ready: 
> >>>failed due to not on signon status.
> >>>Apr 24 17:54:53 pollux tengine: [4168]: ERROR: 
> >>>tengine_stonith_connection_destroy: Fencing daemon has left us
> >>>Apr 24 17:54:53 pollux heartbeat: [3574]: ERROR: Exiting 
> >>>/usr/lib64/heartbeat/stonithd process 3668 dumped core
> >>>
> >>>Perhaps you could give us a backtrace from this core dump.
> >>>
> >>>Apr 24 17:56:08 pollux drbd: ERROR: Module drbd does not exist in 
> >>>/proc/modules
> >>>
> >>>A drbd setup problem?
> >>>
> >>>Apr 24 17:54:53 pollux pengine: [4169]: ERROR: can_run_resources: No 
> >>>node supplied
> >>>
> >>>This is an interesting message. Can't find it in the development code.
> >>>
> >>>Apr 24 18:00:26 pollux IPv6addr: [4365]: ERROR: no valid mecahnisms
> >>>Apr 24 18:00:26 pollux crmd: [3647]: ERROR: process_lrm_event: LRM 
> >>>operation IPv6addr_ldap_start_0 (call=22, rc=1) Error unknown error
> >>>
> >>>Can I suggest that you first get a regular working configuration which
> >>>includes all the resources and a sane and well behaving cluster?
> >>>Then we can see if fiddling with constraints gives any results.
> >>>
> >>>Thanks.
> >>>
> >>>>Thanks !
> >>>>
> >>>>Ben
> >>>>
> >>>>Dejan Muhamedagic wrote:
> >>>>>On Fri, Apr 20, 2007 at 03:04:56PM +0200, Benjamin Watine wrote:
> >>>>>>Hi list,
> >>>>>>
> >>>>>>I'm trying to set location constraints for 2 resource groups, but I
> >>>>>>don't understand very well how they work.
> >>>>>>I want to define a preferred node for each group, and tell Heartbeat
> >>>>>>to move the group to the other node if 3 resource failures (and
> >>>>>>restarts) occur.
> >>>>>>
> >>>>>>So, I set default-resource-stickiness to 100 and
> >>>>>>default-resource-failure-stickiness to -100, and put a score of 1200
> >>>>>>on the preferred node and 1000 on the "second" node
> >>>>>>((1200-1000+100)/100 = 3).
> >>>>>>
> >>>>>>I'm trying to do this for 2 groups. If 3 failures occur for a resource
> >>>>>>of a group, the whole group has to be moved to the other node. Can I
> >>>>>>configure a group location constraint as for a resource? How can I get
> >>>>>>a group failcount (if that makes sense)?
> >>>>>I doubt that you can. The failcounts are only on a per primitive
> >>>>>basis. Groups are just shorthand for order and colocation
> >>>>>constraints. However, if you choose a resource which matters the
> >>>>>most to you (ldap/web service) and make location constraints on
> >>>>>them, the other resources in the group will follow should it move
> >>>>>to another node.
> >>>>>
> >>>>>>... and nothing works :p The resource group doesn't start on the right
> >>>>>>node, and never fails over if I manually stop a resource of the group
> >>>>>>3 times.
> >>>>>>
> >>>>>>Some light on these location constraints would be greatly
> >>>>>>appreciated!
> >>>>>I'm afraid that I can't offer help on calculating the scores. There
> >>>>>has been, however, extensive discussion on the list on the matter
> >>>>>several months or a year ago. Perhaps you could search this list's
> >>>>>archives.
> >>>>>
> >>>>>>cibadmin -Ql attached.
> >>>>>I can only see in the status that the failcount for httpd_web is 5.
> >>>>>
> >>>>>Please include the logs, etc, see:
> >>>>>
> >>>>>  http://linux-ha.org/ReportingProblems
> >>>>>
> >>>>>
> >>>>>
> >>>>>>Thank you, in advance.
> >>>>>>
> >>>>>>Ben
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems

-- 
Dejan