Dejan Muhamedagic wrote:
> On Wed, Apr 25, 2007 at 11:59:02AM +0200, Benjamin Watine wrote:
>> You were right: it wasn't a score problem, but my IPv6 resource causing
>> an error and leaving the resource group unstarted.
>>
>> Without IPv6, everything is OK; the behaviour of Heartbeat fits my needs
>> (start on the preferred node (castor), and fail over after 3 failures).
>> So my problem is IPv6 now.
>>
>> The script seems to have a problem:
>>
>> # /etc/ha.d/resource.d/IPv6addr 2001:660:6301:301::47:1 start
>> *** glibc detected *** free(): invalid next size (fast):
>> 0x000000000050d340 ***
>> /etc/ha.d/resource.d//hto-mapfuncs: line 51: 4764 Aborted
>> $__SCRIPT_NAME start
>> 2007/04/25_11:43:29 ERROR: Unknown error: 134
>> ERROR: Unknown error: 134
>>
>> But now ifconfig shows that IPv6 is configured correctly, even though the
>> script exited with an error code.
>
> IPv6addr aborts, hence the exit code 134 (128+signo). Somebody
> recently posted a set of patches for IPv6addr... Right, I'm cc-ing
> this to Horms.
>
Thank you so much; I'll wait for Horms then. I'll also take a look at the
list archives.
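For what it's worth, decoding exit status 134 does point at SIGABRT, which
matches the glibc abort above. A quick check (a minimal Python sketch of
the "128 + signal number" rule, nothing more):

    # Shells report "terminated by signal N" as exit status 128 + N.
    import signal

    status = 134                       # exit code reported by hto-mapfuncs
    signo = status - 128               # 6
    print(signal.Signals(signo).name)  # -> SIGABRT, i.e. glibc called abort()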
>> # ifconfig
>> eth0 Lien encap:Ethernet HWaddr 00:13:72:58:74:5F
>> inet adr:193.48.169.46 Bcast:193.48.169.63
>> Masque:255.255.255.224
>> adr inet6: 2001:660:6301:301:213:72ff:fe58:745f/64 Scope:Global
>> adr inet6: fe80::213:72ff:fe58:745f/64 Scope:Lien
>> adr inet6: 2001:660:6301:301::47:1/64 Scope:Global
>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>> RX packets:3788 errors:0 dropped:0 overruns:0 frame:0
>> TX packets:3992 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 lg file transmission:1000
>> RX bytes:450820 (440.2 KiB) TX bytes:844188 (824.4 KiB)
>> Adresse de base:0xecc0 Mémoire:fe6e0000-fe700000
>>
>> And if I launch the script again, no error is returned:
>>
>> # /etc/ha.d/resource.d/IPv6addr 2001:660:6301:301::47:1 start
>> 2007/04/25_11:45:23 INFO: Success
>> INFO: Success
>>
>> As for the other errors: I have disabled stonith for the moment, and DRBD
>> is built into the kernel, so the drbd module is not needed. I've seen this
>> message, but it's not a problem.
>
> There's a small problem with the stonith suicide agent, which
> renders it unusable, but it is soon to be fixed.
>
OK, that's what I had read on this list, but I wasn't sure. Is there a
patch available now?
>> I attached the log and config, and the core file from stonith
>> (/var/lib/heartbeat/cores/root/core.3668). Is that what you asked for
>> (a backtrace from the stonith core dump)?
>
> You shouldn't be sending core dumps to a public list: it may
> contain sensitive information. What I asked for, a backtrace, you
> get like this:
>
> $ gdb /usr/lib64/heartbeat/stonithd core.3668
> (gdb) bt
> ... < here comes the backtrace
> (gdb) quit
>
Oops! Here it is:
#0 0x00000039b9d03507 in stonith_free_hostlist () from
/usr/lib64/libstonith.so.1
#1 0x0000000000408a95 in ?? ()
#2 0x0000000000407fee in ?? ()
#3 0x00000000004073c3 in ?? ()
#4 0x000000000040539d in ?? ()
#5 0x0000000000405015 in ?? ()
#6 0x00000039b950abd4 in G_CH_dispatch_int () from
/usr/lib64/libplumb.so.1
#7 0x0000003a12a266bd in g_main_context_dispatch () from
/usr/lib64/libglib-2.0.so.0
#8 0x0000003a12a28397 in g_main_context_acquire () from
/usr/lib64/libglib-2.0.so.0
#9 0x0000003a12a28735 in g_main_loop_run () from
/usr/lib64/libglib-2.0.so.0
#10 0x000000000040341a in ?? ()
#11 0x0000003a0fd1c4ca in __libc_start_main () from /lib64/tls/libc.so.6
#12 0x000000000040303a in ?? ()
#13 0x00007fff0f04b8d8 in ?? ()
#14 0x000000000000001c in ?? ()
#15 0x0000000000000001 in ?? ()
#16 0x00007fff0f04cb73 in ?? ()
#17 0x0000000000000000 in ?? ()
Thanks for taking the time to explain some of the basics to me...
Ben
> Thanks.
>
>> Thanks a lot for helping.
>>
>> Ben
>>
>> Dejan Muhamedagic wrote:
>>> On Tue, Apr 24, 2007 at 06:36:04PM +0200, Benjamin Watine wrote:
>>>> Thank you, Dejan, for replying (I feel less alone now!)
>>> That's good.
>>>
>>>> I've applied the location constraint to only one resource (slapd), as
>>>> you suggested, but it still doesn't work as expected.
>>>>
>>>> I've read in the list archive that I have to sum the stickiness of all
>>>> the resources in the group to get the real calculated resource
>>>> stickiness. It makes sense, so let's do it:
>>>>
>>>> resource-stickiness at 100,
>>>> resource-failure-stickiness at -400
>>>> score on slapd on node1 (castor) at 1600
>>>> score on slapd on node2 (pollux) at 1000
>>>>
>>>> If I apply the given relation (I have 6 resources in my group):
>>>> ((1600 - 1000) + (6 * 100)) / 400 = 3
>>>>
>>>> slapd should start on castor, and fail over to pollux if it fails 3
>>>> times, right?
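To make that arithmetic explicit, here is a minimal sketch (Python; the
relation itself is only what I summed up from the list archive, so treat
the formula as an assumption):

    # Expected number of failures before the group moves, assuming:
    #   fails = (score_preferred - score_other + n_resources * stickiness)
    #           / abs(failure_stickiness)
    score_castor = 1600        # location score on the preferred node
    score_pollux = 1000        # location score on the other node
    n_resources = 6            # resources in the ldap group
    stickiness = 100           # resource-stickiness
    failure_stickiness = -400  # resource-failure-stickiness

    fails = (score_castor - score_pollux
             + n_resources * stickiness) / abs(failure_stickiness)
    print(fails)               # -> 3.0, so failover after the 3rd failure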
>>>>
>>>> But now my resource doesn't start at all. In the log, I can see:
>>>>
>>>> Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource
>>>> ldap_drbddisk cannot run anywhere
>>>> Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource
>>>> ldap_Filesystem cannot run anywhere
>>>> Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource
>>>> ldap_IPaddr_193_48_169_47 cannot run anywhere
>>>> Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource
>>>> IPv6addr_ldap cannot run anywhere
>>>> Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource
>>>> ldap_slapd cannot run anywhere
>>>> Apr 24 18:00:26 pollux pengine: [4145]: WARN: native_color: Resource
>>>> ldap_MailTo cannot run anywhere
>>>>
>>>> What am I doing wrong? Is there exhaustive documentation about scores
>>>> and locations? It's complex to do, but what I want is really simple and
>>>> common: "start on this node first and fail over to the other node if
>>>> you fail 3 times".
>>>>
>>>> Log and config attached.
>>> Log is OK. Several things in there:
>>>
>>> Apr 24 17:54:53 pollux tengine: [4168]: ERROR: stonithd_op_result_ready:
>>> failed due to not on signon status.
>>> Apr 24 17:54:53 pollux tengine: [4168]: ERROR:
>>> tengine_stonith_connection_destroy: Fencing daemon has left us
>>> Apr 24 17:54:53 pollux heartbeat: [3574]: ERROR: Exiting
>>> /usr/lib64/heartbeat/stonithd process 3668 dumped core
>>>
>>> Perhaps you could give us a backtrace from this core dump.
>>>
>>> Apr 24 17:56:08 pollux drbd: ERROR: Module drbd does not exist in
>>> /proc/modules
>>>
>>> A drbd setup problem?
>>>
>>> Apr 24 17:54:53 pollux pengine: [4169]: ERROR: can_run_resources: No
>>> node supplied
>>>
>>> This is an interesting message. Can't find it in the development code.
>>>
>>> Apr 24 18:00:26 pollux IPv6addr: [4365]: ERROR: no valid mecahnisms
>>> Apr 24 18:00:26 pollux crmd: [3647]: ERROR: process_lrm_event: LRM
>>> operation IPv6addr_ldap_start_0 (call=22, rc=1) Error unknown error
>>>
>>> Can I suggest first getting a regular working configuration which
>>> includes all the resources and a sane, well-behaved cluster? Then we
>>> can see whether fiddling with constraints gives any results.
>>>
>>> Thanks.
>>>
>>>> Thanks !
>>>>
>>>> Ben
>>>>
>>>> Dejan Muhamedagic a écrit :
>>>>> On Fri, Apr 20, 2007 at 03:04:56PM +0200, Benjamin Watine wrote:
>>>>>> Hi the list
>>>>>>
>>>>>> I'm trying to set location constraints for 2 resource groups, but I
>>>>>> don't understand very well how they work.
>>>>>> I want to define a preferred node for each group, and tell Heartbeat
>>>>>> to move the group to the other node if 3 resource failures (and
>>>>>> restarts) occur.
>>>>>>
>>>>>> So, I defined default-resource-stickiness at 100,
>>>>>> default-resource-failure-stickiness at -100, and put a score of 1200
>>>>>> on the preferred node, and 1000 on the "second" node.
>>>>>> ((1200 - 1000 + 100) / 100 = 3).
>>>>>>
>>>>>> I'm trying to do this for 2 groups. If 3 failures occur for a
>>>>>> resource of a group, the whole group has to be moved to the other
>>>>>> node. Can I configure a group location constraint the same way as
>>>>>> for a resource? How can I get the group failcount (if that makes
>>>>>> sense)?
>>>>> I doubt that you can. The failcounts are only on a per primitive
>>>>> basis. Groups are just shorthand for order and colocation
>>>>> constraints. However, if you choose a resource which matters the
>>>>> most to you (ldap/web service) and put a location constraint on
>>>>> it, the other resources in the group will follow should it move
>>>>> to another node.
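For reference, the constraint I have on slapd looks roughly like this (an
illustrative sketch only: the ids are made up, and only the score, the
resource id and the node name come from my config; it lives in the
<constraints> section of the CIB, e.g. loaded with cibadmin):

    <rsc_location id="loc_ldap_slapd_castor" rsc="ldap_slapd">
      <rule id="pref_castor" score="1600">
        <expression id="pref_castor_expr" attribute="#uname"
                    operation="eq" value="castor"/>
      </rule>
    </rsc_location>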
>>>>>
>>>>>> ... and nothing works :p The resource group doesn't start on the
>>>>>> right node, and never fails over if I manually stop a resource of
>>>>>> the group 3 times.
>>>>>>
>>>>>> Some light on these location constraints would be greatly
>>>>>> appreciated!
>>>>> I'm afraid that I can't offer help on calculating the scores. There
>>>>> has been, however, extensive discussion on the list on the matter
>>>>> several months or a year ago. Perhaps you could search this list's
>>>>> archives.
>>>>>
>>>>>> cibadmin -Ql attached.
>>>>> I can only see in the status that the failcount for httpd_web is 5.
>>>>>
>>>>> Please include the logs, etc, see:
>>>>>
>>>>> http://linux-ha.org/ReportingProblems
>>>>>
>>>>>
>>>>>
>>>>>> Thank you, in advance.
>>>>>>
>>>>>> Ben
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems