Hi All,

I had a few extra minutes, so I upgraded heartbeat to version 2.1.3 and restarted both of my servers.

Things are working fine on the "master" server, but I have several issues.

1) drbd does not seem to be communicating.

on the master

thebrain:~ # cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by [EMAIL PROTECTED], 2007-09-03 10:39:27
 0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown   r---
    ns:0 nr:0 dw:10180 dr:39945 al:488 bm:488 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
act_log: used:0/127 hits:2057 misses:505 starving:15 dirty:2 changed: 488

on the slave

pinky:~ # cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by [EMAIL PROTECTED], 2007-09-03 10:39:27
 0: cs:StandAlone st:Secondary/Unknown ds:UpToDate/DUnknown   r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:106 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
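For what it's worth, cs:StandAlone on both sides means neither node is even attempting to reach its peer, so nothing will resync until the nodes are told to connect. A minimal sketch of checking the connection state (the heredoc stands in for the real /proc/drbd so this runs anywhere; "mirror" is the resource name visible in the logs, and the drbdadm suggestion is my assumption of the usual first step, not a verified fix for this setup):

```shell
# Extract the connection state (the cs: field) from /proc/drbd-style output.
# A heredoc stands in for the real file so the sketch is self-contained.
cstate=$(awk '/cs:/ { sub(/.*cs:/, ""); sub(/ .*/, ""); print; exit }' <<'EOF'
 0: cs:StandAlone st:Secondary/Unknown ds:UpToDate/DUnknown   r---
EOF
)
echo "connection state: $cstate"

# If both nodes report StandAlone, telling each side to connect is the
# usual first step to try (drbdadm connect <resource>).
if [ "$cstate" = "StandAlone" ]; then
    echo "would run: drbdadm connect mirror   # on both nodes"
fi
```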


Looking at the logs on the client, I find:

pinky:~ # grep drbd /var/log/messages | tail -20
Jan 31 01:12:33 pinky crmd: [4080]: info: process_lrm_event: LRM operation drbd_sys:1_notify_0 (call=18, rc=0) complete
Jan 31 01:12:34 pinky crmd: [4080]: info: do_lrm_rsc_op: Performing op=drbd_sys:1_notify_0 key=75:20:ab1e279e-4466-4cee-a6b7-47c81743e85d)
Jan 31 01:12:34 pinky lrmd: [4077]: info: rsc:drbd_sys:1: notify
Jan 31 01:12:34 pinky drbd[4761]: [4772]: DEBUG: mirror notify: pre for promote - counts: active 0 - starting 2 - stopping 0
Jan 31 01:12:34 pinky crmd: [4080]: info: process_lrm_event: LRM operation drbd_sys:1_notify_0 (call=19, rc=0) complete
Jan 31 01:12:35 pinky crmd: [4080]: info: do_lrm_rsc_op: Performing op=drbd_sys:1_notify_0 key=76:20:ab1e279e-4466-4cee-a6b7-47c81743e85d)
Jan 31 01:12:35 pinky lrmd: [4077]: info: rsc:drbd_sys:1: notify
Jan 31 01:12:35 pinky drbd[4773]: [4784]: DEBUG: mirror notify: post for promote - counts: active 0 - starting 2 - stopping 0
Jan 31 01:12:35 pinky drbd[4773]: [4786]: DEBUG: mirror: Calling drbdadm -c /etc/drbd.conf state mirror
Jan 31 01:12:35 pinky drbd[4773]: [4793]: DEBUG: mirror: Exit code 0
Jan 31 01:12:35 pinky drbd[4773]: [4794]: DEBUG: mirror: Command output: Secondary/Unknown
Jan 31 01:12:35 pinky drbd[4773]: [4802]: DEBUG: mirror: Calling drbdadm -c /etc/drbd.conf cstate mirror
Jan 31 01:12:35 pinky drbd[4773]: [4809]: DEBUG: mirror: Exit code 0
Jan 31 01:12:35 pinky drbd[4773]: [4810]: DEBUG: mirror: Command output: StandAlone
Jan 31 01:12:35 pinky drbd[4773]: [4811]: DEBUG: mirror status: Secondary/Unknown Secondary Unknown StandAlone
Jan 31 01:12:35 pinky drbd[4773]: [4812]: DEBUG: mirror: Calling /usr/sbin/crm_master -v 5 -l reboot
Jan 31 01:12:37 pinky drbd[4773]: [4817]: DEBUG: mirror: Exit code 0
Jan 31 01:12:37 pinky drbd[4773]: [4818]: DEBUG: mirror: Command output: No set matching id=master-6deaf16f-a98f-4c56-af00-a97be99f2e68 in status
Jan 31 01:12:37 pinky lrmd: [4077]: info: RA output: (drbd_sys:1:notify:stdout) No set matching id=master-6deaf16f-a98f-4c56-af00-a97be99f2e68 in status
Jan 31 01:12:37 pinky crmd: [4080]: info: process_lrm_event: LRM operation drbd_sys:1_notify_0 (call=20, rc=0) complete

I'm tired and Google has failed me; I don't know where to start looking for what's wrong.

The other problem I noticed is that mgmtd is no longer starting:

Jan 31 01:12:06 pinky heartbeat: [3821]: info: Starting child client "/usr/lib64/heartbeat/mgmtd -v" (0,0)
Jan 31 01:12:06 pinky heartbeat: [4081]: info: Starting "/usr/lib64/heartbeat/mgmtd -v" as uid 0 gid 0 (pid 4081)
Jan 31 01:12:06 pinky mgmtd: [4081]: info: G_main_add_SignalHandler: Added signal handler for signal 15
Jan 31 01:12:06 pinky mgmtd: [4081]: debug: Enabling coredumps
Jan 31 01:12:06 pinky mgmtd: [4081]: info: G_main_add_SignalHandler: Added signal handler for signal 10
Jan 31 01:12:06 pinky mgmtd: [4081]: info: G_main_add_SignalHandler: Added signal handler for signal 12
Jan 31 01:12:06 pinky heartbeat: [3821]: WARN: Client [mgmtd] pid 4081 failed authorization [no default client auth]
Jan 31 01:12:06 pinky heartbeat: [3821]: ERROR: api_process_registration_msg: cannot add client(mgmtd)
Jan 31 01:12:06 pinky mgmtd: [4081]: ERROR: Cannot sign on with heartbeat
Jan 31 01:12:06 pinky mgmtd: [4081]: ERROR: REASON:
Jan 31 01:12:06 pinky mgmtd: [4081]: ERROR: Can't initialize management library.Shutting down.(-1)
Jan 31 01:12:06 pinky heartbeat: [3821]: WARN: Managed /usr/lib64/heartbeat/mgmtd -v process 4081 exited with return code 1.
Jan 31 01:12:06 pinky heartbeat: [3821]: ERROR: Respawning client "/usr/lib64/heartbeat/mgmtd -v":
Jan 31 01:12:06 pinky heartbeat: [3821]: info: Starting child client "/usr/lib64/heartbeat/mgmtd -v" (0,0)
Jan 31 01:12:07 pinky heartbeat: [4084]: info: Starting "/usr/lib64/heartbeat/mgmtd -v" as uid 0 gid 0 (pid 4084)
Jan 31 01:12:07 pinky mgmtd: [4084]: info: G_main_add_SignalHandler: Added signal handler for signal 15
Jan 31 01:12:07 pinky mgmtd: [4084]: debug: Enabling coredumps
Jan 31 01:12:07 pinky mgmtd: [4084]: info: G_main_add_SignalHandler: Added signal handler for signal 10
Jan 31 01:12:07 pinky mgmtd: [4084]: info: G_main_add_SignalHandler: Added signal handler for signal 12
Jan 31 01:12:07 pinky heartbeat: [3821]: WARN: Client [mgmtd] pid 4084 failed authorization [no default client auth]
Jan 31 01:12:07 pinky heartbeat: [3821]: ERROR: api_process_registration_msg: cannot add client(mgmtd)
Jan 31 01:12:07 pinky mgmtd: [4084]: ERROR: Cannot sign on with heartbeat
Jan 31 01:12:07 pinky mgmtd: [4084]: ERROR: REASON:
Jan 31 01:12:07 pinky mgmtd: [4084]: ERROR: Can't initialize management library.Shutting down.(-1)
Jan 31 01:12:07 pinky heartbeat: [3821]: WARN: Managed /usr/lib64/heartbeat/mgmtd -v process 4084 exited with return code 1.
Jan 31 01:12:07 pinky heartbeat: [3821]: ERROR: Respawning client "/usr/lib64/heartbeat/mgmtd -v":


I see this on both the main and the secondary server.
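On the mgmtd side: the "failed authorization [no default client auth]" warning looks like heartbeat refusing mgmtd's API registration, rather than mgmtd itself being broken. I can't be sure this applies to your configuration, but the usual place to look is the apiauth/crm directives in ha.cf; the fragment below is an illustrative guess to check against your version's documentation, not a verified fix:

```
# /etc/ha.d/ha.cf (fragment) - illustrative, not a verified fix
crm yes
apiauth mgmtd uid=root
```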

I found some info at

http://www.gossamer-threads.com/lists/linuxha/users/45403?do=post_view_threaded#45403

but nothing that seemed to really apply..

If anyone can offer any suggestions, I would really appreciate it.

Thank you very much
Tony Nelson
Starpoint Solutions





On Jan 30, 2008, at 10:05 AM, Andreas Mock wrote:

-----Original Message-----
From: General Linux-HA mailing list <[email protected]>
Sent: 30.01.08 13:44:53
To: General Linux-HA mailing list <[email protected]>
Coincidentally, I've been thinking about such a feature recently...
I'm inclined to think that this functionality should be in the LRM
(i.e., it's a threshold for escalating to the CRM).

thoughts?

My thoughts if anyone is interested:

The result of the monitor action should be:
a) Resource is running.
b) Resource is not running.

But does it imply that the resource is running healthily? As the result of
the monitor action determines what happens to the resource,
I would say "YES". So the question is:
Does the resource run in a way that lets it fulfill its service? (Yes/No)
But IMHO this question implies that the RA tries/should try to do as much
as necessary to test the service-ability. This can be quite a lot -
sometimes too much, if the service is doing exactly what it should: working hard!

A timeout means: nothing, no answer.

What shall I do with this information? As I said before: shall I assume
that something is wrong, or that everything is o.k.?
Is everything o.k. if a service is producing so much load that I'm not even
able to get the output of 'ps -ax'?

What is the difference between asking once with a long timeout and asking
multiple times with a short timeout? (If I measure the time, I could know
in both cases that the RA needs (too) long to give an answer.)

IMHO the big problem is not the timeout of the monitor action itself, but the
possible chain reaction that follows it:
monitor times out because of heavy (regular) load => stop action triggered =>
stop action times out (RAs try to do a graceful shutdown) because of heavy load =>
MESS (node fencing or resource in unmanaged state).

My proposal: make the timeout for monitor long enough. If a timeout occurs, assume
that something is really wrong, because a "simple" monitor action does not work.
=> Stop the resource. Implement a two-step stop action (probably in the RA itself):
1) Try a graceful stop of the resource (e.g. a db shutdown).
2) After an inner timeout, stop/kill the resource brutally (if possible).
3) If this doesn't work, signal a timeout to the upper layer, which results in the known behaviour.
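The two-step stop Andreas describes can be sketched in shell. A background sleep stands in for the managed resource, and the one-second "inner timeout" and variable names are illustrative only, not taken from any real RA:

```shell
sleep 300 &                       # stand-in for the managed resource
pid=$!

kill -TERM "$pid" 2>/dev/null     # step 1: graceful stop (e.g. a db shutdown)
sleep 1                           # inner timeout (illustrative)

# Step 2: if the process is still in the process table, kill it brutally.
kill -0 "$pid" 2>/dev/null && kill -KILL "$pid" 2>/dev/null

wait "$pid" 2>/dev/null           # reap the child

# Step 3: only if even KILL failed would a real RA signal a timeout
# upward, triggering the known behaviour (fencing/unmanaged).
if kill -0 "$pid" 2>/dev/null; then
    stop_result="escalate"
else
    stop_result="stopped"
fi
echo "$stop_result"
```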

What would one win: killing one resource brutally, in the hope that all the
other resources remain intact.

Of course I'm interested to hear other aspects. :-)

Best regards
Andreas Mock


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

