On 29/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> On Thu, Nov 29, 2007 at 05:23:33PM +0000, Amos Shapira wrote:
> > On 29/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi,
> > >
> > > On Thu, Nov 29, 2007 at 10:25:47AM +0000, Amos Shapira wrote:
> > > > On 29/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > > > Yes, very much so. For some reason the MCP (master control
> > > > > process) doesn't start the rest of the programs which are doing
> > > > > the real work. I really can't say why. Can you please attach the
> > > > > logs from this node?
> > > >
> > > > A pstree(1) on the better node visualizes the responsibility of
> > > > starting the programs pretty vividly:
> > > >
> > > > |-heartbeat,18449
> > > > |   |-attrd,18477
> > > > |   |-ccm,18473
> > > > |   |-cib,18474
> > > > |   |-crmd,18478
> > > > |   |   |-pengine,18505
> > > > |   |   `-tengine,18504
> > > > |   |-heartbeat,18452
> > > > |   |-heartbeat,18453
> > > > |   |-heartbeat,18454
> > > > |   |-heartbeat,18455
> > > > |   |-heartbeat,18456
> > > > |   |-lrmd,18475 -r
> > > > |   |-mgmtd,18479 -v
> > > > |   `-stonithd,18476
> > > >
> > > > Here they are again (from tonight):
> > > >
> > > >  1 heartbeat[17481]: 2007/11/29_07:12:40 WARN: heartbeat: udp port 695 reserved for service "ieee-mms-ssl".
> > > >  2 heartbeat[17481]: 2007/11/29_07:12:40 info: Version 2 support: yes
> > > >  3 heartbeat[17481]: 2007/11/29_07:12:40 WARN: File /etc/ha.d/haresources exists.
> > > >  4 heartbeat[17481]: 2007/11/29_07:12:40 WARN: This file is not used because crm is enabled
> > > >  5 heartbeat[17481]: 2007/11/29_07:12:40 WARN: Logging daemon is disabled --enabling logging daemon is recommended
> > > >  6 heartbeat[17481]: 2007/11/29_07:12:40 info: **************************
> > > >  7 heartbeat[17481]: 2007/11/29_07:12:40 info: Configuration validated. Starting heartbeat 2.1.2
> > > >  8 heartbeat[17482]: 2007/11/29_07:12:40 info: heartbeat: version 2.1.2
> > > >  9 heartbeat[17482]: 2007/11/29_07:12:40 info: Heartbeat generation: 1196102397
> > > > 10 heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_TriggerHandler: Added signal manual handler
> > > > 11 heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_TriggerHandler: Added signal manual handler
> > > > 12 heartbeat[17482]: 2007/11/29_07:12:40 info: Removing /var/run/heartbeat/rsctmp failed, recreating.
> > > > 13 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
> > > > 14 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound send socket to device: eth0
> > > > 15 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound receive socket to device: eth0
> > > > 16 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: started on port 695 interface eth0 to 192.168.0.248
> > > > 17 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
> > > > 18 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound send socket to device: eth0
> > > > 19 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound receive socket to device: eth0
> > > > 20 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: started on port 695 interface eth0 to 192.168.0.249
> > > > 21 heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_SignalHandler: Added signal handler for signal 17
> > > > 22 heartbeat[17482]: 2007/11/29_07:12:40 info: Local status now set to: 'up'
> > > > 23 heartbeat[17482]: 2007/11/29_07:12:41 info: Link drbd01.test.spammatters.local:eth0 up.
> > > > 24 heartbeat[17482]: 2007/11/29_07:12:41 info: Status update for node drbd01.test.spammatters.local: status up
> > > > 25 heartbeat[17482]: 2007/11/29_07:13:45 info: all clients are now paused
> > > > 26 heartbeat[17482]: 2007/11/29_07:13:45 debug: hist->ackseq =0
> > > > 27 heartbeat[17482]: 2007/11/29_07:13:45 debug: hist->lowseq =0, hist->hiseq=101
> > > > 28 heartbeat[17482]: 2007/11/29_07:13:45 debug: expecting from drbd01.test.spammatters.local
> > > > 29 heartbeat[17482]: 2007/11/29_07:13:45 debug: it's ackseq=0
> > >
> > > heartbeat is getting no packet acknowledgements from drbd01. It
> > > must be a communication problem. Looks like drbd02 doesn't see
> > > packets coming from drbd01, assuming that it's sending them,
> > > which it does if there are no errors reported in drbd01.
> >
> > Wouldn't this be the case if crmd crashes? Could this be related to
> > "stonith -h" seg-faulting and the missing processes (crmd, cib, attrd,
> > ccm, lrmd, mgmtd, stonithd) which I can see on the other node?
>
> No. There's an IPC layer which is used by heartbeat (the process)
> only. If that doesn't work, it won't start other programs.
> Anyway, you would definitely see error messages if a program
> can't be started.

So what could explain the missing processes? I guess it's not normal that
all I see are a few "heartbeat" and "ha_logd" processes, is it?

Also, doesn't the fact that "stonith" explodes even on a simple "-h"
indicate a problem?

> > I'll try again with the default port, in case this matters.
>
> No, it shouldn't matter.

"SHOULDN'T" is the problem :^). Anyway, I tried it, as well as a re-install
of the packages after removing all the intermediate files, and I still get
the same results.

One curious thing: running the entire "service heartbeat start" process on
drbd02 (the bad node) under strace seems to allow crmd to run, and even lets
crm_mon connect and give info, but it could only find itself and not the
good node (drbd01). In any case, this isn't a solution for a production
system.

Thanks,

--Amos
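
To make the comparison between the two nodes repeatable, here is a minimal sketch of a check for the daemons that heartbeat is expected to start. It is not part of heartbeat; the function names are made up for illustration, and the expected list is simply the set of processes reported missing on drbd02 in this thread (it may differ with other configurations). It assumes a `ps` that accepts `-e -o comm=`.

```python
import subprocess

# Daemons seen under heartbeat on the good node and reported missing on
# the bad one in this thread. Assumption: adjust for other setups.
EXPECTED = ["attrd", "ccm", "cib", "crmd", "lrmd", "mgmtd", "stonithd"]


def missing_daemons(process_names, expected=EXPECTED):
    """Return the expected daemon names that are absent from process_names."""
    running = {name.strip() for name in process_names if name.strip()}
    return [name for name in expected if name not in running]


def running_process_names():
    """Return the command names of all running processes, one per entry."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Calling `missing_daemons(running_process_names())` on each node gives the list of daemons that did not come up there; an empty list means all of the expected ones were found.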
