On 29/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> On Thu, Nov 29, 2007 at 10:25:47AM +0000, Amos Shapira wrote:
> > On 29/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > Yes, very much so. For some reason the MCP (master control
> > > process) doesn't start the rest of the programs which are doing
> > > the real work. I really can't say why. Can you please attach the
> > > logs from this node?
> >
> > A pstree(1) on the better node visualizes the responsibility of
> > starting the programs pretty vividly:
> >
> >   |-heartbeat,18449
> >   |   |-attrd,18477
> >   |   |-ccm,18473
> >   |   |-cib,18474
> >   |   |-crmd,18478
> >   |   |   |-pengine,18505
> >   |   |   `-tengine,18504
> >   |   |-heartbeat,18452
> >   |   |-heartbeat,18453
> >   |   |-heartbeat,18454
> >   |   |-heartbeat,18455
> >   |   |-heartbeat,18456
> >   |   |-lrmd,18475 -r
> >   |   |-mgmtd,18479 -v
> >   |   `-stonithd,18476
> >
> > Here they are again (from tonight):
> >
> > heartbeat[17481]: 2007/11/29_07:12:40 WARN: heartbeat: udp port 695 reserved for service "ieee-mms-ssl".
> > heartbeat[17481]: 2007/11/29_07:12:40 info: Version 2 support: yes
> > heartbeat[17481]: 2007/11/29_07:12:40 WARN: File /etc/ha.d/haresources exists.
> > heartbeat[17481]: 2007/11/29_07:12:40 WARN: This file is not used because crm is enabled
> > heartbeat[17481]: 2007/11/29_07:12:40 WARN: Logging daemon is disabled --enabling logging daemon is recommended
> > heartbeat[17481]: 2007/11/29_07:12:40 info: **************************
> > heartbeat[17481]: 2007/11/29_07:12:40 info: Configuration validated. Starting heartbeat 2.1.2
> > heartbeat[17482]: 2007/11/29_07:12:40 info: heartbeat: version 2.1.2
> > heartbeat[17482]: 2007/11/29_07:12:40 info: Heartbeat generation: 1196102397
> > heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_TriggerHandler: Added signal manual handler
> > heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_TriggerHandler: Added signal manual handler
> > heartbeat[17482]: 2007/11/29_07:12:40 info: Removing /var/run/heartbeat/rsctmp failed, recreating.
> > heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
> > heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound send socket to device: eth0
> > heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound receive socket to device: eth0
> > heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: started on port 695 interface eth0 to 192.168.0.248
> > heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
> > heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound send socket to device: eth0
> > heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound receive socket to device: eth0
> > heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: started on port 695 interface eth0 to 192.168.0.249
> > heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_SignalHandler: Added signal handler for signal 17
> > heartbeat[17482]: 2007/11/29_07:12:40 info: Local status now set to: 'up'
> > heartbeat[17482]: 2007/11/29_07:12:41 info: Link drbd01.test.spammatters.local:eth0 up.
> > heartbeat[17482]: 2007/11/29_07:12:41 info: Status update for node drbd01.test.spammatters.local: status up
> > heartbeat[17482]: 2007/11/29_07:13:45 info: all clients are now paused
> > heartbeat[17482]: 2007/11/29_07:13:45 debug: hist->ackseq =0
> > heartbeat[17482]: 2007/11/29_07:13:45 debug: hist->lowseq =0, hist->hiseq=101
> > heartbeat[17482]: 2007/11/29_07:13:45 debug: expecting from drbd01.test.spammatters.local
> > heartbeat[17482]: 2007/11/29_07:13:45 debug: it's ackseq=0
>
> heartbeat is getting no packet acknowledgements from drbd01. It
> must be a communication problem. Looks like drbd02 doesn't see
> packets coming from drbd01, assuming that it's sending them,
> which it does if there are no errors reported in drbd01.


Wouldn't this also be the case if crmd had crashed? Could it be related to
"stonith -h" segfaulting and to the missing processes (crmd, cib, attrd,
ccm, lrmd, mgmtd, stonithd), which I can see running on the other node?
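
To compare the two nodes without reading pstree by eye, something like the
following could be run on each of them. It is only a rough sketch: the list
of expected names is copied from the pstree output quoted above, the function
names are made up for illustration, and it assumes a `ps` that accepts the
POSIX `-e -o comm=` options.

```python
import subprocess

# Direct children of the heartbeat master process, as seen in the pstree
# output from the working node. pengine and tengine are left out because
# they are started by crmd, not by heartbeat itself.
EXPECTED = ("attrd", "ccm", "cib", "crmd", "lrmd", "mgmtd", "stonithd")


def missing_daemons(running, expected=EXPECTED):
    """Return the expected daemon names that do not appear in `running`."""
    present = set(running)
    return [name for name in expected if name not in present]


def running_commands():
    """Return the command name of every process, using POSIX ps options."""
    out = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


# Usage on a node:
#   print(missing_daemons(running_commands()))
```

An empty list would mean all seven children are present; on the broken node
I would expect it to list all of them.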

I'll try again with the default port, in case this matters.
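
For reference, heartbeat's default udpport is 694, while the log above shows
ucast starting on 695. A small sketch for pulling the ucast port, interface
and peer address out of a log, so that both nodes can be checked for the same
values, might look like this. The function name and regular expression are
mine, and it assumes unwrapped log lines in the format quoted above.

```python
import re

DEFAULT_PORT = 694  # heartbeat's default udpport

# Matches lines such as:
#   glib: ucast: started on port 695 interface eth0 to 192.168.0.248
UCAST_RE = re.compile(r"ucast:\s+started on port (\d+) interface (\S+) to (\S+)")


def ucast_links(log_text):
    """Return (port, interface, peer_address) for each ucast link started."""
    return [
        (int(port), iface, peer)
        for port, iface, peer in UCAST_RE.findall(log_text)
    ]


def non_default_ports(links, default=DEFAULT_PORT):
    """Return the sorted set of ports in `links` that differ from `default`."""
    return sorted({port for port, _, _ in links if port != default})
```

Running `ucast_links()` over the log of each node and comparing the results
should at least show whether both sides agree on the port and the peer
addresses.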

Thanks.

--Amos
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
