Hi,

On Thu, Nov 29, 2007 at 05:23:33PM +0000, Amos Shapira wrote:
> On 29/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> >
> > On Thu, Nov 29, 2007 at 10:25:47AM +0000, Amos Shapira wrote:
> > > On 29/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > > Yes, very much so. For some reason the MCP (master control
> > > > process) doesn't start the rest of the programs which are doing
> > > > the real work. I really can't say why. Can you please attach the
> > > > logs from this node?
> > >
> > > A pstree(1) on the better node visualizes the responsibility of
> > > starting the programs pretty vividly:
> > >
> > >   |-heartbeat,18449
> > >   |   |-attrd,18477
> > >   |   |-ccm,18473
> > >   |   |-cib,18474
> > >   |   |-crmd,18478
> > >   |   |   |-pengine,18505
> > >   |   |   `-tengine,18504
> > >   |   |-heartbeat,18452
> > >   |   |-heartbeat,18453
> > >   |   |-heartbeat,18454
> > >   |   |-heartbeat,18455
> > >   |   |-heartbeat,18456
> > >   |   |-lrmd,18475 -r
> > >   |   |-mgmtd,18479 -v
> > >   |   `-stonithd,18476
> > >
> > > Here they are again (from tonight):
> > >
> > >       1 heartbeat[17481]: 2007/11/29_07:12:40 WARN: heartbeat: udp port 695 reserved for service "ieee-mms-ssl".
> > >       2 heartbeat[17481]: 2007/11/29_07:12:40 info: Version 2 support: yes
> > >       3 heartbeat[17481]: 2007/11/29_07:12:40 WARN: File /etc/ha.d/haresources exists.
> > >       4 heartbeat[17481]: 2007/11/29_07:12:40 WARN: This file is not used because crm is enabled
> > >       5 heartbeat[17481]: 2007/11/29_07:12:40 WARN: Logging daemon is disabled --enabling logging daemon is recommended
> > >       6 heartbeat[17481]: 2007/11/29_07:12:40 info: **************************
> > >       7 heartbeat[17481]: 2007/11/29_07:12:40 info: Configuration validated. Starting heartbeat 2.1.2
> > >       8 heartbeat[17482]: 2007/11/29_07:12:40 info: heartbeat: version 2.1.2
> > >       9 heartbeat[17482]: 2007/11/29_07:12:40 info: Heartbeat generation: 1196102397
> > >      10 heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_TriggerHandler: Added signal manual handler
> > >      11 heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_TriggerHandler: Added signal manual handler
> > >      12 heartbeat[17482]: 2007/11/29_07:12:40 info: Removing /var/run/heartbeat/rsctmp failed, recreating.
> > >      13 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
> > >      14 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound send socket to device: eth0
> > >      15 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound receive socket to device: eth0
> > >      16 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: started on port 695 interface eth0 to 192.168.0.248
> > >      17 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
> > >      18 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound send socket to device: eth0
> > >      19 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound receive socket to device: eth0
> > >      20 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: started on port 695 interface eth0 to 192.168.0.249
> > >      21 heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_SignalHandler: Added signal handler for signal 17
> > >      22 heartbeat[17482]: 2007/11/29_07:12:40 info: Local status now set to: 'up'
> > >      23 heartbeat[17482]: 2007/11/29_07:12:41 info: Link drbd01.test.spammatters.local:eth0 up.
> > >      24 heartbeat[17482]: 2007/11/29_07:12:41 info: Status update for node drbd01.test.spammatters.local: status up
> > >      25 heartbeat[17482]: 2007/11/29_07:13:45 info: all clients are now paused
> > >      26 heartbeat[17482]: 2007/11/29_07:13:45 debug: hist->ackseq =0
> > >      27 heartbeat[17482]: 2007/11/29_07:13:45 debug: hist->lowseq =0, hist->hiseq=101
> > >      28 heartbeat[17482]: 2007/11/29_07:13:45 debug: expecting from drbd01.test.spammatters.local
> > >      29 heartbeat[17482]: 2007/11/29_07:13:45 debug: it's ackseq=0
> >
> > heartbeat is getting no packet acknowledgements from drbd01, so
> > this must be a communication problem. It looks like drbd02 doesn't
> > see the packets coming from drbd01. I'm assuming drbd01 is sending
> > them, which it is if there are no errors reported on drbd01.
> 
> 
> Wouldn't this be the case if crmd crashes? Could this be related to "stonith
> -h" seg-faulting and the missing processes (crmd, cib, attrd, ccm, lrmd,
> mgmtd, stonithd) which I can see on the other node?

No. There is an IPC layer which is used only by heartbeat (the
process) itself. If that doesn't work, heartbeat won't start the
other programs. In any case, you would definitely see error
messages if a program couldn't be started.
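
To check quickly whether any such messages were logged, a small filter
along these lines might help. It is only a rough sketch: the line shape
is guessed from the excerpt quoted above, the set of severity names
(WARN, ERROR, CRIT) is an assumption about what your log contains, and
the names serious_lines and scan_file are made up for illustration.

```python
import re

# Assumed line shape, taken from the log excerpt quoted in this thread:
#   heartbeat[17481]: 2007/11/29_07:12:40 WARN: some message
LINE_RE = re.compile(
    r"(?P<prog>[\w.-]+)\[(?P<pid>\d+)\]:\s+"
    r"(?P<stamp>\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Za-z]+):\s*(?P<msg>.*)"
)

# Severity names assumed to be worth a look; adjust to your own log.
SERIOUS = {"WARN", "ERROR", "CRIT"}


def serious_lines(lines):
    """Yield (prog, pid, level, msg) for every line logged at a serious level."""
    for line in lines:
        m = LINE_RE.search(line.rstrip("\n"))
        if m and m.group("level").upper() in SERIOUS:
            yield (m.group("prog"), int(m.group("pid")),
                   m.group("level"), m.group("msg"))


def scan_file(path):
    """Print the serious lines found in the log file at `path`."""
    with open(path, errors="replace") as fh:
        for prog, pid, level, msg in serious_lines(fh):
            print(f"{prog}[{pid}] {level}: {msg}")
```

Calling scan_file() with the path of your heartbeat log (wherever
ha.cf points it) on both nodes should show at a glance whether
anything beyond the usual startup warnings was reported.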

> I'll try again with the default port, in case this matters.

No, it shouldn't matter.

Thanks,

Dejan

> Thanks.
> 
> --Amos
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
