On 30/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> On Thu, Nov 29, 2007 at 05:23:33PM +0000, Amos Shapira wrote:
> > On 29/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi,
> > >
> > > On Thu, Nov 29, 2007 at 10:25:47AM +0000, Amos Shapira wrote:
> > > > On 29/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > > > Yes, very much so. For some reason the MCP (master control
> > > > > process) doesn't start the rest of the programs which are doing
> > > > > the real work. I really can't say why. Can you please attach the
> > > > > logs from this node?
> > > >
> > > > A pstree(1) on the better node visualizes the responsibility of
> > > > starting the programs pretty vividly:
> > > >
> > > >   |-heartbeat,18449
> > > >   |   |-attrd,18477
> > > >   |   |-ccm,18473
> > > >   |   |-cib,18474
> > > >   |   |-crmd,18478
> > > >   |   |   |-pengine,18505
> > > >   |   |   `-tengine,18504
> > > >   |   |-heartbeat,18452
> > > >   |   |-heartbeat,18453
> > > >   |   |-heartbeat,18454
> > > >   |   |-heartbeat,18455
> > > >   |   |-heartbeat,18456
> > > >   |   |-lrmd,18475 -r
> > > >   |   |-mgmtd,18479 -v
> > > >   |   `-stonithd,18476
> > > >
> > > > Here they are again (from tonight):
> > > >
> > > >       1 heartbeat[17481]: 2007/11/29_07:12:40 WARN: heartbeat: udp port 695 reserved for service "ieee-mms-ssl".
> > > >       2 heartbeat[17481]: 2007/11/29_07:12:40 info: Version 2 support: yes
> > > >       3 heartbeat[17481]: 2007/11/29_07:12:40 WARN: File /etc/ha.d/haresources exists.
> > > >       4 heartbeat[17481]: 2007/11/29_07:12:40 WARN: This file is not used because crm is enabled
> > > >       5 heartbeat[17481]: 2007/11/29_07:12:40 WARN: Logging daemon is disabled --enabling logging daemon is recommended
> > > >       6 heartbeat[17481]: 2007/11/29_07:12:40 info: **************************
> > > >       7 heartbeat[17481]: 2007/11/29_07:12:40 info: Configuration validated. Starting heartbeat 2.1.2
> > > >       8 heartbeat[17482]: 2007/11/29_07:12:40 info: heartbeat: version 2.1.2
> > > >       9 heartbeat[17482]: 2007/11/29_07:12:40 info: Heartbeat generation: 1196102397
> > > >      10 heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_TriggerHandler: Added signal manual handler
> > > >      11 heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_TriggerHandler: Added signal manual handler
> > > >      12 heartbeat[17482]: 2007/11/29_07:12:40 info: Removing /var/run/heartbeat/rsctmp failed, recreating.
> > > >      13 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
> > > >      14 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound send socket to device: eth0
> > > >      15 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound receive socket to device: eth0
> > > >      16 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: started on port 695 interface eth0 to 192.168.0.248
> > > >      17 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
> > > >      18 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound send socket to device: eth0
> > > >      19 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound receive socket to device: eth0
> > > >      20 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: started on port 695 interface eth0 to 192.168.0.249
> > > >      21 heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_SignalHandler: Added signal handler for signal 17
> > > >      22 heartbeat[17482]: 2007/11/29_07:12:40 info: Local status now set to: 'up'
> > > >      23 heartbeat[17482]: 2007/11/29_07:12:41 info: Link drbd01.test.spammatters.local:eth0 up.
> > > >      24 heartbeat[17482]: 2007/11/29_07:12:41 info: Status update for node drbd01.test.spammatters.local: status up
> > > >      25 heartbeat[17482]: 2007/11/29_07:13:45 info: all clients are now paused
> > > >      26 heartbeat[17482]: 2007/11/29_07:13:45 debug: hist->ackseq =0
> > > >      27 heartbeat[17482]: 2007/11/29_07:13:45 debug: hist->lowseq =0, hist->hiseq=101
> > > >      28 heartbeat[17482]: 2007/11/29_07:13:45 debug: expecting from drbd01.test.spammatters.local
> > > >      29 heartbeat[17482]: 2007/11/29_07:13:45 debug: it's ackseq=0
> > >
> > > heartbeat is getting no packet acknowledgements from drbd01. It
> > > must be a communication problem. Looks like drbd02 doesn't see
> > > packets coming from drbd01, assuming that it's sending them,
> > > which it does if there are no errors reported in drbd01.
> >
> >
> > Wouldn't this be the case if crmd crashes? Could this be related to
> > "stonith -h" seg-faulting and the missing processes (crmd, cib, attrd, ccm,
> > lrmd, mgmtd, stonithd) which I can see on the other node?
>
> No. There's an IPC layer which is used by heartbeat (the process)
> only. If that doesn't work, it won't start other programs.


I did some more experimenting: I installed a third machine identical to
the second one, but I still get the same results.

One thing that I managed to change (on both the new machine and the previous
"secondary") is that by moving aside the content of
/usr/lib64/stonith/plugins/stonith2 and leaving only the "null" plugin in
there I could get rid of the "stonith -h" segmentation fault (and I don't
have any of the devices these plugins talk to anyway).

But still I don't see crmd and friends on any machine except for the
primary.
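
To compare nodes quickly, here is a minimal Python sketch (not part of heartbeat) that reports which of the daemons seen under heartbeat on the primary are absent from a list of process names. The helper names `missing_children` and `running_process_names` are made up for illustration, the expected list is copied from the pstree output quoted above, and the `ps -e -o comm=` invocation is an assumption about the local ps.

```python
import subprocess

# Daemons that run under heartbeat on the working node, per the pstree output.
EXPECTED = ("attrd", "ccm", "cib", "crmd", "lrmd", "mgmtd", "stonithd")


def missing_children(running_names):
    """Return the expected daemons not present in running_names, in EXPECTED order."""
    running = set(running_names)
    return [name for name in EXPECTED if name not in running]


def running_process_names():
    """List command names of all processes.

    Assumption: the local ps accepts "-e -o comm=" and prints one name per line.
    """
    out = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

Running `missing_children(running_process_names())` on each node would give a short list to paste into a report instead of a full process tree.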

> Anyway, you would definitely see error messages if a program
> can't be started.


Where should I look for them? The init.d script redirects almost everything
to /dev/null.
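
Once the messages do turn up somewhere, a small filter can help pick out the non-routine ones. The following is a rough Python sketch that assumes the line layout seen in the log excerpt quoted earlier (`program[pid]: timestamp LEVEL: message`, optionally preceded by a line counter). The function name `interesting_lines` is hypothetical, and the severity set is an assumption: only WARN, info, and debug actually appear in that excerpt.

```python
import re

# Matches e.g.: heartbeat[17481]: 2007/11/29_07:12:40 WARN: some message
# An optional leading counter (as added by "cat -n") is skipped.
LINE_RE = re.compile(
    r"^\s*(?:\d+\s+)?(?P<prog>[\w-]+)\[(?P<pid>\d+)\]: "
    r"(?P<stamp>\S+) (?P<level>[A-Za-z]+): (?P<msg>.*)$"
)

# Assumption: severities worth reading first. ERROR and CRIT are guesses;
# they do not occur in the excerpt from this thread.
INTERESTING = {"WARN", "ERROR", "CRIT"}


def interesting_lines(lines):
    """Return (program, pid, level, message) for lines with an interesting level."""
    found = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("level") in INTERESTING:
            found.append((
                match.group("prog"),
                int(match.group("pid")),
                match.group("level"),
                match.group("msg"),
            ))
    return found
```

Lines that do not fit the assumed layout are silently skipped, so an empty result does not prove that nothing went wrong.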

> > I'll try again with the default port, in case this matters.
>
> No, it shouldn't matter.


Apparently it didn't matter :^)

Thanks very much for your time.

--Amos
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
