On Fri, Nov 30, 2007 at 05:16:38PM +1100, Amos Shapira wrote:
> On 30/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> >
> > On Thu, Nov 29, 2007 at 05:23:33PM +0000, Amos Shapira wrote:
> > > On 29/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Thu, Nov 29, 2007 at 10:25:47AM +0000, Amos Shapira wrote:
> > > > > On 29/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > > > > Yes, very much so. For some reason the MCP (master control
> > > > > > process) doesn't start the rest of the programs which are doing
> > > > > > the real work. I really can't say why. Can you please attach the
> > > > > > logs from this node?
> > > > >
> > > > > A pstree(1) on the better node visualizes the responsibility of
> > > > > starting the programs pretty vividly:
> > > > >
> > > > >   |-heartbeat,18449
> > > > >   |   |-attrd,18477
> > > > >   |   |-ccm,18473
> > > > >   |   |-cib,18474
> > > > >   |   |-crmd,18478
> > > > >   |   |   |-pengine,18505
> > > > >   |   |   `-tengine,18504
> > > > >   |   |-heartbeat,18452
> > > > >   |   |-heartbeat,18453
> > > > >   |   |-heartbeat,18454
> > > > >   |   |-heartbeat,18455
> > > > >   |   |-heartbeat,18456
> > > > >   |   |-lrmd,18475 -r
> > > > >   |   |-mgmtd,18479 -v
> > > > >   |   `-stonithd,18476
> > > > >
> > > > > Here they are again (from tonight):
> > > > >
> > > > > >       1 heartbeat[17481]: 2007/11/29_07:12:40 WARN: heartbeat: udp port 695 reserved for service "ieee-mms-ssl".
> > > > > >       2 heartbeat[17481]: 2007/11/29_07:12:40 info: Version 2 support: yes
> > > > > >       3 heartbeat[17481]: 2007/11/29_07:12:40 WARN: File /etc/ha.d/haresources exists.
> > > > > >       4 heartbeat[17481]: 2007/11/29_07:12:40 WARN: This file is not used because crm is enabled
> > > > > >       5 heartbeat[17481]: 2007/11/29_07:12:40 WARN: Logging daemon is disabled --enabling logging daemon is recommended
> > > > > >       6 heartbeat[17481]: 2007/11/29_07:12:40 info: **************************
> > > > > >       7 heartbeat[17481]: 2007/11/29_07:12:40 info: Configuration validated. Starting heartbeat 2.1.2
> > > > > >       8 heartbeat[17482]: 2007/11/29_07:12:40 info: heartbeat: version 2.1.2
> > > > > >       9 heartbeat[17482]: 2007/11/29_07:12:40 info: Heartbeat generation: 1196102397
> > > > > >      10 heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_TriggerHandler: Added signal manual handler
> > > > > >      11 heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_TriggerHandler: Added signal manual handler
> > > > > >      12 heartbeat[17482]: 2007/11/29_07:12:40 info: Removing /var/run/heartbeat/rsctmp failed, recreating.
> > > > > >      13 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
> > > > > >      14 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound send socket to device: eth0
> > > > > >      15 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound receive socket to device: eth0
> > > > > >      16 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: started on port 695 interface eth0 to 192.168.0.248
> > > > > >      17 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
> > > > > >      18 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound send socket to device: eth0
> > > > > >      19 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound receive socket to device: eth0
> > > > > >      20 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: started on port 695 interface eth0 to 192.168.0.249
> > > > > >      21 heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_SignalHandler: Added signal handler for signal 17
> > > > > >      22 heartbeat[17482]: 2007/11/29_07:12:40 info: Local status now set to: 'up'
> > > > > >      23 heartbeat[17482]: 2007/11/29_07:12:41 info: Link drbd01.test.spammatters.local:eth0 up.
> > > > > >      24 heartbeat[17482]: 2007/11/29_07:12:41 info: Status update for node drbd01.test.spammatters.local: status up
> > > > > >      25 heartbeat[17482]: 2007/11/29_07:13:45 info: all clients are now paused
> > > > > >      26 heartbeat[17482]: 2007/11/29_07:13:45 debug: hist->ackseq =0
> > > > > >      27 heartbeat[17482]: 2007/11/29_07:13:45 debug: hist->lowseq =0, hist->hiseq=101
> > > > > >      28 heartbeat[17482]: 2007/11/29_07:13:45 debug: expecting from drbd01.test.spammatters.local
> > > > > >      29 heartbeat[17482]: 2007/11/29_07:13:45 debug: it's ackseq=0
> > > >
> > > > heartbeat is getting no packet acknowledgements from drbd01. It
> > > > must be a communication problem. Looks like drbd02 doesn't see
> > > > packets coming from drbd01, assuming that it's sending them,
> > > > which it does if there are no errors reported in drbd01.
> > >
> > >
> > > Wouldn't this be the case if crmd crashes? Could this be related to
> > > "stonith -h" seg-faulting and the missing processes (crmd, cib, attrd,
> > > ccm, lrmd, mgmtd, stonithd) which I can see on the other node?
> >
> > No. There's an IPC layer which is used by heartbeat (the process)
> > only. If that doesn't work, it won't start other programs.
> 
> 
> I did some more experimentation - I installed a third machine identical to
> the second one but still get the same results.

Then perhaps the problem is on the good host. Did you try to make
a cluster of only the second and the third host?

> One thing that I managed to change (on both the new machine and the previous
> "secondary") is that by moving aside the content of
> /usr/lib64/stonith/plugins/stonith2 and leaving only the "null" plugin in
> there I could get rid of the "stonith -h" segmentation fault (and I don't
> have any of the devices these plugins talk to anyway).

The stonith program problem is definitely annoying, but it is not
going to influence your cluster in any way.

> But still I don't see crmd and friends on any machine except for the
> primary.
> 
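
To make "crmd and friends" concrete, here is a minimal sketch that reports which of the children from the pstree output above are absent on a node. The function names are made up for illustration, and reading /proc/<pid>/comm is an assumption about the kernel in use (older kernels do not provide that file), so treat it as a starting point only:

```python
import os

# Children seen under heartbeat in the pstree output from the working node.
EXPECTED = ("attrd", "ccm", "cib", "crmd", "lrmd", "mgmtd", "stonithd")


def missing_children(running_names, expected=EXPECTED):
    """Return the expected process names that do not appear in running_names."""
    running = set(running_names)
    return [name for name in expected if name not in running]


def running_process_names(proc_root="/proc"):
    """Collect process names from /proc/<pid>/comm (assumes a Linux kernel that has it)."""
    names = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                names.append(fh.read().strip())
        except OSError:
            continue  # the process exited or is not readable
    return names


if __name__ == "__main__":
    absent = missing_children(running_process_names())
    print("missing:", ", ".join(absent) if absent else "none")
```

On the node that misbehaves, this would be expected to list all seven names as long as only the heartbeat processes themselves are up.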
> > Anyway, you would definitely see error messages if a program
> > can't be started.
> 
> 
> Where should I look for them? The init.d script forwards almost everything
> to /dev/null.

It's nothing to do with the init script. The heartbeat MCP
(master control process) starts all other processes itself. The
default syslog facility is daemon (2.0.x releases had local7).
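
If it helps, here is a minimal sketch for pulling the warnings and errors out of whatever file your syslog configuration routes the daemon facility to. It assumes the lines look like the excerpt you posted (program[pid]: date_time LEVEL: message); WARN is taken from that excerpt, while ERROR and CRIT are assumed level names, and the function name is only illustrative:

```python
import re
import sys

# Matches lines shaped like the excerpt quoted earlier in this thread, e.g.
#   heartbeat[17481]: 2007/11/29_07:12:40 WARN: File /etc/ha.d/haresources exists.
# search() is used below so that any prefix added by syslog is tolerated.
LINE_RE = re.compile(
    r"(?P<prog>[\w.-]+)\[(?P<pid>\d+)\]:\s+"
    r"(?P<stamp>\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Za-z]+):\s*(?P<msg>.*)"
)

# WARN appears in the excerpt; ERROR and CRIT are assumptions.
INTERESTING = {"WARN", "ERROR", "CRIT"}


def interesting_lines(lines):
    """Return (prog, pid, level, msg) tuples for lines at an interesting level."""
    hits = []
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group("level").upper() in INTERESTING:
            hits.append(
                (m.group("prog"), int(m.group("pid")), m.group("level"), m.group("msg"))
            )
    return hits


if __name__ == "__main__":
    # Reads the log from standard input.
    for prog, pid, level, msg in interesting_lines(sys.stdin):
        print(f"{prog}[{pid}] {level}: {msg}")
```

Feed it the log on standard input. If your syslog daemon writes the lines in a different shape, the regular expression will need adjusting.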

Thanks,

Dejan

> > > I'll try again with the default port, in case this matters.
> > >
> > No, it shouldn't matter.
> 
> 
> Apparently it didn't matter :^)
> 
> Thanks very much for your time.
> 
> --Amos
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
