So to add some color to this problem, I'm seeing these errors on box02:
ERROR: Exiting pengine process 4689 dumped core
ERROR: crmdManagedChildDied:subsystems.c The pengine subsystem terminated unexpectedly
ERROR: do_log:misc.c [[FSA]] Input I_ERROR from crmdManagedChildDied() received in state (S_TRANSITION_ENGINE)
ERROR: do_recover:control.c Action A_RECOVER (0000000001000000) not supported
ERROR: do_log:misc.c [[FSA]] Input I_STOP from do_recover() received in state (S_RECOVERY)

This is a two-box setup migrating an IP and a service (OpenSER) so that they run on one of the nodes at any given time. Box01 doesn't see any errors, and just leaving the nodes alone to work it out doesn't work.

I am unclear on the cib.xml because I haven't seen it work any other way. Is it normal to have that file rewritten every 5 seconds or so? That seems abnormal to me, but I can't get it confirmed in the docs.

I'm happy to grab any other information if it might help.

Thanks.

-Anders

On 6/12/07, Anders Brownworth <[EMAIL PROTECTED]> wrote:
Andrew,

Sorry about the formatting. Attached are the /var/log/messages output of box01 and box02 as well as the cib.xml. Also, this is heartbeat version 2.0.7.

As you can see in the messages, this isn't a problem that gets worked out. Both machines keep writing and re-writing their cib.xml files all day long. I gather this isn't normal.

Thanks for any light you can shed.

-Anders

On 6/12/07, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
>
> On 6/12/07, Anders Brownworth <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I have a heartbeat v2 setup that I am trying to use to migrate OpenSER
> > and an IP address back and forth between 2 boxes (box01 and box02). I
> > wrote an OCF for OpenSER and am using the heartbeat-provided IPaddr to
> > manage the IP address. My OCF checks out with the ocf-tester script and
> > my cib.xml verifies with crm_verify -x /var/lib/heartbeat/crm/cib.xml.
> >
> > When I start both nodes with exactly the same configuration, they fight
> > about the state of things, and the first ERROR I get is:
> >
> > crmd: [3507]: ERROR: do_exit:control.c Could not recover from internal error
> >
> > in the /var/log/messages of box02, and neither service is started (IP
> > nor OpenSER). Both boxes give off a pile of info and warning messages
> > that don't seem to point me in a worthwhile direction.
> >
> > Both boxes seem to get to "info: main:attrd.c Starting mainloop..."
> > without any issues. But once they try to decide on who has what, they
> > start fighting and constantly rewriting their cib.xml files over and
> > over, endlessly. (Is this normal?) crmd starts complaining about pengine
> > dying on a signal 14 and everything pretty much goes to hell from there.
>
> it _should_ be able to recover.
>
> can you send your logs as attachments?
> (trying to read logs wrapped at 80 chars is hell)
>
> you also didn't mention what version you're running
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>

--
-Anders
-----------------------------------------------------------
Anders Brownworth
[EMAIL PROTECTED]
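On the question raised in the thread of whether cib.xml really is being rewritten every 5 seconds or so: one way to put numbers on it is to poll the file's modification time and record the gaps between changes. The following is only a rough sketch, not anything heartbeat provides. The function name rewrite_intervals and its parameters are made up for illustration; the path in the usage comment is the one given to crm_verify in the original message.

```python
import os
import time
from typing import Callable, List


def rewrite_intervals(
    path: str,
    samples: int = 5,
    poll: float = 0.5,
    stat: Callable = os.stat,
    sleep: Callable = time.sleep,
) -> List[float]:
    """Return the seconds between successive mtime changes of `path`.

    Hypothetical helper: polls the file every `poll` seconds and stops
    once `samples` changes have been observed. `stat` and `sleep` are
    injectable only so the function can be exercised without a real file.
    """
    last = stat(path).st_mtime
    intervals: List[float] = []
    while len(intervals) < samples:
        sleep(poll)
        current = stat(path).st_mtime
        if current != last:
            # The file was modified (or replaced) since the last change seen.
            intervals.append(current - last)
            last = current
    return intervals


# Example use on a node (path from the crm_verify command in the thread):
# print(rewrite_intervals("/var/lib/heartbeat/crm/cib.xml"))
```

If the intervals come back clustered around a few seconds on both nodes, that would at least confirm the rewrite loop described above and give a concrete figure to include alongside the attached logs. Note that the call blocks until the requested number of changes has been seen.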
