On Mon, Dec 27, 2010 at 08:02:12PM +0800, Bin Chen(sunwen_ling) wrote:
> On Mon, Dec 27, 2010 at 7:14 PM, Dejan Muhamedagic <[email protected]> wrote:
> 
> > Hi,
> >
> > On Mon, Dec 27, 2010 at 04:27:14PM +0800, Bin Chen(sunwen_ling) wrote:
> > > Hi guys,
> > >
> > > I installed Linux Heartbeat on one machine. The problem is that a few
> > > seconds after we start Heartbeat, the machine reboots. I can see the
> > > cause is that the configuration cib.xml is not valid, but is it correct
> > > behavior for a machine with an invalid cib.xml to be reset? By the way,
> > > STONITH is disabled; the log is attached. If this behavior is intended,
> > > can I disable the reset? We are on a server machine, and the reboot
> > > process is painfully slow.
> >
> > I guess that this is because of the "crm yes" directive in ha.cf.
> > You can change it to "crm respawn".
> >
> > Thanks,
> >
> > Dejan
> >
> 
> Thanks Dejan. Can you please explain why this option causes the machine
> to be reset?

When a critical process exits, it is safer to restart the node. With
"crm respawn", Heartbeat restarts the failed process instead of
rebooting.

Thanks,

Dejan
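For reference, the change suggested above is a one-line edit to ha.cf. A minimal sketch, shown on a scratch copy so it is safe to run anywhere (on a real node the file is normally /etc/ha.d/ha.cf; adjust the path for your installation):

```shell
# Switch Heartbeat's crm directive from "yes" to "respawn".
# The scratch file stands in for /etc/ha.d/ha.cf (assumed default path).
cf=$(mktemp)
printf 'use_logd yes\ncrm yes\n' > "$cf"
sed -i 's/^crm[[:space:]]\{1,\}yes[[:space:]]*$/crm respawn/' "$cf"
grep '^crm' "$cf"   # prints: crm respawn
```

Heartbeat has to be restarted before the new directive takes effect.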
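The log also shows a permissions problem before the validation errors: crm_is_writable complains that /var/lib/heartbeat/crm/cib.xml must be owned and r/w by user hacluster. A sketch of the fix, demonstrated on a scratch file since the hacluster user will not exist outside a cluster node:

```shell
# cib.xml should be owner read/write only; on a real node (as root):
#   chown hacluster:haclient /var/lib/heartbeat/crm/cib.xml
#   chmod 0600 /var/lib/heartbeat/crm/cib.xml
# (hacluster:haclient is the usual Heartbeat/Pacemaker owner pair.)
cib=$(mktemp)
chmod 0600 "$cib"
stat -c '%a' "$cib"   # prints: 600
```

Once permissions are fixed, the CIB content itself still has to validate; running crm_verify -x /var/lib/heartbeat/crm/cib.xml -V before restarting should report the same schema errors seen in the log without starting the cluster.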

> Bin
> 
> >
> > > Thanks.
> > > Bin
> > >
> > > Dec 27 23:40:57 ucs22 lrmd: [4993]: WARN: Initializing connection to logging daemon failed. Logging daemon may not be running
> > > Dec 27 23:40:57 ucs22 lrmd: [4993]: info: G_main_add_SignalHandler: Added signal handler for signal 15
> > > Dec 27 23:40:57 ucs22 lrmd: [4993]: info: G_main_add_SignalHandler: Added signal handler for signal 17
> > > Dec 27 23:40:57 ucs22 lrmd: [4993]: info: enabling coredumps
> > > Dec 27 23:40:57 ucs22 lrmd: [4993]: info: G_main_add_SignalHandler: Added signal handler for signal 10
> > > Dec 27 23:40:57 ucs22 lrmd: [4993]: info: G_main_add_SignalHandler: Added signal handler for signal 12
> > > Dec 27 23:40:57 ucs22 lrmd: [4993]: info: Started.
> > > Dec 27 23:40:57 ucs22 stonithd: [4994]: WARN: Initializing connection to logging daemon failed. Logging daemon may not be running
> > > Dec 27 23:40:57 ucs22 stonithd: [4994]: info: G_main_add_SignalHandler: Added signal handler for signal 10
> > > Dec 27 23:40:57 ucs22 stonithd: [4994]: info: G_main_add_SignalHandler: Added signal handler for signal 12
> > > Dec 27 23:40:57 ucs22 attrd: [4995]: info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/hacluster
> > > Dec 27 23:40:57 ucs22 attrd: [4995]: WARN: Initializing connection to logging daemon failed. Logging daemon may not be running
> > > Dec 27 23:40:57 ucs22 attrd: [4995]: info: Invoked: /usr/lib64/heartbeat/attrd
> > > Dec 27 23:40:57 ucs22 cib: [4992]: info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/hacluster
> > > Dec 27 23:40:57 ucs22 crmd: [4996]: info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/hacluster
> > > Dec 27 23:40:57 ucs22 ccm: [4991]: WARN: Initializing connection to logging daemon failed. Logging daemon may not be running
> > > Dec 27 23:40:57 ucs22 stonithd: [4994]: info: register_heartbeat_conn: Hostname: ucs22
> > > Dec 27 23:40:57 ucs22 attrd: [4995]: info: main: Starting up
> > > Dec 27 23:40:57 ucs22 cib: [4992]: WARN: Initializing connection to logging daemon failed. Logging daemon may not be running
> > > Dec 27 23:40:57 ucs22 crmd: [4996]: WARN: Initializing connection to logging daemon failed. Logging daemon may not be running
> > > Dec 27 23:40:57 ucs22 ccm: [4991]: info: Hostname: ucs22
> > > Dec 27 23:40:57 ucs22 stonithd: [4994]: info: register_heartbeat_conn: UUID: b8fc4074-c40e-48e4-80ad-a9b63fd4bf77
> > > Dec 27 23:40:57 ucs22 cib: [4992]: info: Invoked: /usr/lib64/heartbeat/cib
> > > Dec 27 23:40:57 ucs22 crmd: [4996]: info: Invoked: /usr/lib64/heartbeat/crmd
> > >
> > > Dec 27 23:40:57 ucs22 stonithd: [4994]: info: crm_cluster_connect: Connecting to Heartbeat
> > > Dec 27 23:40:57 ucs22 cib: [4992]: info: G_main_add_TriggerHandler: Added signal manual handler
> > > Dec 27 23:40:57 ucs22 crmd: [4996]: info: main: CRM Hg Version: da7075976b5ff0bee71074385f8fd02f296ec8a3
> > > Dec 27 23:40:57 ucs22 cib: [4992]: info: G_main_add_SignalHandler: Added signal handler for signal 17
> > > Dec 27 23:40:57 ucs22 crmd: [4996]: info: crmd_init: Starting crmd
> > > Dec 27 23:40:57 ucs22 cib: [4992]: ERROR: crm_is_writable: /var/lib/heartbeat/crm/cib.xml must be owned and r/w by user hacluster
> > > Dec 27 23:40:57 ucs22 crmd: [4996]: info: G_main_add_SignalHandler: Added signal handler for signal 17
> > > Dec 27 23:40:57 ucs22 cib: [4992]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
> > > Dec 27 23:40:57 ucs22 cib: [4992]: WARN: validate_cib_digest: No on-disk digest present
> > > Dec 27 23:40:57 ucs22 cib: [4992]: ERROR: Expecting an element nodes, got nothing
> > > Dec 27 23:40:57 ucs22 cib: [4992]: ERROR: Invalid sequence in interleave
> > > Dec 27 23:40:57 ucs22 cib: [4992]: ERROR: Element configuration failed to validate content
> > > Dec 27 23:40:57 ucs22 cib: [4992]: ERROR: Element cib failed to validate content
> > > Dec 27 23:40:57 ucs22 cib: [4992]: ERROR: readCibXmlFile: CIB does not validate with pacemaker-1.0
> > > Dec 27 23:40:57 ucs22 cib: [4992]: info: startCib: CIB Initialization completed successfully
> > > Dec 27 23:40:57 ucs22 attrd: [4995]: info: register_heartbeat_conn: Hostname: ucs22
> > > Dec 27 23:40:57 ucs22 attrd: [4995]: info: register_heartbeat_conn: UUID: b8fc4074-c40e-48e4-80ad-a9b63fd4bf77
> > > Dec 27 23:40:57 ucs22 attrd: [4995]: info: crm_cluster_connect: Connecting to Heartbeat
> > > Dec 27 23:40:57 ucs22 attrd: [4995]: info: main: Cluster connection active
> > > Dec 27 23:40:57 ucs22 attrd: [4995]: info: main: Accepting attribute updates
> > > Dec 27 23:40:57 ucs22 attrd: [4995]: info: main: Starting mainloop...
> > > Dec 27 23:40:57 ucs22 stonithd: [4994]: notice: /usr/lib64/heartbeat/stonithd start up successfully.
> > > Dec 27 23:40:57 ucs22 stonithd: [4994]: info: G_main_add_SignalHandler: Added signal handler for signal 17
> > > Dec 27 23:40:57 ucs22 cib: [4992]: info: register_heartbeat_conn: Hostname: ucs22
> > > Dec 27 23:40:57 ucs22 cib: [4992]: info: register_heartbeat_conn: UUID: b8fc4074-c40e-48e4-80ad-a9b63fd4bf77
> > > Dec 27 23:40:57 ucs22 cib: [4992]: info: crm_cluster_connect: Connecting to Heartbeat
> > > Dec 27 23:40:57 ucs22 cib: [4992]: info: ccm_connect: Registering with CCM...
> > > Dec 27 23:40:57 ucs22 cib: [4992]: WARN: ccm_connect: CCM Activation failed
> > > Dec 27 23:40:57 ucs22 cib: [4992]: WARN: ccm_connect: CCM Connection failed 1 times (30 max)
> > > Dec 27 23:40:58 ucs22 crmd: [4996]: info: do_cib_control: Could not connect to the CIB service: connection failed
> > > Dec 27 23:40:58 ucs22 crmd: [4996]: WARN: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
> > > Dec 27 23:40:58 ucs22 crmd: [4996]: info: crmd_init: Starting crmd's mainloop
> > > Dec 27 23:40:58 ucs22 ccm: [4991]: info: G_main_add_SignalHandler: Added signal handler for signal 15
> > > Dec 27 23:41:00 ucs22 crmd: [4996]: info: crm_timer_popped: Wait Timer (I_NULL) just popped!
> > > Dec 27 23:41:00 ucs22 cib: [4992]: info: ccm_connect: Registering with CCM...
> > > Dec 27 23:41:00 ucs22 cib: [4992]: info: cib_init: Requesting the list of configured nodes
> > > Dec 27 23:41:00 ucs22 cib: [4992]: info: cib_init: Starting cib mainloop
> > > Dec 27 23:41:00 ucs22 cib: [4992]: info: cib_client_status_callback: Status update: Client ucs22/cib now has status [join]
> > > Dec 27 23:41:00 ucs22 cib: [4992]: info: crm_new_peer: Node 0 is now known as ucs22
> > > Dec 27 23:41:00 ucs22 cib: [4992]: info: crm_update_peer_proc: ucs22.cib is now online
> > > Dec 27 23:41:01 ucs22 cib: [4992]: info: cib_client_status_callback: Status update: Client ucs22/cib now has status [online]
> > > Dec 27 23:41:01 ucs22 crmd: [4996]: info: do_cib_control: CIB connection established
> > > Dec 27 23:41:01 ucs22 cib: [4992]: ERROR: cib_process_request: Operation ignored, cluster configuration is invalid. Please repair and restart: Update does not conform to the configured schema/DTD
> > > Dec 27 23:41:01 ucs22 cib: [4992]: info: cib_client_status_callback: Status update: Client ucs26/cib now has status [online]
> > > Dec 27 23:41:01 ucs22 cib: [4992]: info: crm_new_peer: Node 0 is now known as ucs26
> > > Dec 27 23:41:01 ucs22 cib: [4992]: info: crm_update_peer_proc: ucs26.cib is now online
> > > Dec 27 23:41:01 ucs22 crmd: [4996]: info: register_heartbeat_conn: Hostname: ucs22
> > > Dec 27 23:41:01 ucs22 crmd: [4996]: info: register_heartbeat_conn: UUID: b8fc4074-c40e-48e4-80ad-a9b63fd4bf77
> > > Dec 27 23:41:01 ucs22 crmd: [4996]: info: crm_cluster_connect: Connecting to Heartbeat
> > > Dec 27 23:41:02 ucs22 crmd: [4996]: info: do_ha_control: Connected to the cluster
> > > Dec 27 23:41:02 ucs22 crmd: [4996]: info: do_ccm_control: CCM connection established... waiting for first callback
> > > Dec 27 23:41:02 ucs22 crmd: [4996]: info: do_started: Delaying start, CCM (0000000000100000) not connected
> > > Dec 27 23:41:02 ucs22 cib: [4992]: ERROR: cib_process_request: Operation ignored, cluster configuration is invalid. Please repair and restart: Update does not conform to the configured schema/DTD
> > > Dec 27 23:41:02 ucs22 crmd: [4996]: ERROR: config_query_callback: Local CIB query resulted in an error: Update does not conform to the configured schema/DTD
> > > Dec 27 23:41:02 ucs22 crmd: [4996]: ERROR: config_query_callback: The cluster is mis-configured - shutting down and staying down
> > > Dec 27 23:41:02 ucs22 crmd: [4996]: notice: crmd_client_status_callback: Status update: Client ucs22/crmd now has status [online] (DC=false)
> > > Dec 27 23:41:02 ucs22 attrd: [4995]: info: cib_connect: Connected to the CIB after 1 signon attempts
> > > Dec 27 23:41:02 ucs22 attrd: [4995]: info: cib_connect: Sending full refresh
> > > Dec 27 23:41:02 ucs22 crmd: [4996]: info: crm_new_peer: Node 0 is now known as ucs22
> > > Dec 27 23:41:02 ucs22 crmd: [4996]: info: crm_update_peer_proc: ucs22.crmd is now online
> > > Dec 27 23:41:02 ucs22 crmd: [4996]: info: crmd_client_status_callback: Not the DC
> > > Dec 27 23:41:02 ucs22 crmd: [4996]: notice: crmd_client_status_callback: Status update: Client ucs22/crmd now has status [online] (DC=false)
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: info: crmd_client_status_callback: Not the DC
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: notice: crmd_client_status_callback: Status update: Client ucs26/crmd now has status [online] (DC=false)
> > > Dec 27 23:41:03 ucs22 cib: [4992]: WARN: cib_peer_callback: Discarding cib_apply_diff message (808) from ucs26: not in our membership
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: info: crm_new_peer: Node 0 is now known as ucs26
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: info: crm_update_peer_proc: ucs26.crmd is now online
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: info: crmd_client_status_callback: Not the DC
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: ERROR: do_log: FSA: Input I_ERROR from config_query_callback() received in state S_STARTING
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: info: do_state_transition: State transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=config_query_callback ]
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: ERROR: do_log: FSA: Input I_ERROR from revision_check_callback() received in state S_RECOVERY
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: info: do_dc_release: DC role released
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: info: do_te_control: Transitioner is now inactive
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: ERROR: do_started: Start cancelled... S_RECOVERY
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: info: do_shutdown: All subsystems stopped, continuing
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: info: do_lrm_control: Disconnected from the LRM
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: info: do_ha_control: Disconnected from Heartbeat
> > > Dec 27 23:41:03 ucs22 ccm: [4991]: info: client (pid=4996) removed from ccm
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: info: do_cib_control: Disconnecting CIB
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: info: crmd_cib_connection_destroy: Connection to the CIB terminated...
> > > Dec 27 23:41:03 ucs22 cib: [4992]: ERROR: cib_process_request: Operation ignored, cluster configuration is invalid. Please repair and restart: Update does not conform to the configured schema/DTD
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
> > > Dec 27 23:41:03 ucs22 cib: [4992]: WARN: send_ipc_message: IPC Channel to 4996 is not connected
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: ERROR: do_exit: Could not recover from internal error
> > > Dec 27 23:41:03 ucs22 cib: [4992]: WARN: send_via_callback_channel: Delivery of reply to client 4996/78d426cb-9410-4d60-99fd-42fa190683c7 failed
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: WARN: do_exit: Inhibiting respawn by Heartbeat
> > > Dec 27 23:41:03 ucs22 cib: [4992]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: info: free_mem: Dropping I_RELEASE_SUCCESS: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_dc_release ]
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
> > > Dec 27 23:41:03 ucs22 crmd: [4996]: info: do_exit: [crmd] stopped (100)
> > > Dec 27 23:41:04 ucs22 kernel: device eth2 entered promiscuous mode
> > > Dec 27 23:41:05 ucs22 kernel: md: stopping all md devices.
> > > Dec 27 23:41:06 ucs22 kernel: Synchronizing SCSI cache for disk sdb:
> > > Dec 27 23:41:06 ucs22 kernel: Synchronizing SCSI cache for disk sda:
> > > Dec 27 23:41:06 ucs22 kernel: ACPI: PCI interrupt for device 0000:12:00.1 disabled
> > > Dec 27 23:41:06 ucs22 kernel: ACPI: PCI interrupt for device 0000:12:00.0 disabled
> > > Dec 27 23:41:06 ucs22 kernel: ACPI: PCI interrupt for device 0000:08:00.1 disabled
> > > Dec 27 23:41:06 ucs22 kernel: ACPI: PCI interrupt for device 0000:08:00.0 disabled
> > > Dec 27 23:41:06 ucs22 kernel: usb 8-1: new full speed USB device using uhci_hcd and address 2
> > > Dec 27 23:41:06 ucs22 kernel: ACPI: PCI interrupt for device 0000:05:00.1 disabled
> > > Dec 27 23:41:06 ucs22 kernel: usb 8-1: not running at top speed; connect to a high speed hub
> > > Dec 27 23:41:06 ucs22 kernel: usb 8-1: configuration #1 chosen from 1 choice
> > > Dec 27 23:41:06 ucs22 kernel: hub 8-1:1.0: USB hub found
> > > Dec 27 23:41:06 ucs22 kernel: hub 8-1:1.0: 4 ports detected
> > > Dec 27 23:41:06 ucs22 kernel: ACPI: PCI interrupt for device 0000:05:00.0 disabled
> > > Dec 27 23:47:35 ucs22 syslogd 1.4.1: restart.
> > > _______________________________________________
> > > Linux-HA mailing list
> > > [email protected]
> > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > See also: http://linux-ha.org/ReportingProblems
> >
