Re: [Linux-HA] When a network stops, a stop of the service of Heartbeat is late.

Dejan Muhamedagic Fri, 11 Apr 2008 04:39:55 -0700

Hi Yamauchi-san,

On Fri, Apr 11, 2008 at 05:22:37PM +0900, HIDEO YAMAUCHI wrote:
> Hi,
> 
> I set up an environment in which Heartbeat is not enabled with chkconfig.
> In this environment, the network stops before Heartbeat does.


But it should be the other way around. I would consider this
environment broken so you probably did this only for testing
purposes.

> With two nodes set up this way, I start Heartbeat.
> 
> I shut down the DC node while Heartbeat was running and the resource was
> active on it.
> But the shutdown does not complete until about 2 minutes have passed.
> 
> I confirmed the same situation more easily with the following procedure.
> 
> 1) Start Heartbeat on both nodes.
> 2) Confirm that a resource starts on the DC node.
> 3) Stop the network service on the DC node.
>   # service network stop
> 4) Stop Heartbeat on the DC node.
>   # service heartbeat stop
> 5) Stopping Heartbeat on the DC node takes approximately 2 minutes.
> 
> I want the Heartbeat service to stop in a shorter time,
> even if the network goes down before the Heartbeat service.
> 
> Can I shorten this long stop time by setting a parameter in the cib?
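
The procedure above can be timed with a short script, which makes it easier to compare runs after changing a setting. This is only a minimal sketch: the helper names `time_command` and `main` are hypothetical, and the two `service` commands are simply the ones from steps 3 and 4, assumed to be run as root on the DC node of a test cluster.

```python
import subprocess
import time

def time_command(argv):
    """Run argv and return (exit_code, elapsed_seconds)."""
    start = time.monotonic()
    result = subprocess.run(argv, check=False)
    return result.returncode, time.monotonic() - start

def main():
    # Step 3: take the network down first (this is the "broken" ordering
    # being tested, so only do this on a test node).
    subprocess.run(["service", "network", "stop"], check=False)
    # Step 4: stop Heartbeat and measure how long the stop takes.
    code, elapsed = time_command(["service", "heartbeat", "stop"])
    print(f"heartbeat stop exited with {code} after {elapsed:.0f} s")
```

`main()` is deliberately not called automatically, since it stops the network on the machine it runs on.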

There are three big time gaps where crmd was waiting:

tengine[558]: 2008/04/11_16:27:02 info: send_rsc_command: Initiating action 7: prmIpPostgreSQLDB_start_0 on dl380g5d
crmd[553]: 2008/04/11_16:27:21 info: handle_shutdown_request: Creating shutdown request for dl380g5c
tengine[558]: 2008/04/11_16:28:02 WARN: action_timer_callback: Timer popped (abort_level=1000000, complete=false)
tengine[558]: 2008/04/11_16:28:02 WARN: print_elem: Action missed its timeout[Action 7]: In-flight (id: prmIpPostgreSQLDB_start_0, loc: dl380g5d, priority: 0)

Here it waited for the start operation to finish. This is a one
minute timeout.

tengine[558]: 2008/04/11_16:31:02 WARN: global_timer_callback: Timer popped (abort_level=1000000, complete=false)
tengine[558]: 2008/04/11_16:31:02 WARN: unconfirmed_actions: Waiting on 1 unconfirmed actions

Again it waited for lrmd. This time for 3 minutes.
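
The two waits can be read straight off the tengine timestamps quoted above. As a minimal sketch (the helper names `parse_ts` and `gaps` are made up for illustration and are not part of Heartbeat), this parses the `YYYY/MM/DD_HH:MM:SS` stamps of the non-syslog log format and computes the gap between consecutive events:

```python
from datetime import datetime

# Timestamp format used in the quoted (non-syslog) Heartbeat log lines.
TS_FORMAT = "%Y/%m/%d_%H:%M:%S"

def parse_ts(stamp):
    """Parse a stamp such as '2008/04/11_16:27:02'."""
    return datetime.strptime(stamp, TS_FORMAT)

def gaps(stamps):
    """Return the seconds between each consecutive pair of stamps."""
    times = [parse_ts(s) for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]

# Stamps copied from the tengine lines quoted above.
events = [
    "2008/04/11_16:27:02",  # send_rsc_command: Initiating action 7
    "2008/04/11_16:28:02",  # action_timer_callback: Timer popped
    "2008/04/11_16:31:02",  # global_timer_callback: Timer popped
]

print(gaps(events))  # [60, 180]
```

The 60 seconds correspond to the start operation timeout, and the 180 seconds to the second wait.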

There are not many messages from the lrmd. Perhaps I should
include more. Could you please rerun this with debug set to 1?

> If it can be set with a parameter, does changing that parameter have
> any side effects?
> 
> * I used version 2.1.3 in a 64-bit environment.
> * In addition, this problem does not seem to happen much
> in a 32-bit environment.

That's interesting. I don't see how that could influence
anything.

> * I attached the log that I took.

There's a strange thing there:

crmd[553]: 2008/04/11_16:31:02 info: stop_subsystem: Sent -TERM to pengine: [559]
logd[31553]: 2008/04/11_16:31:12 debug: logd_term_action: waiting for 0 messages to be read by write process
crmd[553]: 2008/04/11_16:31:02 info: do_shutdown: Waiting for subsystems to exit

How come the logd timestamp is ten seconds off?
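
Out-of-order stamps like this are easy to miss by eye in a long log. A minimal sketch for locating them, assuming the non-syslog line layout `name[pid]: YYYY/MM/DD_HH:MM:SS ...` seen in the excerpt (the function name `backward_jumps` and the regular expression are illustrative, not part of any Heartbeat tool); it only finds the anomaly and does not explain it:

```python
import re
from datetime import datetime

# Matches the start of a log line such as "crmd[553]: 2008/04/11_16:31:02 ...".
LINE_RE = re.compile(
    r"^(?P<proc>[\w-]+)\[\d+\]: (?P<ts>\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) "
)

def backward_jumps(lines):
    """Return (earlier_proc, later_proc, seconds) for each line whose
    stamp is older than the stamp of the log line before it."""
    found = []
    prev = None  # (process name, timestamp) of the previous log line
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # wrapped continuation line, no stamp of its own
        ts = datetime.strptime(m.group("ts"), "%Y/%m/%d_%H:%M:%S")
        if prev is not None and ts < prev[1]:
            delta = int((prev[1] - ts).total_seconds())
            found.append((prev[0], m.group("proc"), delta))
        prev = (m.group("proc"), ts)
    return found

# The three messages in question; mail-wrapped continuation lines are skipped.
sample = [
    "crmd[553]: 2008/04/11_16:31:02 info: stop_subsystem: Sent -TERM to pengine: ",
    "[559]",
    "logd[31553]: 2008/04/11_16:31:12 debug: logd_term_action: waiting for 0 ",
    "messages to be read by write process",
    "crmd[553]: 2008/04/11_16:31:02 info: do_shutdown: Waiting for subsystems to exit",
]

print(backward_jumps(sample))  # [('logd', 'crmd', 10)]
```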

Also, if it's not too much trouble, could you please switch to
syslog or set syslogmsgfmt to true? I find it hard to keep
switching between two different message formats.

Thanks,

Dejan

> Regards,
> Hideo Yamauchi.


> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
