On 2008-01-16T18:48:06, Keisuke MORI <[EMAIL PROTECTED]> wrote:
> Hello all,
>
> We have developed a new feature that detects a process failure directly
> to reduce the failover time.
>
> If you're interested in, please try this and give me your comments.
>
> See attached README for details about how to use this.
> The patch is made for heartbeat-2.1.3.
Hi Keisuke, instead of introducing a secondary process monitoring layer
which has to be configured additionally, why not instead enhance the
existing RAs to use a faster process checking technique?
If you are going this way, I think the RAs starting the individual
[services should sign in and tell procd (if it's running) what they want
monitored, and what rsc id it relates to - and, of course, notify procd
before stopping the process.
Instead of scanning /proc, which is very very Linux-specific, why not
use the async waitpid call instead? Or, you might decide to
poll()/select()/inotify() on the relevant proc dirs at least.
You can further use the async failure notification feature of the LRM to
directly tell it when a monitored process has failed; no need to do so
via the monitor call, which would then only need to be a backup function
to make sure procd is still running.
This would further reduce the error detection latency.
procd also probably should be started by a RA, not by a respawn line.
> +if [ -f ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs ]; then
> + FUNCTION_FILE="${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs"
> +elif [ -f /usr/lib64/heartbeat/ocf-shellfuncs ]; then
> + FUNCTION_FILE="/usr/lib64/heartbeat/ocf-shellfuncs"
> +elif [ -f /usr/lib/heartbeat/ocf-shellfuncs ]; then
> + FUNCTION_FILE="/usr/lib/heartbeat/ocf-shellfuncs"
> +else
> + echo "${OCF_RESOURCE_INSTANCE} ocf-shellfuncs file doesn't exist." >&2
> + exit 1
> +fi
This seems unneeded, the other RAs have the proper code too - to just
source a single file.
> +PROCD="/usr/lib/heartbeat/procd"
No hardcoded paths please.
> +<action name="monitor" timeout="60" depth="0" interval="10"
> start-delay="60" />
Start-delay shouldn't be needed.
> +#if 1
> + crm_log_init(crm_system_name, LOG_INFO, TRUE, FALSE, argc, argv);
> +#else
> + /* for before heartbeat 2.1.2 */
> + crm_log_init(crm_system_name, TRUE);
> +#endif
Please take out the compatibility code. It's not going to be merged
before 2.1.4, so we don't need to be compatible to anything before that
in the merged code.
> + /*
> + * connect with cib, and get objects's information cib.xml
> + */
This is not really a good idea. The code doesn't take the dynamic
properties of the CIB into account. Why don't you use the values
provided by the instance attributes?
> + /* get the whole of cib.xml */
> + do {
> + /*
> + * At SBY server, sometimes cib.xml doesn't exist.
> + * so, try to get cib.xml <max_failures> times.
> + */
What's this supposed to mean? The CIB is clearly available everywhere
where the CRM is running!
> + sprintf(cmdline_path, "/proc/%s/cmdline", dp->d_name);
sprintf is discouraged, please use snprintf.
Regards,
Lars
--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/