Hi Lars,
thank you for your comments.
Lars Marowsky-Bree <[EMAIL PROTECTED]> writes:
> On 2008-01-16T18:48:06, Keisuke MORI <[EMAIL PROTECTED]> wrote:
>
>> Hello all,
>>
>> We have developed a new feature that detects a process failure directly
>> to reduce the failover time.
>>
>> If you're interested in, please try this and give me your comments.
>>
>> See attached README for details about how to use this.
>> The patch is made for heartbeat-2.1.3.
>
> Hi Keisuke, instead of introducing a secondary process monitoring layer
> which has to be configured additionally, why not instead enhance the
> existing RAs to use a faster process checking technique?
The background of why we developed this tool is that:
1) We want to detect a process failure asynchronously,
not only by the periodic monitor operations, to cause a
failover faster to minimize the service downtime.
2) We want to make it usable as "an additional feature" for
arbitrary applications without modifying existent RAs and
the application itself.
As for 1), the asynchronous detection can not be achieved by
the normal monitor operation of RAs. And some users who care the
behavior under the heavy load think that a monitor operation
itself is heavy because it does fork/exec some shell commands,
and that eventually delays the detection of failures.
As for 2), the current usage of this tools is that, if you have
already configured your system, just make a group your RA
with 'procdctl' RA and add some parameters in cib.xml. You don't
need to modify and test your application/RA again in order to
use this function.
>
> If you are going this way, I think the RAs starting the individual
> [services should sign in and tell procd (if it's running) what they want
> monitored, and what rsc id it relates to - and, of course, notify procd
> before stopping the process.
>
> Instead of scanning /proc, which is very very Linux-specific, why not
> use the async waitpid call instead? Or, you might decide to
> poll()/select()/inotify() on the relevant proc dirs at least.
Yes, we're aware that the current implementation is very Linux
specific, and we're grad to modify it to be more portable.
But for those techniques, waitpid() can handle only its child
process and it can not be used to monitor a process launched
by heartbeat. By using poll()/select()/inotify(), it can not be
detect if a process gets to "the zombie state" as long as we studied.
Please let me know if I'm wrong, or there's better way to do this.
For the meantime, why don't we start from "Linux only" as the
start point and fix it more portable upon users demands?
>
> You can further use the async failure notification feature of the LRM to
> directly tell it when a monitored process has failed; no need to do so
> via the monitor call, which would then only need to be a backup function
> to make sure procd is still running.
>
> This would further reduce the error detection latency.
Well, probably I'm missing your point...,
the procd is already using the asynchronous notification to the
CRM in the same manner of 'crm_resource -F' command and that is
the primary purpose of this tool.
Please point me out if I'm misunderstanding what you mean.
> procd also probably should be started by a RA, not by a respawn line.
It's a respawned daemon because it can be used if you want to
montor two or more applications.
>
>> +if [ -f ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs ]; then
>> + FUNCTION_FILE="${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs"
>> +elif [ -f /usr/lib64/heartbeat/ocf-shellfuncs ]; then
>> + FUNCTION_FILE="/usr/lib64/heartbeat/ocf-shellfuncs"
>> +elif [ -f /usr/lib/heartbeat/ocf-shellfuncs ]; then
>> + FUNCTION_FILE="/usr/lib/heartbeat/ocf-shellfuncs"
>> +else
>> + echo "${OCF_RESOURCE_INSTANCE} ocf-shellfuncs file doesn't exist." >&2
>> + exit 1
>> +fi
>
> This seems unneeded, the other RAs have the proper code too - to just
> source a single file.
The code above is from our compatibility code with 2.0.8 (because we
have a customer already providing their service with 2.0.8),
but yes, it's not needed for the future release and I will remove it.
>
>> +PROCD="/usr/lib/heartbeat/procd"
>
> No hardcoded paths please.
>
>> +<action name="monitor" timeout="60" depth="0" interval="10"
>> start-delay="60" />
>
> Start-delay shouldn't be needed.
>
>> +#if 1
>> + crm_log_init(crm_system_name, LOG_INFO, TRUE, FALSE, argc, argv);
>> +#else
>> + /* for before heartbeat 2.1.2 */
>> + crm_log_init(crm_system_name, TRUE);
>> +#endif
>
> Please take out the compatibility code. It's not going to be merged
> before 2.1.4, so we don't need to be compatible to anything before that
> in the merged code.
You're right and we'll fix them.
>
>> + /*
>> + * connect with cib, and get objects's information cib.xml
>> + */
>
> This is not really a good idea. The code doesn't take the dynamic
> properties of the CIB into account. Why don't you use the values
> provided by the instance attributes?
You're right, we didn't consider when the CIB is configured
dynamically. We will fix the design regarding to this.
>
>> + /* get the whole of cib.xml */
>> + do {
>> + /*
>> + * At SBY server, sometimes cib.xml doesn't exist.
>> + * so, try to get cib.xml <max_failures> times.
>> + */
>
> What's this supposed to mean? The CIB is clearly available everywhere
> where the CRM is running!
This is related to the previous comment and will be eliminated
after it gets fixed.
>
>> + sprintf(cmdline_path, "/proc/%s/cmdline", dp->d_name);
>
> sprintf is discouraged, please use snprintf.
Right. will fix.
Thank you again for all of your comment.
I'll start to fix them and if there're further comments please
let me know.
Regards,
Keisuke MORI
NTT DATA Intellilink Corporation
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/