On 2008-01-22T17:37:15, Keisuke MORI <[EMAIL PROTECTED]> wrote:

> The background of why we developed this tool is that:
> 1) We want to detect a process failure asynchronously,
>    not only by the periodic monitor operations, to cause a
>    failover faster to minimize the service downtime.

Right, that's a good idea.

> 2) We want to make it usable as "an additional feature" for
>    arbitrary applications without modifying existent RAs and
>    the application itself.

I see your point, and that's certainly a valid use case. 

Nevertheless, I'd argue that for our included RAs, it would be nice if
they would auto-register and make this functionality available
completely automatically. This would be easier for users, and easier to
maintain.

Even if not all RAs do that immediately, I think having simple-to-use
shell functions for them to do so would be immensely useful.

The RAs could also selectively sign in and out as resources are started
or stopped, or add/remove processes from the watch list as required by
other actions (such as promote/demote, or other extensions in the
future). I think that would give more fine-grained control.
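
As a rough illustration of the bookkeeping such sign-in/sign-out helpers
might drive, here is a minimal Python sketch. All names in it (WatchList,
sign_in, sign_out, watched) are hypothetical and are not part of procd or
any existing RA; real helpers would more likely be shell functions that
talk to the daemon.

```python
class WatchList:
    """Hypothetical per-resource watch list; not procd's actual data model."""

    def __init__(self):
        # Maps a resource ID to the set of PIDs watched on its behalf.
        self._by_resource = {}

    def sign_in(self, resource_id, pid):
        """Start watching pid for the given resource (e.g. on 'start')."""
        self._by_resource.setdefault(resource_id, set()).add(pid)

    def sign_out(self, resource_id, pid=None):
        """Stop watching one pid, or every pid of the resource if pid is None."""
        if pid is None:
            self._by_resource.pop(resource_id, None)
            return
        pids = self._by_resource.get(resource_id)
        if pids is not None:
            pids.discard(pid)
            if not pids:
                # Drop the resource entry once nothing is left to watch.
                del self._by_resource[resource_id]

    def watched(self):
        """Return a {pid: resource_id} view of everything being watched."""
        return {
            pid: resource_id
            for resource_id, pids in self._by_resource.items()
            for pid in pids
        }
```

An RA's start action would call sign_in() for each process it launches,
stop would call sign_out() for the whole resource, and promote/demote
could add or remove individual PIDs.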

> But for those techniques, waitpid() can handle only its child
> process and it can not be used to monitor a process launched
> by heartbeat. With poll()/select()/inotify(), we cannot detect
> a process entering "the zombie state", as far as we have studied.
> Please let me know if I'm wrong, or there's better way to do this.

No, I think you are right. I didn't consider the Z state. It might be
possible to somehow get at that state asynchronously via inotify() or
kernel events, but I don't readily know how.

Using these async mechanisms, though, would provide a further speed
advantage and reduce the load (less polling). Processes dying
completely is also, I think, more likely than processes going zombie.

Maybe a future version could combine both techniques? Use async
notifications to capture process deaths immediately, and periodically
scan (possibly at a lower frequency) for zombies.

Or leave the zombie scan (as well as checking for otherwise unresponsive
or malfunctioning processes) to the monitor op of the RA proper.
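
The periodic zombie scan mentioned above could look roughly like the
following Python sketch. It assumes a Linux-style /proc filesystem, where
the state letter follows the parenthesised command name in
/proc/<pid>/stat; the function names are made up for illustration and do
not reflect how procd is implemented.

```python
def parse_stat_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split after the LAST ')'.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    return rest[0]


def process_state(pid):
    """Return the state letter for pid, or None if the process is gone."""
    try:
        with open("/proc/%d/stat" % pid) as f:
            return parse_stat_state(f.read())
    except (FileNotFoundError, ProcessLookupError):
        # The process vanished before or while we were reading its entry.
        return None


def scan(pids):
    """Classify each watched PID as 'gone', 'zombie', or 'alive'."""
    result = {}
    for pid in pids:
        state = process_state(pid)
        if state is None:
            result[pid] = "gone"
        elif state == "Z":
            result[pid] = "zombie"
        else:
            result[pid] = "alive"
    return result
```

A daemon could run scan() over its watch list at a low frequency and
treat both "gone" and "zombie" as failures, while relying on an
asynchronous mechanism to report outright process deaths sooner.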

> the procd is already using the asynchronous notification to the
> CRM in the same manner of 'crm_resource -F' command and that is
> the primary purpose of this tool.
> 
> Please point me out if I'm misunderstanding what you mean.

No, I misread the code. Thanks for correcting me.

> > procd also probably should be started by a RA, not by a respawn line.
> It's a respawned daemon because it can be used if you want to
> monitor two or more applications.

Agreed, but I think making the daemon itself a managed resource would
allow it to be monitored and restarted as needed. The same goes for
pingd, I think.

> Thank you again for all of your comment.
>
> I'll start to fix them and if there're further comments please
> let me know.

Thanks for this useful tool!


Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
