On 2008-01-22T17:37:15, Keisuke MORI <[EMAIL PROTECTED]> wrote:
> The background of why we developed this tool is that:
> 1) We want to detect a process failure asynchronously,
> not only by the periodic monitor operations, to cause a
> failover faster to minimize the service downtime.
Right, that's a good idea.
> 2) We want to make it usable as "an additional feature" for
> arbitrary applications without modifying existing RAs and
> the application itself.
I see your point, and that's certainly a valid use case.
Nevertheless, I'd argue that for our included RAs, it would be nice if
they would auto-register and make this functionality available
completely automatically. This would be easier for users, and easier to
maintain.
Even if not all RAs do that immediately, I think having simple-to-use
shell functions for them to do so would be immensely useful.
The RAs would also selectively sign in and out as resources are stopped
or started, or add/remove processes from the watch list as required by
other actions (such as promote/demote, or other extensions in the
future.) I think that would be more fine-grained.
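To make the sign-in/sign-out idea concrete, here is a rough sketch of the bookkeeping I have in mind. It is purely illustrative: the WatchList class and its sign_in/sign_out/watched methods are hypothetical names, not procd's actual interface, and I wrote it in Python rather than shell only to keep it short.

```python
class WatchList:
    """Hypothetical registry: resource name -> set of PIDs being watched."""

    def __init__(self):
        self._pids = {}

    def sign_in(self, resource, pid):
        # Called by an RA after "start" (or "promote") succeeds.
        self._pids.setdefault(resource, set()).add(pid)

    def sign_out(self, resource, pid=None):
        # With a PID: drop just that process (e.g. on "demote").
        # Without one: drop the whole resource (e.g. on "stop").
        if pid is None:
            self._pids.pop(resource, None)
            return
        pids = self._pids.get(resource)
        if pids is not None:
            pids.discard(pid)
            if not pids:
                del self._pids[resource]

    def watched(self):
        # All PIDs currently on the watch list, in ascending order.
        return sorted(p for pids in self._pids.values() for p in pids)
```

An RA's start action would call sign_in() for each process it launches, promote/demote would add or remove individual PIDs, and stop would call sign_out() with only the resource name.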
> But for those techniques, waitpid() can handle only its child
> process and it can not be used to monitor a process launched
> by heartbeat. With poll()/select()/inotify(), we cannot detect
> whether a process enters "the zombie state", as far as we have studied.
> Please let me know if I'm wrong, or there's better way to do this.
No, I think you are right. I didn't consider the Z state. It might be
possible to somehow get at that state asynchronously via inotify() or
kernel events, but I don't readily know how.
Using these async mechanisms though would provide a further speed
advantage, and reduce the load (less polling). Processes dying
completely is also, I think, more likely than processes going zombie.
Maybe a future version could combine both techniques? Use async
notifications to capture process deaths immediately, and periodically
scan (possibly at a lower frequency) for zombies.
Or leave the zombie scan (as well as checking for otherwise unresponsive
or malfunctioning processes) to the monitor op of the RA proper.
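For the periodic zombie scan, something along these lines might do. This is a minimal sketch, assuming a Linux /proc filesystem; process_state() and scan() are names I made up for illustration, and this is not how procd currently works.

```python
def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if gone."""
    try:
        with open(f"/proc/{pid}/stat", "rb") as f:
            data = f.read().decode("ascii", "replace")
    except (FileNotFoundError, ProcessLookupError):
        # The process no longer exists (or vanished while we were reading).
        return None
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so take the first field after the LAST closing parenthesis.
    fields = data[data.rfind(")") + 1:].split()
    return fields[0] if fields else None


def scan(watch_list):
    """Classify each watched PID as 'running', 'zombie', or 'gone'."""
    result = {}
    for pid in watch_list:
        state = process_state(pid)
        if state is None:
            result[pid] = "gone"
        elif state == "Z":
            result[pid] = "zombie"
        else:
            result[pid] = "running"
    return result
```

A daemon could run scan() at a low frequency and report any PID that comes back as "zombie" or "gone", while relying on the async notification for the common case of a process dying outright.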
> the procd is already using the asynchronous notification to the
> CRM in the same manner of 'crm_resource -F' command and that is
> the primary purpose of this tool.
>
> Please point out if I'm misunderstanding what you mean.
No, I misread the code. Thanks for correcting me.
> > procd also probably should be started by a RA, not by a respawn line.
> It's a respawned daemon because it can be used if you want to
> monitor two or more applications.
Agreed, but I guess making the daemon itself a managed resource would
make it possible to monitor and restart it as needed. Same as for
pingd, I think.
> Thank you again for all of your comments.
>
> I'll start to fix them and if there're further comments please
> let me know.
Thanks for this useful tool!
Regards,
Lars
--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde