Hi,

On Wed, Dec 09, 2009 at 05:22:18PM +0100, Achim Stumpf wrote:
> Hi,
>
> Why is this script, first posted in February, still not committed
> to your development tree at
>
> http://hg.linux-ha.org/agents/file/e13565f0ea8a/heartbeat
>
> Or did I check in the wrong place?

You didn't check in the wrong place. I never actually got around to
reviewing your script; there was some discussion, which didn't look
conclusive.

To reiterate, using KILL to remove processes is definitely excessive
unless all other means have been exhausted. I still don't see why the
sequence STOP, TERM, CONT wouldn't fit.

I just reviewed the script and parts of it are unmaintainable, in
particular the sshd_stop function. That has to be significantly
simplified. The get_and_stop_pids function is recursive, but there is
no explanation of what is happening there. There's also one echo
command in there, probably a remnant of debugging.

Thanks,

Dejan

>
>
> Achim
>
>
>
> Achim Stumpf wrote:
> > Hi,
> >
> > Lars Ellenberg wrote:
> >> On Fri, Feb 06, 2009 at 03:18:55PM +0100, Achim Stumpf wrote:
> >>> Hi,
> >>>
> >>> I have written an OCF sshd RA script. It is based on the proftpd
> >>> script. Feel free to use it and commit it please.
> >>>
> >>> I have written this script with the special option
> >>> "OCF_RESKEY_killallchilds":
> >>
> >>> We have some badly written cron-like jobs here, which access our
> >>> cluster via ssh. Most of them run in loops and open ssh sessions
> >>> to the cluster again and again, and through those access the drbd
> >>> device. Or they start a loop on the cluster through ssh and the
> >>> children access the drbd device.
> >>>
> >>> With the function get_and_stop_pids I am able to get all children
> >>> of a process. But if the option is set to 0, sshd will terminate
> >>> without any of the above.
> >>>
> >>> The workaround with fuser in RA Filesystem does not solve this
> >>> issue, because the parent process starts new children which will
> >>> access the drbd device again, for example.
> >>
> >>
> >> the workaround solves it fine.
> >> if you make your "applications" "cluster aware" in the following
> >> sense:
> >>
> >> iiuc, what you do now is basically
> >>   ssh cluster "while true; do some_job_which_uses_the_drbd ; done"
> >>
> >>
> >> change that to
> >>   ssh cluster "cd /your/drbd/mount/point ;
> >>     while true; do ( some_job_which_uses_the_drbd ) ; done"
> >>
> >
> > I am working for a company in the financial industry, and these jobs
> > access the clusters via ssh, often in loops, as you and I have
> > shown above.
> >
> >>
> >> as the process (shell) the loop spawning new processes runs in
> >> now has its cwd on DRBD, the "fuser -k" will find and kill it.
> >>
> >> I think that would be much easier than modifying the ssh RA.
> >>
> >
> > If you have only a couple of scripts which you could modify
> > yourself, yes. But I am talking here about hundreds of jobs written
> > by people in my company and other companies, and if I tell them to
> > change their jobs, this would never come to an end.
> > I think it is not such a good idea to rely on code which is written
> > to access the drbd device through sshd. Someone makes a mistake and
> > a failover would fail.
> >
> > Cheers,
> >
> > Achim
> >
> >
> _______________________________________________________
> Linux-HA-Dev: [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
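
For illustration, the STOP, TERM, CONT sequence mentioned above could
look roughly like the sketch below. It is not taken from the posted RA.
The function names stop_tree and term_tree are made up for this
example, and it assumes that pgrep (from procps) is available. Each
process is stopped before its children are listed, so it cannot fork
new ones during the walk. TERM stays pending while a process is
stopped and is only acted on after CONT.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any existing resource agent.

# Send STOP to a process and, recursively, to all of its descendants.
# Prints every PID that was stopped, one per line.
stop_tree() {
    # Freeze the process first so it cannot spawn new children.
    kill -STOP "$1" 2>/dev/null || return 0
    echo "$1"
    # Assumption: pgrep -P lists the direct children of a PID.
    for child in $(pgrep -P "$1" 2>/dev/null); do
        stop_tree "$child"
    done
}

# STOP the whole tree, queue TERM for every member, then CONT them
# so that each process wakes up and handles the pending TERM.
term_tree() {
    pids=$(stop_tree "$1")
    [ -n "$pids" ] || return 0
    # $pids is left unquoted on purpose: one argument per PID.
    kill -TERM $pids 2>/dev/null || true
    kill -CONT $pids 2>/dev/null || true
    return 0
}
```

It would be called as "term_tree $pid" with the PID of the sshd
session process. KILL would then only be needed as a last resort for
anything that survives a grace period.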
