Dear Laurence,

your last lines are exactly what we need! Thank you for this.

> set remote = "/bin/csh $WIENROOT/pbsh"
>
> $WIENROOT/pbsh is just
> mpirun -x LD_LIBRARY_PATH -x PATH -np 1 --host $1 /bin/csh -c " $2 "

I will try it, but I am pretty sure that it will work fine.

Regards,
Florent

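For anyone who wants to try the wrapper in isolation, here is a minimal sketch of it as a POSIX sh function. Only the mpirun line comes from the thread; the function name pbsh_launch and the argument check are hypothetical additions, and the sketch assumes an Open MPI mpirun on the PATH. The original pbsh is run through /bin/csh, so this is an illustration and not a drop-in replacement.

```sh
#!/bin/sh
# Sketch of the pbsh wrapper quoted above, rewritten as a POSIX sh function.
# pbsh_launch is a hypothetical name; the original file is the single mpirun line.
# Usage: pbsh_launch <host> <command>
pbsh_launch() {
    if [ "$#" -ne 2 ]; then
        echo "usage: pbsh_launch <host> <command>" >&2
        return 2
    fi
    # Start one process on the given host through mpirun instead of ssh,
    # exporting LD_LIBRARY_PATH and PATH to the remote side.
    mpirun -x LD_LIBRARY_PATH -x PATH -np 1 --host "$1" /bin/csh -c " $2 "
}
```

Calling pbsh_launch with a host name and a quoted command would then play the role that "ssh host command" plays in the default setup.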
On 05/01/2012 20:16, Laurence Marks wrote:
> I gave a slightly jetlagged response -- for certain, WIEN2k style works
> fine with all queuing systems.
>
> But... it may not fit how the queuing system has been designed, and
> admins may not be accommodating. My understanding (second hand) is that
> torque is designed to work well with openmpi for accounting, and by
> default knows nothing about tasks created by ssh. When the user's time
> has elapsed it will terminate those tasks it knows about (the main one
> plus anything using mpirun) and ignore anything else. Hence for
> clusters where killing an ssh on node A does not propagate a kill to
> children on node B (which depends upon the ssh) one is left with
> processes that can run forever. There is something called an epilog
> script which maybe can do this, but it would need WIEN2k to create one
> every time it launches a set of tasks. Possible, but not trivial.
>
> Note: this is not just a WIEN2k problem. One of the admins of NU's
> large cluster is a friend and he tells me that every now and then he
> goes around and tries to clean up tasks left running like this on
> nodes from all sorts of software. Sometimes he has to reboot nodes,
> since if torque believes there is nothing running on a node it will
> merrily create more tasks on it, which can lead to heavy
> oversubscription and hang the node.
>
> And... just to make life more fun, torque knows nothing about MKL
> threading, so on an 8-core node it can easily start 8 different non-mpi
> jobs and if they all want 8 threads...
>
> Probably too long a response. Below is the parallel_options file that
> I use on a system with moab (similar, perhaps worse than pbs) where I
> try and be a "gentleman" and set the mkl threading as well as use
> mpirun to launch tasks.
>
> setenv USE_REMOTE 1
> setenv MPI_REMOTE 0
> setenv WIEN_GRANULARITY 1
> setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_ -machinefile _HOSTS_ _EXEC_"
> set a=`grep -e "1:" .machines | grep -v lapw0 | head -1 | cut -f 3 -d: | cut -c 1-2`
> setenv MKL_NUM_THREADS $a
> setenv OMP_NUM_THREADS $a
> setenv MKL_DYNAMIC FALSE
> if (-e local_options ) source local_options
> set remote = "/bin/csh $WIENROOT/pbsh"
> set delay = 0.25
>
> $WIENROOT/pbsh is just
> mpirun -x LD_LIBRARY_PATH -x PATH -np 1 --host $1 /bin/csh -c " $2 "
>
> With this at least I don't create problems (hopefully).
>
> On Thu, Jan 5, 2012 at 7:19 AM, Peter Blaha
> <pblaha at theochem.tuwien.ac.at> wrote:
>> It is NOT true that queuing systems cannot do the "WIEN2k style".
>>
>> We have two big clusters and run on them all three types of jobs,
>> i) only ssh (k-parallel), ii) only mpi-parallel (no mpi) and also
>> of mixed type.
>>
>> And of course the administrators configured the "sun grid engine" so that
>> it makes sure that there are no processes running when a job finishes and
>> eventually kills all processes of a batch job on all the assigned nodes
>> after it has finished.
>>
>> It's just a matter of whether the system programmers are willing (or
>> able ??) to reconfigure the queuing system...
>>
>> PS: If you are running mpi-parallel use setenv MPI_REMOTE 0 in
>> $WIENROOT/parallel_options and ssh will not be used anyway.
>>
>> On 05.01.2012 13:17, Laurence Marks wrote:
>>> As Florent said, this is a known issue with some (not all) versions of ssh,
>>> and it is also a torque bug. What you have to do is use mpirun instead of
>>> ssh to launch jobs, which I think you can do by setting the
>>> MPI_REMOTE/USE_REMOTE switches. I think I posted how to do this some time
>>> ago, so please search the mailing list. (I am in China and can provide more
>>> information next week when I return if this is not enough, which it
>>> probably is not.)
>>>
>>> N.B., in case anyone wonders, with torque (PBS) you are not "supposed to"
>>> use ssh to communicate the way WIEN2k does. They are not going to move on
>>> this, so this is "WIEN2k's fault". I've looked into this quite a bit and
>>> there is no solution except to avoid ssh (or live with zombie processes).
>>> Indeed, torque has the weakness of leaving processes around if a code does
>>> anything more adventurous than just run a single mpirun -- so it goes.
>>>
>>> On Thu, Jan 5, 2012 at 3:22 AM, Peter Blaha
>>> <pblaha at theochem.tuwien.ac.at> wrote:
>>>> I've never done this myself, but as far as I know one can define a
>>>> "prolog" script in all those queuing systems, and this prolog script
>>>> should ssh to all assigned nodes and kill all remaining jobs of this user.
>>>>
>>>> On 05.01.2012 10:17, Florent Boucher wrote:
>>>>> Dear Yundi,
>>>>> this is a known limitation of ssh and rsh, which do not pass the
>>>>> interrupt signal to the remote host.
>>>>> Under LSF I had a solution in the past. It was a specific rshlsf for
>>>>> doing this.
>>>>> Currently I use either SGE or PBS on two different clusters and the
>>>>> problem exists.
>>>>> You will see that you are not even able to suspend a running job.
>>>>> If someone has a solution, I will also appreciate it.
>>>>> Regards
>>>>> Florent
>>>>>
>>>>> On 04/01/2012 21:57, Yundi Quan wrote:
>>>>>> I'm working on a cluster using the torque queue system. I can directly
>>>>>> ssh to any node without using a password. When I use qdel (or
>>>>>> canceljob) jobid to terminate a running job, the job will be terminated
>>>>>> in the queue system. However, when I ssh to the nodes, the jobs are
>>>>>> still running.
>>>>>> Does anyone know how to avoid this?
>>>>>>
>>>>>> _______________________________________________
>>>>>> Wien mailing list
>>>>>> Wien at zeus.theochem.tuwien.ac.at
>>>>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>>>>
>>>>> --
>>>>> ------------------------------------------------------------------------------
>>>>> | Florent BOUCHER                    |                                       |
>>>>> | Institut des Matériaux Jean Rouxel | Mailto:Florent.Boucher at cnrs-imn.fr |
>>>>> | 2, rue de la Houssinière           | Phone: (33) 2 40 37 39 24             |
>>>>> | BP 32229                           | Fax:   (33) 2 40 37 39 95             |
>>>>> | 44322 NANTES CEDEX 3 (FRANCE)      | http://www.cnrs-imn.fr                |
>>>>> ------------------------------------------------------------------------------
>>>>
>>>> --
>>>> P.Blaha
>>>> --------------------------------------------------------------------------
>>>> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
>>>> Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
>>>> Email: blaha at theochem.tuwien.ac.at  WWW: http://info.tuwien.ac.at/theochem/
>>>> --------------------------------------------------------------------------
>>>
>>> --
>>> Professor Laurence Marks
>>> Department of Materials Science and Engineering
>>> Northwestern University
>>> www.numis.northwestern.edu 1-847-491-3996
>>> "Research is to see what everybody else has seen, and to think what
>>> nobody else has thought"
>>> Albert Szent-Gyorgi
>>
>> --
>> P.Blaha
>> --------------------------------------------------------------------------
>> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
>> Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
>> Email: blaha at theochem.tuwien.ac.at  WWW: http://info.tuwien.ac.at/theochem/
>> --------------------------------------------------------------------------

--
------------------------------------------------------------------------------
| Florent BOUCHER                    |                                       |
| Institut des Matériaux Jean Rouxel | Mailto:Florent.Boucher at cnrs-imn.fr |
| 2, rue de la Houssinière           | Phone: (33) 2 40 37 39 24             |
| BP 32229                           | Fax:   (33) 2 40 37 39 95             |
| 44322 NANTES CEDEX 3 (FRANCE)      | http://www.cnrs-imn.fr                |
------------------------------------------------------------------------------
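
As a reading aid for the "set a=..." line in the quoted parallel_options file: the pipeline picks the first line of .machines that contains "1:" and is not the lapw0 line, then keeps at most the first two characters of its third colon-separated field. Below is a minimal sketch of the same pipeline as a POSIX sh function, assuming .machines entries of the form 1:host:N. The function name threads_from_machines and the sample file contents are hypothetical.

```sh
#!/bin/sh
# Same grep/cut pipeline as in the quoted parallel_options, wrapped in a
# function so it can be tried on any file. threads_from_machines is a
# hypothetical name, not part of WIEN2k.
threads_from_machines() {
    # $1 = path to a .machines-style file
    grep -e "1:" "$1" | grep -v lapw0 | head -1 | cut -f 3 -d: | cut -c 1-2
}

# Hypothetical sample file in the assumed 1:host:N layout.
sample=$(mktemp)
printf '%s\n' 'lapw0:node1:8' '1:node1:8' '1:node2:8' 'granularity:1' > "$sample"

# The lapw0 line matches "1:" (through "node1:") but is filtered out,
# so the first remaining line is 1:node1:8 and its third field is printed.
threads_from_machines "$sample"   # prints 8 for this sample
rm -f "$sample"
```

In the thread, this value is then exported as both MKL_NUM_THREADS and OMP_NUM_THREADS.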