Hi,

I am the administrator of a very small cluster dedicated to scientific computing. The cluster runs TORQUE as the resource manager and MAUI as the scheduler. The installed versions are TORQUE 2.4.6 and MAUI 3.3.
Everything used to work fine, but recently some of the users have started launching their jobs specifying the node name on the command line. They do this because they want to spread their jobs across different nodes as much as possible: the jobs rely heavily on disk I/O, and the users are concerned about a performance hit if the scheduler places more than one of these jobs on the same node. Therefore they want to take care of the node allocation themselves (if there is a better way to achieve this, please tell me). For example, one of the jobs was launched with:

    qsub -l pmem=500mb -l nodes=yyyyy:ppn=1 a.sh

They usually launch their jobs in bunches. I think this is a rather important point, since I cannot seem to reproduce the problem described below by submitting a single job alone.

The problem with these jobs is that, as soon as they are submitted and for as _long_ as they run, MAUI complains continuously (about once per second), notifying me with the following message:

    JOBCORRUPTION: job 'xxxxx' has the following idle node(s)

A quick look at the log file turns up the following interesting lines:

    MPBSJobUpdate(89704,89704.masternode.zzzz.zzzz.zzzz.zzzz,TaskList,0)
    ALERT: RM state corruption. job '89704' has idle node 'yyyyy' allocated (node forced to active state)
    MStatUpdateActiveJobUsage(89704)
    MResDestroy(89704)
    MResChargeAllocation(89704,2)
    MResJCreate(89704,MNodeList,-7:22:17,ActiveJob,Res)
    MSysRegEvent(JOBCORRUPTION: job '89704' has the following idle node(s) allocated: 'yyyyy' ,0,0,1)
    MSysLaunchAction(ASList,1)
    INFO: action 'notify' launched with message 'JOBCORRUPTION: job '89704' has the following idle node(s) allocated: 'yyyyy' '

Apart from the error message, everything seems to work correctly. I have found in the mailing-list archives that this error can happen occasionally, but it is generally limited to the first few moments after job submission. Here, on the other hand, I observe it for the entire duration of the job.
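For reference, the bunch submission looks roughly like the following sketch (the node names node01..node03 and the loop are my own illustration, not the users' actual script; the `echo` only prints each command so the pattern can be inspected):

```shell
#!/bin/sh
# Sketch of the per-node submission pattern; node names are placeholders
# and should be replaced with the cluster's actual hostnames.
NODES="node01 node02 node03"

for n in $NODES; do
    # 'echo' prints the command instead of running it;
    # drop the 'echo' to actually submit the jobs.
    echo qsub -l pmem=500mb -l "nodes=$n:ppn=1" a.sh
done
```

Submitting such a loop (without the `echo`) is enough to trigger the JOBCORRUPTION messages on our cluster, while a single pinned job is not.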
Can you help me pinpoint the cause of this annoying problem? Is it due to an incorrect configuration? Is it a bug in TORQUE/MAUI?

Thanks. Best regards.

--
Emanuele A. Bagnaschi
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
