Hi,
I am the administrator of a really small cluster dedicated to scientific 
computing.
On this cluster TORQUE is run as resource manager and MAUI as scheduler.
The installed versions are:
TORQUE 2.4.6
MAUI 3.3

Everything used to work fine, but recently some of the users have started
launching their jobs with the node name specified on the command line.
They are doing this because they want to spread their jobs across different
nodes as much as possible: the jobs rely heavily on disk I/O, and they are
concerned about a performance hit if the scheduler places more than one of
these jobs on the same node. Therefore they want to take care of node
allocation themselves (if there is a better way to achieve this, please tell me).

For example, one of the jobs was launched with:

qsub -l pmem=500mb -l nodes=yyyyy:ppn=1 a.sh

They usually launch their jobs in bunches.
I think this is a rather important detail, since it seems that I cannot
replicate the problem described below by submitting a single job on its own.
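To make the pattern concrete, a bunch submission presumably looks something like the loop below. This is my own sketch, not their actual script: the node names (node01..node04) are placeholders, and `echo` is prepended so the loop can be shown without a live TORQUE installation.

```shell
# Hypothetical sketch of a "bunch" submission, one job pinned per node.
# Node names and the job script (a.sh) are placeholders; drop the 'echo'
# to actually submit the jobs on a system where qsub is available.
for node in node01 node02 node03 node04; do
    echo qsub -l pmem=500mb -l nodes=${node}:ppn=1 a.sh
done
```

Each iteration pins one job to one named node, which is exactly the per-node spreading the users are after.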

The problem with these jobs is that, as soon as they are submitted and for
as long as they run, MAUI complains continuously (about once per second)
with the following message:

"JOBCORRUPTION: job 'xxxxx' has the following idle node(s)"

A quick look at the log file turns up the following interesting lines:

MPBSJobUpdate(89704,89704.masternode.zzzz.zzzz.zzzz.zzzz,TaskList,0)
ALERT:    RM state corruption.  job '89704' has idle node 'yyyyy' allocated 
(node forced to active state)
MStatUpdateActiveJobUsage(89704)
MResDestroy(89704)
MResChargeAllocation(89704,2)
MResJCreate(89704,MNodeList,-7:22:17,ActiveJob,Res)
MSysRegEvent(JOBCORRUPTION:  job '89704' has the following idle node(s) 
allocated: 'yyyyy' ,0,0,1)
MSysLaunchAction(ASList,1)
INFO:     action 'notify' launched with message 'JOBCORRUPTION:  job '89704' 
has the following idle node(s) allocated:   'yyyyy' '

Apart from the error message, everything seems to work correctly.

I have found in the mailing list archives that this error can occur
occasionally, but it is generally limited to the first few moments after job
submission. In my case, however, I observe it for the entire duration of the job.
Can you help me pinpoint the cause of this annoying problem? Is it due to an
incorrect configuration? Is it a bug in TORQUE/MAUI?

Thanks.

Best regards.
-- 
Emanuele A. Bagnaschi


_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers