Hi,

We've run into an issue submitting jobs larger than 4096 tasks on a Torque/Maui combination. The following job runs:

    $ qsub -I -lnodes=170:ppn=24

When we go larger by one node:

    $ qsub -I -lnodes=171:ppn=24

the job sits in the blocked queue with a state of Idle, and checkjob reports:

    cannot select job 104 for partition DEFAULT (NodeCount)

I did some searching and found information about limits on the number of jobs, but not much on the number of tasks per job. I tried increasing MAX_MTASK from the default of 4096 to 16384 to cover the core count on our cluster. With that change we can submit jobs larger than 4096, but Maui crashes within minutes once we start submitting jobs.

These are the two parameters we change before rebuilding Maui:

    sed -i '/MMAX_JOB/ s/4096/8192/g' ./include/msched.h
    sed -i '/MAX_MTASK/ s/4096/16384/g' ./include/msched-common.h

The MMAX_JOB change is already in our current build and has no adverse effect on Maui; the crashes appear only when we increase MAX_MTASK.

Is it possible we're missing another parameter change, or could this be a ulimit issue? Here are the ulimits on the system where Maui is running:

    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 0
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 1196032
    max locked memory       (kbytes, -l) unlimited
    max memory size         (kbytes, -m) unlimited
    open files                      (-n) 4096
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) unlimited
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 1196032
    virtual memory          (kbytes, -v) unlimited
    file locks                      (-x) unlimited

Thanks,
Steve
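
P.S. For reference, here is a rough sketch of the arithmetic behind the two qsub requests. It assumes the scheduler counts one task per requested processor (nodes times ppn), which is my reading of the behaviour and not something I have confirmed in the Maui source. The function names total_tasks and fits_limit are made up for this illustration.

```shell
#!/bin/sh
# Default MAX_MTASK value mentioned above; the comparison below assumes
# the limit applies to nodes * ppn (an assumption, not verified).
MAX_MTASK=4096

# total_tasks NODES PPN: print the number of tasks (processors) requested.
total_tasks() {
    echo $(( $1 * $2 ))
}

# fits_limit NODES PPN: succeed if the request is within MAX_MTASK.
fits_limit() {
    [ "$(total_tasks "$1" "$2")" -le "$MAX_MTASK" ]
}

# Compare the request that runs (170 nodes) with the one that blocks (171).
for nodes in 170 171; do
    tasks=$(total_tasks "$nodes" 24)
    if fits_limit "$nodes" 24; then
        echo "nodes=$nodes ppn=24 -> $tasks tasks (within $MAX_MTASK)"
    else
        echo "nodes=$nodes ppn=24 -> $tasks tasks (exceeds $MAX_MTASK)"
    fi
done
```

Under that assumption, 170 nodes gives 4080 tasks and 171 nodes gives 4104, so the two requests fall on either side of the 4096 default.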
