Hi, I did some further testing with more intensive logging, and came up with the following. Maybe this log helps a bit more:
/usr/local/maui/log/maui.log
-------------------------
05/16 16:02:19 MStatClearUsage([NONE],Idle)
05/16 16:02:19 MPolicyAdjustUsage(NULL,104,NULL,idle,PU,[ALL],1,NULL)
05/16 16:02:19 MPolicyAdjustUsage(NULL,104,NULL,idle,NULL,[ALL],1,NULL)
05/16 16:02:19 INFO: total jobs selected (ALL): 1/1
05/16 16:02:19 INFO: jobs selected: [000: 1]
05/16 16:02:19 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)
05/16 16:02:19 INFO: total jobs selected in partition ALL: 1/1
05/16 16:02:19 MQueueScheduleRJobs(Q)
05/16 16:02:19 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
05/16 16:02:19 INFO: total jobs selected in partition ALL: 1/1
05/16 16:02:19 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
05/16 16:02:19 INFO: total jobs selected in partition DEFAULT: 1/1
05/16 16:02:19 MQueueScheduleIJobs(Q,DEFAULT)
05/16 16:02:19 INFO: checking job 104(1) state: Idle (ex: Idle)
05/16 16:02:19 MJobSelectMNL(104,DEFAULT,NULL,MNodeList,NodeMap,MaxSpeed,2)
-----------------
Is this the reason why it fails?
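For context on MJobSelectMNL: the decisive lines are the feasibility counts in the trace that follows, where Maui counts how many tasks each node can host given the job's dedicated per-task resources. A rough sketch of that per-node arithmetic in shell, using the PROCS/SWAP figures from the checknode output quoted below and the 15G-per-task swap demand from checkjob. The MB conversions and rounding here are my assumptions, and this sketch yields 3 feasible tasks for these two nodes, where the log reports 2 for job 104, whose requirements may differ slightly:

```shell
#!/bin/bash
# Dedicated resources per task, from checkjob: PROCS: 1, SWAP: 15G
task_swap_mb=15360                 # 15G expressed in MB (assumed conversion)

# Node inventory, from checknode: "PROCS: 3 ... SWAP: 33G" and "PROCS: 2 ... SWAP: 17G"
total_feasible=0
for node in "3 33792" "2 17408"; do
  set -- $node
  procs=$1
  swap_mb=$2
  by_swap=$(( swap_mb / task_swap_mb ))          # tasks that fit in configured swap
  fit=$(( procs < by_swap ? procs : by_swap ))   # also capped by processor count
  echo "node: ${procs} procs, ${swap_mb}MB swap -> ${fit} feasible task(s)"
  total_feasible=$(( total_feasible + fit ))
done
echo "total feasible tasks: ${total_feasible}"
```

Either way the total falls far short of the 10 tasks the log says job 104 needs, and it is the per-task swap demand, not processor count, that limits feasibility, which matches the "inadequate feasible tasks" and "NoResources" messages.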
------
05/16 16:02:19 MReqGetFNL(104,0,DEFAULT,NULL,DstNL,NC,TC,2140000000,0)
05/16 16:02:19 INFO: 2 feasible tasks found for job 104:0 in partition DEFAULT (10 Needed)
05/16 16:02:19 INFO: inadequate feasible tasks found for job 104:0 in partition DEFAULT (2 < 10)
05/16 16:02:19 MJobPReserve(104,DEFAULT,ResCount,ResCountRej)
--------------------------------------------
05/16 16:02:19 MJobReserve(104,Priority)
05/16 16:02:19 MPolicyGetEStartTime(104,ALL,SOFT,Time)
05/16 16:02:19 INFO: policy start time found for job 104 in 00:00:00
05/16 16:02:19 MJobGetEStartTime(104,NULL,NodeCount,TaskCount,MNodeList,1179324139)
05/16 16:02:19 ALERT: job 104 cannot run in any partition
05/16 16:02:19 ALERT: cannot create new reservation for job 104 (shape[1] 10)
05/16 16:02:19 ALERT: cannot create new reservation for job 104
05/16 16:02:19 MJobSetHold(104,16,1:00:00,NoResources,cannot create reservation for job '104' (intital reservation attempt))
05/16 16:02:19 ALERT: job '104' cannot run (deferring job for 3600 seconds)
05/16 16:02:19 WARNING: cannot reserve priority job '104'
cannot locate adequate feasible tasks for job 104:0
---------------------------------
Maybe this can help some more.

Daniel Boone wrote:
>
> I tried some new parameters.
>
> print server output of qmgr:
> ----------------
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_default.mem = 2000mb
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.pvmem = 16000mb
> set queue batch resources_default.walltime = 06:00:00
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server managers = [EMAIL PROTECTED]
> set server operators = [EMAIL PROTECTED]
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server pbs_version = 2.1.8
> ----------------------
> checkjob output:
> ----------------------
> checking job 90 (RM job '90.em-research00')
>
> State: Idle  EState: Deferred
> Creds:  user:abaqus  group:users  class:batch  qos:DEFAULT
> WallTime: 00:00:00 of 5:00:00
> SubmitTime: Tue May 15 11:59:03
>   (Time Queued  Total: 1:58:17  Eligible: 00:00:00)
>
> Total Tasks: 4
>
> Req[0]  TaskCount: 4  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 15G
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> Exec: ''  ExecSize: 0  ImageSize: 0
> Dedicated Resources Per Task: PROCS: 1  MEM: 250M  SWAP: 15G
> NodeAccess: SHARED
> TasksPerNode: 2  NodeCount: 2
>
> IWD: [NONE]  Executable: [NONE]
> Bypass: 0  StartCount: 0
> PartitionMask: [ALL]
> SystemQueueTime: Tue May 15 13:00:06
>
> Flags: RESTARTABLE
>
> job is deferred.  Reason: NoResources (cannot create reservation for job '90' (intital reservation attempt))
> Holds: Defer (hold reason: NoResources)
> PE: 6.07  StartPriority: 57
> cannot select job 90 for partition DEFAULT (job hold active)
> -------------------
> pbs-script:
> -------------------
>
> #!/bin/bash
> #PBS -l nodes=2:ppn=2
> #PBS -l walltime=05:00:00
> #PBS -l mem=1000mb
> #PBS -l vmem=7000mb
> #PBS -j oe
> #PBS -M [EMAIL PROTECTED]
> #PBS -m bae
> # Go to the directory from which you submitted the job
> mkdir $PBS_O_WORKDIR
> string="$PBS_O_WORKDIR/plus2gb.inp"
>
> scp 10.1.0.52:$string $PBS_O_WORKDIR
>
> cd $PBS_O_WORKDIR
> #module load abaqus
> #
> /Apps/abaqus/Commands/abaqus job=plus2gb queue=abaqus4cpu input=Standard_plus2gbyte.inp cpus=4
> ---------------------------
> abaqus environment file:
> --------------------------
> import os
> os.environ['LAMRSH'] = 'ssh'
>
> max_cpus=6
>
> mp_host_list=[['em-research00',3],['10.1.0.97',2]]
>
> run_mode = BATCH
> scratch = "/home/abaqus"
>
> queue_name=["cpu","abaqus4cpu"]
> queue_cmd="qsub -r n -q batch -S /bin/bash -V -l nodes=1:ppn=1 %S"
> cpu="qsub -r n -q batch -S /bin/bash -V -l nodes=1:ppn=2 %S"
> abaqus4cpu="qsub -r n -q batch -S /bin/bash -V -l nodes=2:ppn=2 %S"
>
> pre_memory = "3000 mb"
> standard_memory = "7000 mb"
> ---------------------------
>
> But still no changes.
>
> Thanks for all the help until now.
>
> rishi pathak wrote:
>
>> Also try in your job script file:
>> #PBS -l pvmem=<amount of virtual memory>
>>
>> On 5/15/07, rishi pathak <[EMAIL PROTECTED]> wrote:
>>
>> I did not see any specific queue in the submit script.
>> Have you specified the following for the queue you are using?
>>
>> resources_default.mem   # available RAM
>> resources_default.pvmem # virtual memory
>>
>> On 5/15/07, Daniel Boone <[EMAIL PROTECTED]> wrote:
>>
>> Hi
>>
>> I need to use the swap. I know I don't have enough RAM, but the job must
>> be able to run, even if it swaps a lot. Time is not an issue here.
>> On 1 machine the job uses about 7.4GB swap. We don't have any other
>> machines with more RAM to run it on.
>> Otherwise the other option is to run the job outside torque/maui, but I'd
>> rather not do that.
>>
>> Can someone tell me how to read the checkjob -v output, because I don't
>> understand how to find errors in it.
>> rishi pathak wrote:
>> > Hi
>> > The system memory (RAM) available per process is less than the
>> > requested amount.
>> > It is not considering swap as an extension of RAM.
>> > Try with reduced system memory.
>> >
>> > On 5/14/07, Daniel Boone <[EMAIL PROTECTED]> wrote:
>> >
>> > Hi
>> >
>> > I'm having the following problem. When I submit a very
>> > memory-intensive (mostly swap) job, the job doesn't want to start.
>> > It gives the error: cannot select job 62 for partition DEFAULT
>> > (job hold active)
>> > But I don't understand what the error means.
>> >
>> > I run torque 2.1.8 with maui-3.2.6p19.
>> >
>> > checkjob -v returns the following:
>> > -------------------
>> > checking job 62 (RM job '62.em-research00')
>> >
>> > State: Idle  EState: Deferred
>> > Creds:  user:abaqus  group:users  class:batch  qos:DEFAULT
>> > WallTime: 00:00:00 of 6:00:00
>> > SubmitTime: Mon May 14 14:13:41
>> >   (Time Queued  Total: 1:53:39  Eligible: 00:00:00)
>> >
>> > Total Tasks: 4
>> >
>> > Req[0]  TaskCount: 4  Partition: ALL
>> > Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>> > Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>> > Exec: ''  ExecSize: 0  ImageSize: 0
>> > Dedicated Resources Per Task: PROCS: 1  MEM: 3875M
>> > NodeAccess: SHARED
>> > TasksPerNode: 2  NodeCount: 2
>> >
>> > IWD: [NONE]  Executable: [NONE]
>> > Bypass: 0  StartCount: 0
>> > PartitionMask: [ALL]
>> > SystemQueueTime: Mon May 14 15:14:13
>> >
>> > Flags: RESTARTABLE
>> >
>> > job is deferred.
>> > Reason: NoResources (cannot create reservation for
>> > job '62' (intital reservation attempt))
>> > Holds: Defer (hold reason: NoResources)
>> > PE: 19.27  StartPriority: 53
>> > cannot select job 62 for partition DEFAULT (job hold active)
>> > ------------------------
>> > checknode of the two nodes:
>> > ------------
>> > checking node em-research00
>> > State: Idle (in current state for 2:31:21)
>> > Configured Resources: PROCS: 3  MEM: 2010M  SWAP: 33G  DISK: 72G
>> > Utilized Resources: DISK: 9907M
>> > Dedicated Resources: [NONE]
>> > Opsys: linux  Arch: [NONE]
>> > Speed: 1.00  Load: 0.000
>> > Network: [DEFAULT]
>> > Features: [F]
>> > Attributes: [Batch]
>> > Classes: [batch 3:3]
>> >
>> > Total Time: 2:29:18  Up: 2:29:18 (100.00%)  Active: 00:00:00 (0.00%)
>> >
>> > Reservations:
>> > NOTE: no reservations on node
>> > --------------------
>> > State: Idle (in current state for 2:31:52)
>> > Configured Resources: PROCS: 2  MEM: 2012M  SWAP: 17G  DISK: 35G
>> > Utilized Resources: DISK: 24G
>> > Dedicated Resources: [NONE]
>> > Opsys: linux  Arch: [NONE]
>> > Speed: 1.00  Load: 0.590
>> > Network: [DEFAULT]
>> > Features: [NONE]
>> > Attributes: [Batch]
>> > Classes: [batch 2:2]
>> >
>> > Total Time: 2:29:49  Up: 2:29:49 (100.00%)  Active: 00:00:00 (0.00%)
>> >
>> > Reservations:
>> > NOTE: no reservations on node
>> > -----------------
>> > The pbs script I'm using:
>> > #!/bin/bash
>> > #PBS -l nodes=2:ppn=2
>> > #PBS -l walltime=06:00:00
>> > #PBS -l mem=15500mb
>> > #PBS -j oe
>> > # Go to the directory from which you submitted the job
>> > mkdir $PBS_O_WORKDIR
>> > string="$PBS_O_WORKDIR/plus2gb.inp"
>> > scp 10.1.0.52:$string $PBS_O_WORKDIR
>> > #scp 10.1.0.52:$PBS_O_WORKDIR'/'$PBS_JOBNAME ./
>> > cd $PBS_O_WORKDIR
>> > #module load abaqus
>> > #
>> > /Apps/abaqus/Commands/abaqus job=plus2gb queue=cpu2 input=Standard_plus2gbyte.inp cpus=4 mem=15000mb
>> > ---------------------------
>> > If you need some extra info please let me know.
>> >
>> > Thank you
>> >
>> > _______________________________________________
>> > mauiusers mailing list
>> > [email protected]
>> > http://www.supercluster.org/mailman/listinfo/mauiusers
>> >
>> > --
>> > Regards--
>> > Rishi Pathak
>> > National PARAM Supercomputing Facility
>> > Center for Development of Advanced Computing (C-DAC)
>> > Pune University Campus, Ganesh Khind Road
>> > Pune-Maharastra

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
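A closing note on the likely root cause visible in this thread: qmgr sets resources_default.pvmem = 16000mb, and Maui accounts pvmem as dedicated swap per task (hence the "SWAP: 15G" per task in the checkjob output). With ppn=2 that means roughly 31G of swap per node, which the 17G node can never supply. A sketch of two ways to bring the request within reach; the 8000mb figure is an assumption sized so that two tasks fit on the 17G node, and should be checked against the job's real usage (about 7.4GB was reported above):

```shell
# Queue-wide: lower the pvmem default so two ppn=2 tasks fit in 17G of swap
# (8000mb per task is an assumed value; adjust for your nodes)
qmgr -c "set queue batch resources_default.pvmem = 8000mb"

# Or per job, overriding the queue default in the PBS script:
#PBS -l nodes=2:ppn=2
#PBS -l pvmem=8000mb
```

Either change only affects what the scheduler reserves; the job can still spill into swap at run time as long as the limit is above its actual per-process footprint.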
