Hi, I did some further testing with more intensive logging, and came up with the following. Maybe this log helps a bit more:
/usr/local/maui/log/maui.log
-------------------------
05/16 16:02:19 MStatClearUsage([NONE],Idle)
05/16 16:02:19 MPolicyAdjustUsage(NULL,104,NULL,idle,PU,[ALL],1,NULL)
05/16 16:02:19 MPolicyAdjustUsage(NULL,104,NULL,idle,NULL,[ALL],1,NULL)
05/16 16:02:19 INFO: total jobs selected (ALL): 1/1
05/16 16:02:19 INFO: jobs selected: [000: 1]
05/16 16:02:19 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)
05/16 16:02:19 INFO: total jobs selected in partition ALL: 1/1
05/16 16:02:19 MQueueScheduleRJobs(Q)
05/16 16:02:19 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
05/16 16:02:19 INFO: total jobs selected in partition ALL: 1/1
05/16 16:02:19 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
05/16 16:02:19 INFO: total jobs selected in partition DEFAULT: 1/1
05/16 16:02:19 MQueueScheduleIJobs(Q,DEFAULT)
05/16 16:02:19 INFO: checking job 104(1) state: Idle (ex: Idle)
05/16 16:02:19 MJobSelectMNL(104,DEFAULT,NULL,MNodeList,NodeMap,MaxSpeed,2)
-----------------
Is this the reason why it fails?
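For context on MJobSelectMNL: the decisive lines are the feasibility counts in the trace that follows, where Maui counts how many tasks each node can host given the job's dedicated per-task resources. A rough sketch of that per-node arithmetic in shell, using the PROCS/SWAP figures from the checknode output quoted below and the 15G-per-task swap demand from checkjob. The MB conversions and rounding here are my assumptions, and this sketch yields 3 feasible tasks for these two nodes, where the log reports 2 for job 104, whose requirements may differ slightly:

```shell
#!/bin/bash
# Dedicated resources per task, from checkjob: PROCS: 1, SWAP: 15G
task_swap_mb=15360                 # 15G expressed in MB (assumed conversion)

# Node inventory, from checknode: "PROCS: 3 ... SWAP: 33G" and "PROCS: 2 ... SWAP: 17G"
total_feasible=0
for node in "3 33792" "2 17408"; do
  set -- $node
  procs=$1
  swap_mb=$2
  by_swap=$(( swap_mb / task_swap_mb ))          # tasks that fit in configured swap
  fit=$(( procs < by_swap ? procs : by_swap ))   # also capped by processor count
  echo "node: ${procs} procs, ${swap_mb}MB swap -> ${fit} feasible task(s)"
  total_feasible=$(( total_feasible + fit ))
done
echo "total feasible tasks: ${total_feasible}"
```

Either way the total falls far short of the 10 tasks the log says job 104 needs, and it is the per-task swap demand, not processor count, that limits feasibility, which matches the "inadequate feasible tasks" and "NoResources" messages.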
------
05/16 16:02:19 MReqGetFNL(104,0,DEFAULT,NULL,DstNL,NC,TC,2140000000,0)
05/16 16:02:19 INFO: 2 feasible tasks found for job 104:0 in partition DEFAULT (10 Needed)
05/16 16:02:19 INFO: inadequate feasible tasks found for job 104:0 in partition DEFAULT (2 < 10)
05/16 16:02:19 MJobPReserve(104,DEFAULT,ResCount,ResCountRej)
--------------------------------------------
05/16 16:02:19 MJobReserve(104,Priority)
05/16 16:02:19 MPolicyGetEStartTime(104,ALL,SOFT,Time)
05/16 16:02:19 INFO: policy start time found for job 104 in 00:00:00
05/16 16:02:19 MJobGetEStartTime(104,NULL,NodeCount,TaskCount,MNodeList,1179324139)
05/16 16:02:19 ALERT: job 104 cannot run in any partition
05/16 16:02:19 ALERT: cannot create new reservation for job 104 (shape[1] 10)
05/16 16:02:19 ALERT: cannot create new reservation for job 104
05/16 16:02:19 MJobSetHold(104,16,1:00:00,NoResources,cannot create reservation for job '104' (intital reservation attempt))
05/16 16:02:19 ALERT: job '104' cannot run (deferring job for 3600 seconds)
05/16 16:02:19 WARNING: cannot reserve priority job '104'
cannot locate adequate feasible tasks for job 104:0
---------------------------------
Maybe this can help some more.

Daniel Boone wrote:
>
> I tried some new parameters.
>
> print server output of qmgr:
> ----------------
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_default.mem = 2000mb
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.pvmem = 16000mb
> set queue batch resources_default.walltime = 06:00:00
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server managers = [EMAIL PROTECTED]
> set server operators = [EMAIL PROTECTED]
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server pbs_version = 2.1.8
> ----------------------
> checkjob output:
> ----------------------
> checking job 90 (RM job '90.em-research00')
>
> State: Idle  EState: Deferred
> Creds:  user:abaqus  group:users  class:batch  qos:DEFAULT
> WallTime: 00:00:00 of 5:00:00
> SubmitTime: Tue May 15 11:59:03
>   (Time Queued  Total: 1:58:17  Eligible: 00:00:00)
>
> Total Tasks: 4
>
> Req[0]  TaskCount: 4  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 15G
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> Exec: ''  ExecSize: 0  ImageSize: 0
> Dedicated Resources Per Task: PROCS: 1  MEM: 250M  SWAP: 15G
> NodeAccess: SHARED
> TasksPerNode: 2  NodeCount: 2
>
> IWD: [NONE]  Executable: [NONE]
> Bypass: 0  StartCount: 0
> PartitionMask: [ALL]
> SystemQueueTime: Tue May 15 13:00:06
>
> Flags: RESTARTABLE
>
> job is deferred.  Reason: NoResources (cannot create reservation for job '90' (intital reservation attempt))
> Holds: Defer (hold reason: NoResources)
> PE: 6.07  StartPriority: 57
> cannot select job 90 for partition DEFAULT (job hold active)
> -------------------
> pbs-script:
> -------------------
>
> #!/bin/bash
> #PBS -l nodes=2:ppn=2
> #PBS -l walltime=05:00:00
> #PBS -l mem=1000mb
> #PBS -l vmem=7000mb
> #PBS -j oe
> #PBS -M [EMAIL PROTECTED]
> #PBS -m bae
> # Go to the directory from which you submitted the job
> mkdir $PBS_O_WORKDIR
> string="$PBS_O_WORKDIR/plus2gb.inp"
>
> scp 10.1.0.52:$string $PBS_O_WORKDIR
>
> cd $PBS_O_WORKDIR
> #module load abaqus
> #
> /Apps/abaqus/Commands/abaqus job=plus2gb queue=abaqus4cpu input=Standard_plus2gbyte.inp cpus=4
> ---------------------------
> abaqus environment file:
> --------------------------
> import os
> os.environ['LAMRSH'] = 'ssh'
>
> max_cpus=6
>
> mp_host_list=[['em-research00',3],['10.1.0.97',2]]
>
> run_mode = BATCH
> scratch = "/home/abaqus"
>
> queue_name=["cpu","abaqus4cpu"]
> queue_cmd="qsub -r n -q batch -S /bin/bash -V -l nodes=1:ppn=1 %S"
> cpu="qsub -r n -q batch -S /bin/bash -V -l nodes=1:ppn=2 %S"
> abaqus4cpu="qsub -r n -q batch -S /bin/bash -V -l nodes=2:ppn=2 %S"
>
> pre_memory = "3000 mb"
> standard_memory = "7000 mb"
> ---------------------------
>
> But still no changes.
>
> Thanks for all the help until now.
>
> rishi pathak wrote:
>
>> Also try in your job script file:
>> #PBS -l pvmem=<amount of virtual memory>
>>
>> On 5/15/07, rishi pathak <[EMAIL PROTECTED]> wrote:
>>
>> I did not see any specific queue in the submit script.
>> Have you specified the following for the queue you are using?
>>
>> resources_default.mem   # available RAM
>> resources_default.pvmem # virtual memory
>>
>> On 5/15/07, Daniel Boone <[EMAIL PROTECTED]> wrote:
>>
>> Hi
>>
>> I need to use the swap. I know I don't have enough RAM, but the job must
>> be able to run, even if it swaps a lot. Time is not an issue here.
>> On 1 machine the job uses about 7.4GB swap. We don't have any other
>> machines with more RAM to run it on.
>> Otherwise the other option is to run the job outside torque/maui, but I'd
>> rather not do that.
>>
>> Can someone tell me how to read the checkjob -v output, because I don't
>> understand how to find errors in it.
>> rishi pathak wrote:
>> > Hi
>> > The system memory (RAM) available per process is less than the
>> > requested amount.
>> > It is not considering swap as an extension of RAM.
>> > Try with reduced system memory.
>> >
>> > On 5/14/07, Daniel Boone <[EMAIL PROTECTED]> wrote:
>> >
>> > Hi
>> >
>> > I'm having the following problem. When I submit a very
>> > memory-intensive (mostly swap) job, the job doesn't want to start.
>> > It gives the error: cannot select job 62 for partition DEFAULT
>> > (job hold active)
>> > But I don't understand what the error means.
>> >
>> > I run torque 2.1.8 with maui-3.2.6p19.
>> >
>> > checkjob -v returns the following:
>> > -------------------
>> > checking job 62 (RM job '62.em-research00')
>> >
>> > State: Idle  EState: Deferred
>> > Creds:  user:abaqus  group:users  class:batch  qos:DEFAULT
>> > WallTime: 00:00:00 of 6:00:00
>> > SubmitTime: Mon May 14 14:13:41
>> >   (Time Queued  Total: 1:53:39  Eligible: 00:00:00)
>> >
>> > Total Tasks: 4
>> >
>> > Req[0]  TaskCount: 4  Partition: ALL
>> > Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>> > Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>> > Exec: ''  ExecSize: 0  ImageSize: 0
>> > Dedicated Resources Per Task: PROCS: 1  MEM: 3875M
>> > NodeAccess: SHARED
>> > TasksPerNode: 2  NodeCount: 2
>> >
>> > IWD: [NONE]  Executable: [NONE]
>> > Bypass: 0  StartCount: 0
>> > PartitionMask: [ALL]
>> > SystemQueueTime: Mon May 14 15:14:13
>> >
>> > Flags: RESTARTABLE
>> >
>> > job is deferred.
>> > Reason: NoResources (cannot create reservation for
>> > job '62' (intital reservation attempt))
>> > Holds: Defer (hold reason: NoResources)
>> > PE: 19.27  StartPriority: 53
>> > cannot select job 62 for partition DEFAULT (job hold active)
>> > ------------------------
>> > checknode of the two nodes:
>> > ------------
>> > checking node em-research00
>> > State: Idle (in current state for 2:31:21)
>> > Configured Resources: PROCS: 3  MEM: 2010M  SWAP: 33G  DISK: 72G
>> > Utilized Resources: DISK: 9907M
>> > Dedicated Resources: [NONE]
>> > Opsys: linux  Arch: [NONE]
>> > Speed: 1.00  Load: 0.000
>> > Network: [DEFAULT]
>> > Features: [F]
>> > Attributes: [Batch]
>> > Classes: [batch 3:3]
>> >
>> > Total Time: 2:29:18  Up: 2:29:18 (100.00%)  Active: 00:00:00 (0.00%)
>> >
>> > Reservations:
>> > NOTE: no reservations on node
>> > --------------------
>> > State: Idle (in current state for 2:31:52)
>> > Configured Resources: PROCS: 2  MEM: 2012M  SWAP: 17G  DISK: 35G
>> > Utilized Resources: DISK: 24G
>> > Dedicated Resources: [NONE]
>> > Opsys: linux  Arch: [NONE]
>> > Speed: 1.00  Load: 0.590
>> > Network: [DEFAULT]
>> > Features: [NONE]
>> > Attributes: [Batch]
>> > Classes: [batch 2:2]
>> >
>> > Total Time: 2:29:49  Up: 2:29:49 (100.00%)  Active: 00:00:00 (0.00%)
>> >
>> > Reservations:
>> > NOTE: no reservations on node
>> > -----------------
>> > The pbs script I'm using:
>> > #!/bin/bash
>> > #PBS -l nodes=2:ppn=2
>> > #PBS -l walltime=06:00:00
>> > #PBS -l mem=15500mb
>> > #PBS -j oe
>> > # Go to the directory from which you submitted the job
>> > mkdir $PBS_O_WORKDIR
>> > string="$PBS_O_WORKDIR/plus2gb.inp"
>> > scp 10.1.0.52:$string $PBS_O_WORKDIR
>> > #scp 10.1.0.52:$PBS_O_WORKDIR'/'$PBS_JOBNAME ./
>> > cd $PBS_O_WORKDIR
>> > #module load abaqus
>> > #
>> > /Apps/abaqus/Commands/abaqus job=plus2gb queue=cpu2 input=Standard_plus2gbyte.inp cpus=4 mem=15000mb
>> > ---------------------------
>> > If you need some extra info please let me know.
>> >
>> > Thank you
>> >
>> > _______________________________________________
>> > mauiusers mailing list
>> > [email protected]
>> > http://www.supercluster.org/mailman/listinfo/mauiusers
>> >
>> > --
>> > Regards--
>> > Rishi Pathak
>> > National PARAM Supercomputing Facility
>> > Center for Development of Advanced Computing (C-DAC)
>> > Pune University Campus, Ganesh Khind Road
>> > Pune-Maharastra

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
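A closing note on the likely root cause visible in this thread: qmgr sets resources_default.pvmem = 16000mb, and Maui accounts pvmem as dedicated swap per task (hence the "SWAP: 15G" per task in the checkjob output). With ppn=2 that means roughly 31G of swap per node, which the 17G node can never supply. A sketch of two ways to bring the request within reach; the 8000mb figure is an assumption sized so that two tasks fit on the 17G node, and should be checked against the job's real usage (about 7.4GB was reported above):

```shell
# Queue-wide: lower the pvmem default so two ppn=2 tasks fit in 17G of swap
# (8000mb per task is an assumed value; adjust for your nodes)
qmgr -c "set queue batch resources_default.pvmem = 8000mb"

# Or per job, overriding the queue default in the PBS script:
#PBS -l nodes=2:ppn=2
#PBS -l pvmem=8000mb
```

Either change only affects what the scheduler reserves; the job can still spill into swap at run time as long as the limit is above its actual per-process footprint.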
