I've observed the following problem:

1) preemptible job "A" from our "background" Torque queue is running on a 
node. This job has specified -l nodes=1:ppn=8, a full node.

2) job A is sent SIGSTOP and preempting job B from the "high" queue starts. It 
has also specified -l nodes=1:ppn=8.

3) job B is short lived and exits before the next maui iteration

4) on the next iteration job A is *not* sent a SIGCONT to resume

5) instead queued job C with -l nodes=1:ppn=8 from the background queue is 
started on the same node.  Job B is gone and job A is still suspended.

The expected behavior, of course, is that job A would be resumed.
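
To look for this pattern in scheduler logs, the sequence above could be checked 
mechanically. What follows is only a minimal sketch: it assumes the start, 
suspend, resume and end events have already been extracted into 
(action, job, node, qos) tuples. That tuple layout, the action names and the 
function name find_stranded_jobs are all hypothetical; this is not Maui's log 
format or any Maui API. It flags every preemptee job that starts on a node 
where another job is still suspended.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical event layout: (action, job_id, node, qos).
# This is NOT Maui's log format; events must be extracted by hand or by script.
Event = Tuple[str, str, str, str]


def find_stranded_jobs(
    events: Iterable[Event], preemptee_qos: str = "background"
) -> List[Tuple[str, str, str]]:
    """Return (suspended_job, node, started_job) for each preemptee job that
    starts on a node while another job on that node is still suspended."""
    suspended: Dict[str, Set[str]] = {}  # node -> ids of suspended jobs
    stranded: List[Tuple[str, str, str]] = []
    for action, job, node, qos in events:
        held = suspended.setdefault(node, set())
        if action == "suspend":
            held.add(job)
        elif action in ("resume", "end"):
            # A resumed or finished job is no longer waiting for SIGCONT.
            held.discard(job)
        elif action == "start" and qos == preemptee_qos:
            # A preemptee starting here means the suspended job was skipped.
            for waiting in sorted(held):
                stranded.append((waiting, node, job))
    return stranded


# The five steps from this report, on a hypothetical node named "node1".
events = [
    ("start", "A", "node1", "background"),
    ("suspend", "A", "node1", "background"),
    ("start", "B", "node1", "preemptor"),
    ("end", "B", "node1", "preemptor"),
    ("start", "C", "node1", "background"),
]
print(find_stranded_jobs(events))  # [('A', 'node1', 'C')]
```

If job A had been resumed before job C started, the result would be an empty 
list.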

Here are some of the relevant settings from maui.cfg ('\' line continuations 
added for readability):

RMPOLLINTERVAL                  00:00:45
NODEACCESSPOLICY                SHARED
NODEAVAILABILITYPOLICY          COMBINED
NODEALLOCATIONPOLICY            PRIORITY
PREEMPTPOLICY                   SUSPEND
BACKFILLPOLICY                  BESTFIT
BACKFILLMETRIC                  PROCSECONDS
BFCHUNKDURATION                 05:00
BFCHUNKSIZE                     4
RESERVATIONPOLICY               CURRENTHIGHEST
RESERVATIONDEPTH[0]             256
RESERVATIONQOSLIST[0]           preemptor
RESERVATIONDEPTH[1]             384
RESERVATIONQOSLIST[1]           background
DEFERTIME                       0
DEFERCOUNT                      1000
QOSCFG[background]              QFLAGS=DEDICATED:PREEMPTEE
QOSCFG[preemptor]               QFLAGS=DEDICATED:PREEMPTOR
CLASSCFG[background]            QDEF=background PRIORITY=9 \
                                 MAXPROC=2240 MAXNODE=280
CLASSCFG[high]                  QDEF=preemptor PRIORITY=10000 \
                                 MAXPROC=1024,1440 MAXNODE=128,180

Notice that the default NODEACCESSPOLICY is SHARED, but the QFLAGS should 
override this with DEDICATED for the two QOSCFG's.  We have other queues that 
omit QFLAGS=DEDICATED and pack np single-core jobs onto a node.

I can't say this is fully reproducible, but a short-lived job can trigger 
this behavior.

The total number of jobs across all queues is approximately 600.

Maui 3.3.1.

Any ideas?  Any thoughts on further diagnosis?  The log level is currently 
somewhat low, so I'm only seeing the MRMJob{Start,Suspend,Resume} actions.

// Steve
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
