Hi all,
first of all, thanks to the Galaxy Team for this really useful software.
I don't really know whether my problem is related to Galaxy or to
Torque/Maui, but I didn't find any solution on either the Torque or the Maui
user lists, so I hope that some of you with more experience can give me
some good advice.

I'm trying to set up Galaxy in a small local virtual environment in order to
test it. I started with 2 virtual Ubuntu servers called galaxy1 and galaxy2.
On galaxy1 I successfully installed Galaxy, Apache, Torque and Maui. I'm
using Postgres as the DBMS; it is installed on another "real" DB server. The
virtual server galaxy2 is used as a node. Galaxy works like a charm locally,
but problems arise when I try to use Torque. Torque alone works correctly:
I can submit a job with qsub and everything works. The 2 virtual servers
(galaxy1 and galaxy2) share a directory (through NFS) in which I installed
Galaxy following the "unified method" from the documentation.

Now, as I said, Galaxy alone works and Torque/Maui alone works. When I put
the two together, nothing works.
As a test I upload a gff file (using the local runner). Then I try to filter
the gff using "Filter and sort -> Extract features". When I run this tool,
the corresponding job in the Torque queue stays in the Hold state forever.
Below is some output from the diagnostic commands.
The diagnose -j command reports the following:

Name                  State Par Proc QOS     WCLimit R  Min     User    Group  Account  QueuedTime  Network  Opsys   Arch    Mem   Disk  Procs       Class Features

29                     Hold DEF    1 DEF     1:00:00 0    1   galaxy   galaxy        -    00:02:36   [NONE] [NONE] [NONE]    >=0    >=0    NC0   [batch:1]   [NONE]

While the showq command reports

ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME


     0 Active Jobs       0 of    1 Processors Active (0.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

29                   galaxy       Hold     1     1:00:00  Wed May  4 03:56:40


The checkjob reports:
checking job 29

State: Hold
Creds:  user:galaxy  group:galaxy  class:batch  qos:DEFAULT
WallTime: 00:00:00 of 1:00:00
SubmitTime: Wed May  4 03:56:40
  (Time Queued  Total: 00:03:07  Eligible: 00:00:01)

The qstat -f reports

Job Id: 33.galaxy1.research.intra.ismaa.it
    Job_Name = 27_extract_features1_marco.more...@iasma.it
    Job_Owner = gal...@galaxy1.research.intra.ismaa.it
    job_state = W
    queue = batch
    server = galaxy1.research.intra.ismaa.it
    ctime = Wed May  4 04:56:36 2011
    Error_Path = galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/pbs/27.e
    exec_host = galaxy2/0
    exec_port = 15003
    Execution_Time = Wed May  4 05:26:41 2011
    mtime = Wed May  4 04:56:37 2011
    Output_Path = galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/pbs/27.o
    qtime = Wed May  4 04:56:36 2011
    Resource_List.neednodes = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 01:00:00
    stagein = /mnt/equallogic1/galaxy/tmp/dataset_18.dat@galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/files/000/dataset_18.dat,
        /mnt/equallogic1/galaxy/tmp/dataset_30.dat@galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/files/000/dataset_30.dat
    stageout = /mnt/equallogic1/galaxy/galaxy-dist/database/files/000/dataset_30.dat@galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/files/000/dataset_30.dat
    substate = 37
    Variable_List = PBS_O_QUEUE=batch,
        PBS_O_HOST=galaxy1.research.intra.ismaa.it
    euser = galaxy
    egroup = galaxy
    hashname = 33.galaxy1.research.intra.ismaa.it
    queue_rank = 33
    queue_type = E
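
A dump like the one above is easier to inspect once the wrapped lines are folded back together. The following is only a minimal Python sketch, not part of Galaxy or Torque: the function name parse_qstat_full is made up for illustration, and it assumes that attribute lines are indented by exactly four spaces and that anything else is a continuation of the previous value. The sample is a shortened copy of the output above. As far as the qstat documentation describes it, state W means the job is waiting for its Execution_Time to be reached, so the sketch also computes the gap between mtime and Execution_Time.

```python
import re
from datetime import datetime

# Assumption: attribute lines look like "    name = value" (four leading spaces).
ATTR_RE = re.compile(r"^ {4}(\S+) =\s*(.*)$")


def parse_qstat_full(text):
    """Parse the `qstat -f` output of a single job into a dict.

    Any non-empty line that is not an attribute line is appended to the
    previous value, which also undoes mail-client line wrapping.
    """
    attrs = {}
    key = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line.strip():
            continue
        if line.startswith("Job Id:"):
            attrs["Job Id"] = line.split(":", 1)[1].strip()
            key = None
            continue
        match = ATTR_RE.match(line)
        if match:
            key, value = match.groups()
            attrs[key] = value
        elif key is not None:
            attrs[key] += line.strip()
    return attrs


# Shortened sample taken from the output quoted above.
sample = """Job Id: 33.galaxy1.research.intra.ismaa.it
    job_state = W
    Execution_Time = Wed May  4 05:26:41 2011
    mtime = Wed May  4 04:56:37 2011
    Output_Path =
galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/pbs/27.
        o
"""

job = parse_qstat_full(sample)

# %a and %b are locale dependent; this assumes an English or C locale.
fmt = "%a %b %d %H:%M:%S %Y"
delay = datetime.strptime(job["Execution_Time"], fmt) - datetime.strptime(job["mtime"], fmt)

print(job["job_state"], job["Output_Path"], delay)
```

For the values above, the gap between mtime and Execution_Time comes out at 30 minutes and 4 seconds.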

StartDate: -00:03:06  Wed May  4 03:56:41
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
PE:  1.00  StartPriority:  1
cannot select job 29 for partition DEFAULT (non-idle state 'Hold')

and finally the tracejob reports

/var/spool/torque/server_priv/accounting/20110504: Permission denied
/var/spool/torque/mom_logs/20110504: No such file or directory
/var/spool/torque/sched_logs/20110504: No such file or directory

Job: 33.galaxy1.research.intra.ismaa.it

05/04/2011 04:56:36  S    enqueuing into batch, state 1 hop 1
05/04/2011 04:56:36  S    Job Queued at request of gal...@galaxy1.research.intra.ismaa.it, owner = gal...@galaxy1.research.intra.ismaa.it, job name = 27_extract_features1_marco.more...@iasma.it, queue = batch
05/04/2011 04:56:37  S    Job Run at request of gal...@galaxy1.research.intra.ismaa.it
05/04/2011 04:56:41  S    Email 's' to gal...@galaxy1.research.intra.ismaa.it failed: Child process 'sendmail -f adm gal...@galaxy1.research.intra.ismaa.it' returned 127 (errno 10:No child processes)

The only thing clear to me is that after submission the scheduler puts the
job in a Hold state, but I cannot understand why. I also tried running the
Galaxy-generated sh script directly with qsub. The Galaxy log shows:
galaxy.jobs.runners.pbs DEBUG 2011-05-04 04:56:36,748 (27) submitting file /mnt/equallogic1/galaxy/galaxy-dist/database/pbs/27.sh

I copied the 27.sh script and ran it with the command:
qsub 27.sh
and the job ran correctly. So what is not clear to me is whether the problem
lies with Torque/Maui or with the way the job is submitted from Galaxy.
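
One way to narrow this down might be to compare the qstat -f attributes of the Galaxy-submitted job with those of the manually submitted one. The following is a rough Python sketch under stated assumptions: the helper diff_job_attrs is hypothetical, the Galaxy-side values are copied from the qstat -f output in this message, and the values for the manual job are invented for illustration only, since its qstat -f output is not shown here.

```python
def diff_job_attrs(galaxy_job, manual_job, ignore=("Job Id", "ctime", "mtime", "qtime")):
    """Return {attribute: (galaxy_value, manual_value)} for attributes that differ.

    Attributes listed in `ignore` always differ between two jobs and are skipped.
    A value of None means the attribute is absent for that job.
    """
    skipped = set(ignore)
    keys = (set(galaxy_job) | set(manual_job)) - skipped
    return {
        key: (galaxy_job.get(key), manual_job.get(key))
        for key in sorted(keys)
        if galaxy_job.get(key) != manual_job.get(key)
    }


# Values taken from the qstat -f output quoted in this message.
galaxy_job = {
    "Job Id": "33.galaxy1.research.intra.ismaa.it",
    "job_state": "W",
    "queue": "batch",
    "substate": "37",
    "stagein": "/mnt/equallogic1/galaxy/tmp/dataset_18.dat@galaxy1:"
               "/mnt/equallogic1/galaxy/galaxy-dist/database/files/000/dataset_18.dat",
}

# HYPOTHETICAL values for the job submitted by hand with `qsub 27.sh`;
# its real attributes were not captured.
manual_job = {
    "Job Id": "34.galaxy1.research.intra.ismaa.it",
    "job_state": "R",
    "queue": "batch",
}

differences = diff_job_attrs(galaxy_job, manual_job)
for name, (galaxy_value, manual_value) in differences.items():
    print(f"{name}: galaxy={galaxy_value!r} manual={manual_value!r}")
```

Attributes that appear only on the Galaxy side (a value of None for the manual job) would be the first candidates to look at.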

Sorry for the very long e-mail and thank you very much for any help.

---
Marco