Re: [galaxy-dev] Problem with Torque/Maui

2011-06-02 Thread Nate Coraor
Hi Marco,

Thanks for all of the details, they make a big difference when
troubleshooting.  Sorry for the delay in response.

Can you ensure that you don't have any of the pbs_* options set in
universe_wsgi.ini?  I noticed that there are stagein/stageouts
set on the job.
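
A quick way to check for this is to scan the ini file for uncommented
options whose names start with pbs_. The following is only a minimal
sketch: the helper name find_pbs_options and the option names in the
sample text are placeholders made up for illustration, not real Galaxy
settings, and it assumes plain "name = value" lines.

```python
def find_pbs_options(ini_text):
    """Return (name, value) pairs for every uncommented pbs_* option."""
    found = []
    for raw in ini_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and section headers such as [app:main].
        if not line or line[0] in "#;[":
            continue
        name, sep, value = line.partition("=")
        name = name.strip()
        if sep and name.startswith("pbs_"):
            found.append((name, value.strip()))
    return found


# Hypothetical excerpt; these option names are placeholders only.
sample = """\
[app:main]
some_other_option = value
#pbs_example_commented = ignored
pbs_example_option = galaxy1
"""

print(find_pbs_options(sample))
```

To run it against a real file, pass open("universe_wsgi.ini").read()
instead of the sample text. An empty list means no pbs_* option is
active.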

You may also need to set $usecp in mom_priv/config on your execution
hosts to prevent pbs_mom from trying to rcp/scp the error and output
files back to galaxy1.  $usecp instructs pbs_mom to consider /path on
the execution host to be the same filesystem as on the submission host,
e.g.:

$usecp *:/mnt/equallogic1 /mnt/equallogic1
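
To illustrate what that line is meant to do, here is a rough sketch of
the mapping, using the Error_Path from the qstat -f output in the
original message. It is an approximation only: the function names are
made up, host matching is done with Python's fnmatch, and path matching
is a plain prefix test, so the real pbs_mom logic may differ in detail.

```python
from fnmatch import fnmatch


def parse_usecp(line):
    """Split '$usecp host_pattern:remote_prefix local_prefix' into parts."""
    keyword, spec, local_prefix = line.split()
    if keyword != "$usecp":
        raise ValueError("not a $usecp line: %r" % line)
    host_pattern, remote_prefix = spec.split(":", 1)
    return host_pattern, remote_prefix, local_prefix


def local_copy_target(usecp_line, destination):
    """Return the local path a plain copy would use, or None if the
    rule does not cover this host:path destination."""
    host_pattern, remote_prefix, local_prefix = parse_usecp(usecp_line)
    host, path = destination.split(":", 1)
    if fnmatch(host, host_pattern) and path.startswith(remote_prefix):
        # Swap the remote prefix for the local one; the rest is unchanged.
        return local_prefix + path[len(remote_prefix):]
    return None


rule = "$usecp *:/mnt/equallogic1 /mnt/equallogic1"
dest = "galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/pbs/27.e"
print(local_copy_target(rule, dest))
```

A destination outside /mnt/equallogic1 returns None here, which
corresponds to the case where pbs_mom would fall back to a remote copy.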

--nate

[galaxy-dev] Problem with Torque/Maui

2011-05-04 Thread Marco Moretto
Hi all,
first of all, thanks to the Galaxy Team for this really useful software.
I don't really know whether my problem is related to Galaxy or to
Torque/Maui, but I didn't find any solution on either the Torque or the Maui
user list, so I hope that some of you with more experience can give me
some good advice. I'm trying to set up Galaxy in a small local virtual
environment in order to test it. I started with 2 virtual Ubuntu servers
called galaxy1 and galaxy2. On galaxy1 I successfully installed Galaxy,
Apache, Torque and Maui. I'm using Postgres as the DBMS; it is installed on
another "real" DB server. The virtual server galaxy2 is used as a node. Galaxy
works like a charm locally, but when I try to use Torque problems arise.
Torque alone works correctly: I can submit a job with qsub
and everything works. The 2 virtual servers (galaxy1 and galaxy2) share a
directory (through NFS) in which I installed Galaxy following the "unified
method" from the documentation.
Now, as I said, Galaxy alone works and Torque/Maui alone works. When I put the
two together, nothing works.
As a test I upload (using the local runner) a GFF file. Then I try to
filter the GFF using "Filter and sort -> Extract features". When I run
this tool, the corresponding job in the Torque queue stays in the Hold
state forever. Here is some output from the diagnostic programs.
The diagnose -j reports the following:

Name State Par Proc QOS WCLimit R Min User   Group  Account QueuedTime Network Opsys  Arch   Mem Disk Procs Class     Features

29   Hold  DEF 1    DEF 1:00:00 0 1   galaxy galaxy -       00:02:36   [NONE]  [NONE] [NONE] >=0 >=0  NC0   [batch:1] [NONE]

While the showq command reports

ACTIVE JOBS
JOBNAME  USERNAME  STATE  PROC  REMAINING  STARTTIME

0 Active Jobs   0 of 1 Processors Active (0.00%)

IDLE JOBS
JOBNAME  USERNAME  STATE  PROC  WCLIMIT  QUEUETIME

0 Idle Jobs

BLOCKED JOBS
JOBNAME  USERNAME  STATE  PROC  WCLIMIT  QUEUETIME
29       galaxy    Hold   1     1:00:00  Wed May  4 03:56:40


The checkjob reports:
checking job 29

State: Hold
Creds:  user:galaxy  group:galaxy  class:batch  qos:DEFAULT
WallTime: 00:00:00 of 1:00:00
SubmitTime: Wed May  4 03:56:40
  (Time Queued  Total: 00:03:07  Eligible: 00:00:01)

StartDate: -00:03:06  Wed May  4 03:56:41
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
PE:  1.00  StartPriority:  1
cannot select job 29 for partition DEFAULT (non-idle state 'Hold')

The qstat -f reports

Job Id: 33.galaxy1.research.intra.ismaa.it
    Job_Name = 27_extract_features1_marco.more...@iasma.it
    Job_Owner = gal...@galaxy1.research.intra.ismaa.it
    job_state = W
    queue = batch
    server = galaxy1.research.intra.ismaa.it
    ctime = Wed May  4 04:56:36 2011
    Error_Path = galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/pbs/27.e
    exec_host = galaxy2/0
    exec_port = 15003
    Execution_Time = Wed May  4 05:26:41 2011
    mtime = Wed May  4 04:56:37 2011
    Output_Path = galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/pbs/27.o
    qtime = Wed May  4 04:56:36 2011
    Resource_List.neednodes = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 01:00:00
    stagein = /mnt/equallogic1/galaxy/tmp/dataset_18.dat@galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/files/000/dataset_18.dat,
        /mnt/equallogic1/galaxy/tmp/dataset_30.dat@galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/files/000/dataset_30.dat
    stageout = /mnt/equallogic1/galaxy/galaxy-dist/database/files/000/dataset_30.dat@galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/files/000/dataset_30.dat
    substate = 37
    Variable_List = PBS_O_QUEUE=batch,
        PBS_O_HOST=galaxy1.research.intra.ismaa.it
    euser = galaxy
    egroup = galaxy
    hashname = 33.galaxy1.research.intra.ismaa.it
    queue_rank = 33
    queue_type = E

and finally the tracejob reports

/var/spool/torque/server_priv/accounting/20110504: Permission denied
/var/spool/torque/mom_logs/20110504: No such file or directory
/var/spool/torque/sched_logs/20110504: No such file or directory

Job: 33.galaxy1.research.intra.ismaa.it

05/04/2011 04:56:36  S    enqueuing into batch, state 1 hop 1
05/04/2011 04:56:36  S    Job Queued at request of
                          gal...@galaxy1.research.intra.ismaa.it, owner =
                          gal...@galaxy1.research.intra.ismaa.it, job name =
                          27_extract_features1_marco.more...@iasma.it, queue
                          = batch
05/04/2011 04:56:37  S    Job Run at request of
                          gal...@galaxy1.research.intra.ismaa.it
05/04/2011 04:56:41  S    Email 's' to
                          gal...@galax