Reply to my own post...
So we nailed it down, well, almost. The problem comes from the fact that we are
using the 'run jobs as real user' feature. When we switch back to the normal
'all jobs run as galaxy' mode, it works fine.
When running as real user, the error we get from LSF is "can't open
lsf.sudoers" (you need to hack the drmaa C code to get this logged). We
traced the runner operations with strace, and the process really does try to
read this file; it fails because LSF requires the file to be owned by root with
600 permissions. The strange thing is that when we disable the 'run as user'
feature, this goes away. It might be that the setuid() executed during the 'run
as real user' procedure somehow forces the process to access this file (i.e.
LSF imposes it in some way), but we are lost. So we just run as galaxy now...
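For anyone wanting to check the same thing on their side, here is a minimal Python sketch (not what we ran; we used strace) that reports the owner, the permission bits and whether the calling user can read a given file. The helper name describe_access is made up for illustration, and the location of lsf.sudoers depends on the installation, so the path has to be passed on the command line.

```python
import os
import stat
import sys


def describe_access(path):
    """Return owner uid, permission bits and readability of path for the caller."""
    st = os.stat(path)
    return {
        "owner_uid": st.st_uid,
        "mode": oct(stat.S_IMODE(st.st_mode)),
        # os.access() checks against the real uid/gid of the process,
        # which matters when a setuid() call is involved.
        "readable": os.access(path, os.R_OK),
    }


if __name__ == "__main__":
    # Hypothetical usage: pass the path to lsf.sudoers as an argument.
    for path in sys.argv[1:]:
        print(path, describe_access(path))
```

Running it once as galaxy and once as a real user should show whether that user can read the file at all.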
Does anybody run LSF with the "run as real user" feature on?
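Regarding the job id extraction mentioned in my original message below: our patch is in the drmaa C code, but the idea can be sketched in Python as follows. The pattern and the name extract_job_id are illustrative assumptions based on the message format shown in the log, not the actual patch.

```python
import re

# Matches the numeric id in messages such as
# "Job <5160> is submitted to default queue <medium_priority>."
JOB_ID_RE = re.compile(r"Job <(\d+)>")


def extract_job_id(message):
    """Return the numeric job id as a string, or None if the message has none."""
    match = JOB_ID_RE.search(message)
    return match.group(1) if match else None
```

The point is simply to hand the bare number to the later job status calls instead of the whole submission message.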
On 18 Dec 2012, at 10:47, Charles Girardot wrote:
> Hi all,
> We are currently changing our cluster management from PBSPro to LSF (LSF 7
> Update 6). We have a running Galaxy using drmaa with PBSPro (with the "jobs
> are submitted as real users" option). We expected an easy transition to LSF
> i.e. simply changing the drmaa implementation but of course, life is not that
> simple. So basically it is not working. We have tried with drmaa 1.0.4 and
> 1.0.3 (downloaded from http://sourceforge.net/projects/lsf-drmaa/ ).
> Before getting to the symptoms: does anybody successfully run Galaxy with
> drmaa and LSF 7 Update 6?
> Now the symptoms:
> - first we had an error saying something like "queued as Job <5160> is
> submitted to default queue <medium_priority>" is not an id
> - we traced this in the drmaa C code and added a regex to actually extract
> the job id (if you are successfully running Galaxy with drmaa and LSF 7
> Update 6, did you also have to do this?);
> but then a new error came:
> - jobs are successfully sent to the LSF queue and submitted to a node
> - after a few ms we get an error:
> galaxy.jobs.runners.drmaa DEBUG 2012-12-17 11:14:29,227 (1699) submitting
> with credentials: sauer [uid: 8483]
> galaxy.jobs.runners.drmaa DEBUG 2012-12-17 11:14:29,229 (1699) Job script for
> external submission is: /g/galaxy/galaxy-dev_data/pbs/1699.jt_json
> galaxy.jobs.runners.drmaa INFO 2012-12-17 11:14:29,464 (1699) queued as Job
> <5160> is submitted to default queue <medium_priority>.
> E #2bae [ 0.00] * call to lsb_openjobinfo returned with error 1:No
> matching job found mapped to 1040:Job does not exist in DRMs queue.
> galaxy.jobs.runners.drmaa DEBUG 2012-12-17 11:14:30,275 (1699/Job <5160> is
> submitted to default queue <medium_priority>.
> 5160) job left DRM queue with following message: code 18: lsb_openjobinfo:
> XDR operation error
> We are lost and the PBSPro license runs out on January 1 so we badly need to
> fix this...
> PS: Note that if we simply switch back to PBSPro, everything works fine,
> which tells us that the Galaxy setup is ok.
> Thx for your help
> Charles Girardot
> European Molecular Biology Laboratory
> E. Furlong Group
> Tel: +49 6221 387 -8585 (V205) or 8433 (V320)
> Fax: +49-(0)6221-387-8166
> Email: charles.girar...@embl.de
> Room V205/V320
> Meyerhofstraße 1,
> 69117 Heidelberg, Germany
> Please keep all replies on the list by using "reply all"
> in your mail client. To manage your subscriptions to this
> and other Galaxy lists, please use the interface at: