On Fri, Jan 30, 2015 at 9:24 AM, Shrum, Donald C <dcsh...@admin.fsu.edu> wrote:
> Hi all,
>
>
>
> I’ve configured one of our tools to submit jobs to our condor cluster.  I
> can see the job is routed to the condor runner:
>
>
>
> ==> handler4.log <==
>
> galaxy.jobs.handler DEBUG 2015-01-30 09:14:58,092 (508) Dispatching to
> condor runner
>
> galaxy.jobs DEBUG 2015-01-30 09:14:58,204 (508) Persisting job destination
> (destination id: condor)
>
>
>
> I can see that indeed the job is submitted to the condor cluster:
>
> [root@galaxy galaxy-dist]# condor_q
>
> -- Submitter: galaxy.local : <10.177.61.90:55265> : galaxy.local
>
> ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
>
>   21.0   galaxy          1/30 09:15   0+00:00:02 R  0   0.0  galaxy_509.sh
>
>
>
>
>
> The job begins to run:
>
> ==> handler4.log <==
>
> galaxy.jobs.runners.condor DEBUG 2015-01-30 09:15:03,827 (508/20) job is now
> running
>
> galaxy.jobs.runners.condor DEBUG 2015-01-30 09:15:05,183 (508/20) job has
> completed
>
>
>
> Galaxy is almost immediately removing the job working directory:
>
>
>
> Here is a snippet of the errors:
>
> ==> handler4.log <==
>
> galaxy.jobs.runners DEBUG 2015-01-30 09:15:06,372 (508/20) Unable to cleanup
> /panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_508.ec: [Errno 2]
> No such file or directory:
> '/panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_508.ec'
>
> galaxy.jobs DEBUG 2015-01-30 09:15:06,816 setting dataset state to ERROR
>
> galaxy.datatypes.metadata DEBUG 2015-01-30 09:15:06,996 Failed to cleanup
> MetadataTempFile temp files from
> /panfs/storage.local/galaxy-data/job_working_directory/000/508/metadata_out_HistoryDatasetAssociation_717_zrzVqh:
> No JSON object could be decoded
>
>
>
>
>
> Is it possible galaxy is attempting to query condor and see if the job is
> running, not finding anything and deciding that the job is not running and
> bailing out?
>
>
>
> I’ve reconstructed the process step by step using the logs but I have not
> been able to see exactly where the condor_submit command is shown so I can
> try to submit the same job manually.
>
>
>
> Does anyone have a suggestion for debugging this?

The two most relevant files would be:

https://bitbucket.org/galaxy/galaxy-central/src/tip/lib/galaxy/jobs/runners/condor.py
https://bitbucket.org/galaxy/galaxy-central/src/tip/lib/galaxy/jobs/runners/util/condor/__init__.py

The condor job logic strikes me as pretty brittle - it has always
worked for me when I have tested it, but given how it reads logs I
would imagine very small changes to condor's log output could cause it
to fail. So one thing to check is summarize_condor_log in
lib/galaxy/jobs/runners/util/condor/__init__.py - to see whether that
logic matches the way your condor produces log files.
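To illustrate the kind of brittleness I mean, here is a minimal sketch
of log scanning in the style of summarize_condor_log. It assumes the
classic HTCondor user-log format, where each event line starts with a
three-digit event code and a (cluster.proc.subproc) id (000 =
submitted, 001 = executing, 005 = terminated, 009 = aborted). The
function name and return shape here are illustrative, not Galaxy's
actual API - the point is that any deviation in your condor's log
format breaks this kind of parsing:

```python
import re

def summarize_user_log(log_text, cluster_id):
    """Return (submitted, running, terminated) flags for one cluster id.

    Illustrative sketch only: scans classic HTCondor user-log event
    lines such as '000 (021.000.000) 01/30 09:15 Job submitted ...'.
    """
    submitted = running = terminated = False
    # Each event line begins with a 3-digit event code followed by
    # the (cluster.proc.subproc) job id in parentheses.
    pattern = re.compile(r'^(\d{3}) \((\d+)\.\d+\.\d+\)', re.MULTILINE)
    for code, cluster in pattern.findall(log_text):
        if int(cluster) != cluster_id:
            continue
        if code == '000':
            submitted = True
        elif code == '001':
            running = True
        elif code in ('005', '009'):  # terminated or aborted
            terminated = True
    return submitted, running, terminated
```

If your site's condor writes, say, XML user logs or extra decoration
on event lines, a parser like this would silently see no events and
conclude the job never ran - which would match the symptoms above.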

Given the symptoms, it might also be worth just sleeping for 5
seconds in condor_submit in
lib/galaxy/jobs/runners/util/condor/__init__.py and then verifying
that Galaxy is actually parsing the correct external id.
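As a rough sketch of that check: condor_submit normally prints a line
like "1 job(s) submitted to cluster 21." on stdout, and the cluster
number is the external id Galaxy needs to track. The helper name below
is made up for illustration; only the condor_submit output format is
standard HTCondor behavior. If this kind of parse comes back empty or
with the wrong cluster id, the runner would lose track of the job:

```python
import re

def parse_external_id(submit_output):
    """Extract the cluster id from condor_submit stdout, or None.

    Illustrative helper: condor_submit reports e.g.
    'Submitting job(s).\n1 job(s) submitted to cluster 21.'
    """
    match = re.search(r'submitted to cluster (\d+)', submit_output)
    return match.group(1) if match else None
```

Logging the raw condor_submit output and the parsed id side by side
(which is roughly what the diff below adds) should make it obvious
whether the id Galaxy tracks matches what condor_q reports.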

Here is an untested diff that adds the sleep and some more log
statements that might help:
https://gist.github.com/jmchilton/d0afd7242370642d5b43

If you are able to fix the problem, please let us know how so we can
fix it upstream.

-John

>
>
>
> Thanks,
>
> Don
>
> Florida State University
>
> Research Computing Center
>
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   https://lists.galaxyproject.org/
>
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/