I should probably mention that the data filesystem is NFS, exported by the 
master from /mnt/galaxy/data and mounted on the worker. There is no separate 
fileserver, and the master is the machine that hangs.
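
To check whether the hang depends on Galaxy at all, the delete could be tried in isolation on the exported directory. The following is only a minimal sketch, assuming Python 3 is available on the master; the function name timed_remove and the scratch-file prefix are made up for illustration and are not part of Galaxy.

```python
import os
import sys
import tempfile
import time


def timed_remove(directory):
    """Create a scratch file in `directory`, delete it with os.remove,
    and return the file path together with the seconds the delete took.

    Hypothetical helper for illustration; not part of Galaxy.
    """
    # mkstemp creates the file and returns an open descriptor plus its path.
    fd, path = tempfile.mkstemp(dir=directory, prefix="rmtest_")
    with os.fdopen(fd, "w") as handle:
        handle.write("scratch\n")
        handle.flush()
        # Push the data to the filesystem before timing the delete.
        os.fsync(handle.fileno())

    start = time.monotonic()
    os.remove(path)  # the same call that hangs inside Galaxy
    elapsed = time.monotonic() - start
    return path, elapsed


if __name__ == "__main__":
    # Directory to test; falls back to the system temp dir if none is given.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    removed, seconds = timed_remove(target)
    print("removed %s in %.3f s" % (removed, seconds))
```

Run with /mnt/galaxy/data as the argument, once while a job is finishing and once on an idle system. If the delete only stalls in the first case, that would point towards timing rather than the Galaxy code.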

Jorrit Boekel
Proteomics systems developer
BILS / Lehtiö lab
Scilifelab Stockholm, Sweden

On 07 May 2014, at 15:57, Jorrit Boekel <jorrit.boe...@scilifelab.se> wrote:

> Dear all,
> Has anyone tried running Galaxy on Ubuntu 14.04?
> I’m trying a test setup on two virtual machines (worker+master) with a SLURM 
> queue. I’m running into strange problems when jobs finish: the master hangs 
> and becomes completely unresponsive, with CPU at 100% (as reported by 
> virt-manager, not by top). Only DRMAA jobs seem to be affected. After the 
> hang, a reboot shows the job is finished (and green in the history).
> It took me some debugging to figure out where things go wrong, but the hang 
> seems to occur when os.remove is called in the cleanup_external_metadata 
> method in lib/galaxy/datatypes/metadata.py. I can reproduce the problem by 
> setting a pdb breakpoint just before the call and then running 
> os.remove(metadatafile) by hand (in an interactive Python shell). If I 
> comment out the os.remove, the job runs on until it hits another delete call 
> in lib/galaxy/jobs/__init__.py:
> self.app.object_store.delete(self.get_job(), base_dir='job_work', 
> entire_dir=True, dir_only=True, extra_dir=str(self.job_id))
> It’s in the cleanup() method of the JobWrapper class. I should mention here 
> that my Galaxy version is a bit old, since I’m running my own fork with local 
> modifications to datatypes.
> This object_store.delete also ends up calling shutil.rmtree and os.remove. 
> So remove calls to the filesystem seem to hang the whole machine, but only at 
> this point in time. After a reboot, removing the files by hand is no problem, 
> and stepping through with pdb also sometimes avoids the hang (but if I just 
> press continue, it hangs). 
> I don’t know where to go from here with debugging, but has anyone seen 
> anything similar? Right now it feels like it may be caused by timing rather 
> than by an actual code problem.
> cheers,
> — 
> Jorrit Boekel
> Proteomics systems developer
> BILS / Lehtiö lab
> Scilifelab Stockholm, Sweden
