It seems to be an NFS-related issue. When I run a separate VM as an NFS server 
that hosts the Galaxy data (files, job workdir, tmp, ftp), the problems are gone. 
There’s probably an explanation for that, but I’m going to leave it at this.
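
For anyone who wants to check whether their own data mount behaves the same way outside Galaxy, here is a minimal sketch that exercises the same kind of calls (os.remove and shutil.rmtree) on scratch files and times them. It is not Galaxy code: the function name exercise_deletes, the file names and the n_files default are all made up for illustration. Since the hang looked timing-dependent, a clean run does not prove the mount is fine.

```python
import os
import shutil
import tempfile
import time


def exercise_deletes(base_dir, n_files=5):
    """Create and delete scratch files under base_dir, timing each delete.

    Hypothetical helper, not part of Galaxy. Returns a list of
    (operation, seconds) tuples in the order the deletes were issued.
    """
    timings = []
    # Work in a fresh scratch directory so no real data is ever touched.
    work_dir = tempfile.mkdtemp(prefix="delete_test_", dir=base_dir)

    # Single-file deletes, similar to removing an external metadata file.
    for i in range(n_files):
        path = os.path.join(work_dir, "metadata_%d.tmp" % i)
        with open(path, "w") as handle:
            handle.write("x")
        start = time.time()
        os.remove(path)
        timings.append(("os.remove", time.time() - start))

    # Recursive delete, similar to cleaning up a job working directory.
    nested = os.path.join(work_dir, "job_work")
    os.makedirs(nested)
    with open(os.path.join(nested, "output.tmp"), "w") as handle:
        handle.write("x")
    start = time.time()
    shutil.rmtree(work_dir)
    timings.append(("shutil.rmtree", time.time() - start))
    return timings
```

Calling exercise_deletes("/mnt/galaxy/data") on the master, while a job is finishing on the worker, would be the closest match to my setup; on a healthy filesystem each call should return almost immediately.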

cheers,
— 
Jorrit Boekel
Proteomics systems developer
BILS / Lehtiö lab
Scilifelab Stockholm, Sweden



On 07 May 2014, at 16:03, Jorrit Boekel <jorrit.boe...@scilifelab.se> wrote:

> I should probably mention that the data filesystem is NFS, exported by the 
> master from /mnt/galaxy/data and mounted on the worker. No separate 
> fileserver. Master is the one that hangs.
> 
> On 07 May 2014, at 15:57, Jorrit Boekel <jorrit.boe...@scilifelab.se> wrote:
> 
>> Dear all,
>> 
>> Has anyone tried running Galaxy on Ubuntu 14.04?
>> 
>> I’m trying a test setup on two virtual machines (worker + master) with a SLURM 
>> queue, and I’m running into a strange problem: when jobs finish, the master 
>> hangs, completely unresponsive, with CPU at 100% (as reported by virt-manager, 
>> not by top). Only drmaa jobs seem to be affected. After the hang, a reboot 
>> shows that the job has finished (and is green in the history).
>> 
>> It took me some debugging to figure out where things go wrong, but it seems 
>> it goes wrong when os.remove is called in lib/galaxy/datatypes/metadata.py 
>> in method cleanup_external_metadata. I can reproduce the problem by calling 
>> os.remove(metadatafile) by hand (in an interactive python shell) when using 
>> pdb to create a breakpoint just before the call. If I comment out the 
>> os.remove it runs on until it hits another delete call in 
>> lib/galaxy/jobs/__init__.py:
>>     self.app.object_store.delete(self.get_job(), base_dir='job_work', 
>>         entire_dir=True, dir_only=True, extra_dir=str(self.job_id))
>> It’s in the JobWrapper class in the cleanup() method. I should mention here 
>> that my galaxy version is a bit old since I’m running my own fork with local 
>> modifications on datatypes.
>> 
>> This object_store.delete call also ends up calling shutil.rmtree and 
>> os.remove. So remove calls to the filesystem seem to hang the whole thing, 
>> but only at this point in time. After a reboot, removing the files by hand 
>> is no problem, and stepping through with pdb sometimes avoids the hang (but 
>> if I just press continue, it hangs). I don’t know where to go from here with 
>> debugging, but has anyone seen anything similar? Right now it feels like it 
>> may be caused by timing rather than by an actual problem in the code.
>> 
> 


