I have looked into this matter a little bit more, and it looks like this:
- tasked job is split
- task commands are sent to workers (I am running 8-core high-CPU extra
large workers on EC2)
- per task, worker runs env.sh for the respective tool
- per task, worker runs scripts/extract_dataset_part.py
- this script issues import statements (the ones for simplejson and
galaxy.model.mapping have caused me problems)
- which lead to unzipping .so libraries from python eggs into the nodes'
egg cache
- this runs into lib/pkg_resources.py and its _bypass_ensure_directory
method that creates the temporary dir for the egg unzip
- since there are 8 processes on the node, sometimes this method tries
to mkdir a directory that was just made by another process after the
existence check
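The steps above boil down to a classic check-then-create race. A minimal sketch of the suspected pattern and an EAFP-style alternative that tolerates a concurrent winner (this is an illustration, not the actual pkg_resources code):

```python
import errno
import os

def racy_ensure_directory(path):
    # Check-then-create: between isdir() and mkdir(), another worker
    # process may create the directory, so mkdir() raises
    # OSError [Errno 17] File exists.
    if not os.path.isdir(path):
        os.mkdir(path)

def safe_ensure_directory(path):
    # EAFP variant: attempt the creation and ignore only the case
    # where some other process got there first.
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise
```

With the second variant, eight workers can race on the same egg-cache directory and every call succeeds, since EEXIST simply means another process already did the work.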
That last point is my guess. I don't really know how to solve this in
a non-hackish way, so until someone finds a better fix, I may read from an
'eggs_extracted.txt' file to determine whether the eggs have already been
extracted, and of course lock the file when writing to it.
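A minimal sketch of that sentinel-file idea, using an advisory fcntl.flock lock so only one worker per node performs the extraction (the file names and the helper are hypothetical, not Galaxy code, and flock is POSIX-only):

```python
import fcntl
import os

SENTINEL = 'eggs_extracted.txt'  # hypothetical marker file

def extract_eggs_once(workdir, extract):
    """Run extract() at most once per node, guarded by a lock file."""
    marker = os.path.join(workdir, SENTINEL)
    lock_path = marker + '.lock'
    with open(lock_path, 'w') as lock:
        # Exclusive advisory lock: other workers block here until
        # the first one finishes extracting and writes the marker.
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            if not os.path.exists(marker):
                extract()  # unzip the eggs exactly once
                with open(marker, 'w') as m:
                    m.write('done\n')
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

The lock closes the window between checking the marker and writing it, which is the same check-then-act gap that bites pkg_resources in the first place.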
On 09/14/2012 10:57 AM, Jorrit Boekel wrote:
I am running galaxy-dist on Amazon EC2 through Cloudman, and am using
the enable_tasked_jobs to run jobs in parallel. Yes, I know it's not
recommended in production. My jobs usually get split into 72 parts, and
sometimes (but not always, maybe in 30-50% of cases), errors are
returned concerning the python egg cache, usually:
[Errno 17] File exists: '/home/galaxy/.python-eggs'
or something like
[Errno 17] File exists:
The errors arise, AFAIK, when scripts/extract_dataset_part.py is
run. I am guessing that the temporary python egg dir is created for
each of the mentioned 72 tasks, that these creations sometimes
coincide, and that this leads to an error.
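That guess is easy to confirm in isolation: a second os.mkdir on the same path raises exactly this error. A standalone repro (not Galaxy code) simulating two workers hitting the same cache directory:

```python
import errno
import os
import tempfile

# Stand-in for the shared egg-cache directory on a node.
path = os.path.join(tempfile.mkdtemp(), '.python-eggs')

os.mkdir(path)  # first worker wins the race
try:
    os.mkdir(path)  # second worker loses and hits the error
except OSError as e:
    # This is the "[Errno 17] File exists" seen in the job output.
    print(e.errno == errno.EEXIST)  # → True
```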
I would like to solve this problem, but before doing so, I'd like to
know if someone else has already fixed it in a galaxy-central changeset.