This question pertains to a local Galaxy install where jobs are submitted to a 
cluster running LSF.

I periodically get an error from a Galaxy job ("Job output not returned from 
cluster") even though the job completes properly on the cluster. In researching 
this issue, our systems administrator discovered that the error seems to be 
caused by NFS caching: the job finishes on the job node, but the Galaxy server 
does not see that reflected in the output file because the NFS cache is slow to 
update. The only solution he found was to turn off caching on the file system. 
That is not possible in our case because the resulting performance hit would 
not be acceptable on the shared cluster. Most other file systems on the cluster 
are also NFS, so moving the database/pbs folder to another file system is not 
an option.
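
To make the timing problem concrete, here is a minimal sketch of a 
retry-with-delay check for a file that a compute node has already written but 
that the NFS client has not yet noticed. This is not Galaxy's actual code: the 
function name, the retry count, and the delay are all hypothetical. Listing the 
parent directory is a commonly used nudge that may prompt the client to 
revalidate its cached entries, but that behavior depends on mount options and 
is not guaranteed.

```python
import os
import time


def wait_for_output(path, attempts=5, delay=2.0):
    """Return True once `path` is visible to this host, else False.

    Hypothetical helper: retries a stat() on the file, pausing between
    attempts so a stale NFS attribute cache has time to expire.
    """
    parent = os.path.dirname(os.path.abspath(path))
    for attempt in range(attempts):
        try:
            # Assumption: re-reading the directory may make the NFS client
            # revalidate cached lookups. Harmless if it has no effect.
            os.listdir(parent)
        except OSError:
            pass
        try:
            os.stat(path)
            return True
        except OSError:
            # Not visible yet; wait before the next attempt (if any).
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

A caller would only report the job as failed after `wait_for_output` returns 
False, so a short cache delay no longer looks like missing output.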

I can generally work around the NFS cache issue, but it is now leading to a 
second problem. Since the job appears to Galaxy to be in a failed state (it 
shows up as red in the history), I can't seem to use the output file as input 
to the next step, even though the file is there and I can view it using the 
"eye" icon. The file attribute is set correctly.

I assume a possible solution may be to reset the "failed" flag on the history 
item. Would this need to be done in the database? Downloading and then 
re-uploading the result file (a 25+ GB SAM file in this case) may be a 
workaround but it is not very practical.
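
As an illustration of the kind of reset being asked about, here is a minimal 
sketch run against a throwaway in-memory SQLite database. The table name 
`dataset`, the `state` column, and the values 'error' and 'ok' are assumptions 
made for this example and would need to be checked against the real Galaxy 
schema; `reset_dataset_state` is a hypothetical helper, not a Galaxy function.

```python
import sqlite3


def reset_dataset_state(conn, dataset_id, new_state="ok"):
    """Flip one dataset from 'error' to `new_state`; return rows changed.

    The WHERE clause only matches rows currently in the 'error' state,
    so datasets in any other state are left untouched.
    """
    cur = conn.execute(
        "UPDATE dataset SET state = ? WHERE id = ? AND state = 'error'",
        (new_state, dataset_id),
    )
    conn.commit()
    return cur.rowcount


# Demo on a scratch database with an assumed, simplified schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (id INTEGER PRIMARY KEY, state TEXT)")
conn.execute("INSERT INTO dataset (id, state) VALUES (1, 'error')")
changed = reset_dataset_state(conn, 1)
```

Anything like this should be tried on a backup copy of the database first, 
since other tables may also record the job's failure.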

Any ideas/suggestions?

