On Nov 12, 2012, at 10:23 AM, Jorrit Boekel <jorrit.boe...@scilifelab.se> wrote:
> I was therefore looking for fault tolerance mechanisms in the galaxy project,
> which I seem to remember existed. Somehow I can't find anything about it
> right now though.
> I've tested a little bit, and it seems that as soon as one reboots instances
> or manually kills a job or task, the whole job is deleted and set to error
> state. I am not that knowledgeable in cluster computing, so I don't really
> know what handles what here, but this would be an ideal starting point to
> learn something about SGE and queue handling. Is there any mechanism in place
> that deals with node failure, network problems, etc? If not, would it be hard
> to implement?
You're correct that jobs are currently set to the error state and need to be
manually rerun by the Galaxy user. There isn't anything in place for
automatic retry after spot instance failure, but this is definitely something
we plan to implement in the near term - a generalized retry and resume
mechanism will be useful for both cloud and local instances.
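
To give a rough idea of the kind of bounded retry being discussed, here is a
minimal sketch. It is not Galaxy code and does not reflect any existing Galaxy
or SGE API: JobFailed, run_with_retry, and the submit callable are all
hypothetical names used only for illustration, and a real mechanism would also
need to handle resuming partial work and cleaning up after the failed attempt.

```python
import time


class JobFailed(Exception):
    """Hypothetical error a job runner might raise when a job dies,
    for example because its node was rebooted or terminated."""


def run_with_retry(submit, max_attempts=3, delay_seconds=0.0):
    """Call submit() until it succeeds or max_attempts is reached.

    submit is an assumed zero-argument callable that runs the job and
    either returns its result or raises JobFailed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except JobFailed as err:
            last_error = err
            # Wait before resubmitting, but not after the final attempt.
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    # Every attempt failed: surface the last failure to the caller,
    # which is the point where the job would be set to error.
    raise last_error
```

With something like this, a job would only reach the error state after every
attempt had failed, rather than on the first node failure.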