Hi Scott,

I see you've been working on this - it looks very comprehensive:
https://bitbucket.org/galaxy/galaxy-central/changeset/3d07a7800f9a

I can't test this just now, but if I run into any issues with the new
code later on, I'll be in touch.

Thanks,

Peter

On Fri, Sep 21, 2012 at 8:42 PM, Scott McManus <scottmcma...@gatech.edu> wrote:
>
> Thanks, Peter! Those are good suggestions. I'll look into it soon.
>
> -Scott
>
> ----- Original Message -----
>> Hi all,
>>
>> I've been running into some sporadic errors on our cluster while
>> using the latest development Galaxy, and the current error
>> handling has made them quite difficult to diagnose.
>>
>> From a user's perspective, the jobs seem to run, get submitted to
>> the cluster, and finish, and the data looks OK via the 'eye' view
>> icon, but the dataset is red in the history with:
>>
>> 0 bytes
>> An error occurred running this job: info unavailable
>>
>> Furthermore, the stdout and stderr via the 'info' icon are blank.
>>
>> From watching the log (and adding more diagnostic lines), what
>> is happening is that the job is being split and sent out to the
>> cluster fine, and starts running. If one of the tasks fails (and
>> this seems to be happening due to some sort of file system error
>> on our cluster), Galaxy spots this and kills the rest of the
>> tasks. That's good.
>>
>> The problem is that it fails to record why the job died.
>> This is my suggestion for now - it would be nice to go further
>> and fill in the info text shown in the history peek as well:
>>
>> $ hg diff
>> diff -r 4de1d566e9f8 lib/galaxy/jobs/__init__.py
>> --- a/lib/galaxy/jobs/__init__.py     Fri Sep 21 11:02:50 2012 +0100
>> +++ b/lib/galaxy/jobs/__init__.py     Fri Sep 21 11:59:27 2012 +0100
>> @@ -1061,6 +1061,14 @@
>>              log.error( "stderr for job %d is greater than 32K, only first part will be logged to database" % task.id )
>>          task.stderr = stderr[:32768]
>>          task.command_line = self.command_line
>> +
>> +        if task.state == task.states.ERROR:
>> +            # If failed, will kill the other tasks in this job. Record this
>> +            # task's stdout/stderr as should be useful to explain the failure:
>> +            job = self.get_job()
>> +            job.stdout = ("(From one sub-task:)\n" + task.stdout)[:32768]
>> +            job.stderr = ("(From one sub-task:)\n" + task.stderr)[:32768]
>> +
>> +
>>          self.sa_session.flush()
>>          log.debug( 'task %d ended' % self.task_id )
>>
>>
>> Regards,
>>
>> Peter
>>