Andy Doan writes:

> On 09/27/2012 07:08 AM, Dave Pigott wrote:
>> I noticed in the reports view, several jobs which have been stuck for a
>> while:

>> http://validation.linaro.org/lava-server/scheduler/job/33203
>> -------------------------------------------------------------------------------
>> origen02
>> ------------
>> A health check running for 4 days. Nothing in the log. I cancelled it,
>> but it was stuck in cancelling. So I went into admin, put it offline,
>> and then online to run a health check again. The job itself is still
>> showing as not finished. How do I track it down on control so that we
>> can kill it properly?

This looks like a job that failed to start properly.  The scheduler log
should have something to say about this...

>> http://validation.linaro.org/lava-server/scheduler/job/33382
>> -------------------------------------------------------------------------------
>> origen04
>> ------------
>> A regular job that failed, pushed its result bundle and then never quite
>> stopped running.  Same deal as 33203, but I can't get it to run its
>> health check. Any clues?

This is https://bugs.launchpad.net/lava-scheduler/+bug/1043059.  I've
gradually been adding more log output to home in on the cause but it's
pretty mysterious.  For this case though, the cleanup is dead easy:
find the scheduler monitor process (ps aux | grep origen04) and send
SIGINT to it.  This seems to poke the monitor into noticing that the
dispatcher has exited.
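
If you'd rather script the "find it and poke it" step, here is a rough
Python sketch.  It is only an illustration: the helper names are made
up, and it matches on the board name anywhere in the command line, so
look at the PIDs it finds before signalling anything.

```python
import os
import signal
import subprocess


def find_pids(ps_output, needle):
    """Return the PIDs in `ps -eo pid,args` output whose command line contains needle."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        # The header row is skipped because its first column is not a number.
        if len(parts) == 2 and parts[0].isdigit() and needle in parts[1]:
            pids.append(int(parts[0]))
    return pids


def interrupt_board_processes(board):
    """Send SIGINT to every process mentioning `board`; return the PIDs signalled."""
    output = subprocess.check_output(["ps", "-eo", "pid,args"])
    output = output.decode("utf-8", "replace")
    # Never signal ourselves, even if the board name is on our own command line.
    pids = [pid for pid in find_pids(output, board) if pid != os.getpid()]
    for pid in pids:
        os.kill(pid, signal.SIGINT)
    return pids
```

Calling interrupt_board_processes("origen04") would be the scripted
equivalent of the ps/grep/kill dance above.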

It seems someone has cleaned this one up in a more aggressive manner, so
we'll need to fix up the status in the admin panel.

I don't know why the health job isn't being run.

>> http://validation.linaro.org/lava-server/scheduler/job/33372
>> -------------------------------------------------------------------------------
>> panda09
>> ------------
>> Same as origen04 - can't get health check to run.

Same same.

> I don't have the best answer for this, but I'll share what I do.
>
> 1) run some "ps -ef| grep" type commands to see if a scheduler or 
> dispatcher process is still running for that board. I then kill those.

Please try to kill them with SIGINT before the more violent signals.  It
really seems to help for some reason.
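
Something like the following captures the order I mean.  It is a sketch,
not a tested tool: stop_gently and is_alive are names I made up, and the
grace period is a guess you should tune.

```python
import errno
import os
import signal
import time


def is_alive(pid):
    """True if pid still exists (signal 0 probes without delivering anything)."""
    try:
        os.kill(pid, 0)
    except OSError as exc:
        # ESRCH means no such process; EPERM means it exists but is not ours.
        return exc.errno != errno.ESRCH
    return True


def stop_gently(pid, grace=10.0, poll=0.5, send=os.kill, alive=is_alive):
    """Escalate SIGINT -> SIGTERM -> SIGKILL, waiting up to `grace` seconds after each.

    Returns the signal after which the process went away, or None if it was
    already gone or survived all three.
    """
    if not alive(pid):
        return None
    previous = None
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGKILL):
        try:
            send(pid, sig)
        except OSError as exc:
            if exc.errno == errno.ESRCH:
                return previous  # it died just before we escalated
            raise
        deadline = time.time() + grace
        while True:
            if not alive(pid):
                return sig
            if time.time() >= deadline:
                break
            time.sleep(poll)
        previous = sig
    return None
```

The send and alive parameters exist only so the escalation logic can be
exercised without real processes; in normal use the defaults are what
you want.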

> 2) usually the job and board get left a bit out of sync. So I run my 
> "cancel-job.py" script on control:/home/doanac/lava-scripts. It looks like:
>
>   #!/srv/lava/instances/production/bin/py
>
>   import sys
>   import lava_scheduler_app.models as models
>
>   for jid in sys.argv[1:]:
>       jid = int(jid)
>       print "canceling: %d" % jid
>       job = models.TestJob.objects.get(pk=jid)
>       job.status = job.CANCELED
>       job.save()

You should also set the status of the device the job was running on
to IDLE.
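
Roughly, the script could grow a helper like the one below.  Treat it as
a sketch under assumptions: I'm writing the attribute names
(actual_device, current_job) and the Device.IDLE constant from memory,
so check them against lava_scheduler_app.models before running it on
production.

```python
def cancel_and_release(job, idle_status):
    """Mark `job` canceled and hand its device back as idle.

    `actual_device` and `current_job` are assumed attribute names.
    Returns the device that was released, or None if nothing was released.
    """
    job.status = job.CANCELED
    job.save()

    device = getattr(job, "actual_device", None)
    if device is None:
        return None  # the job was never assigned a board

    current = getattr(device, "current_job", None)
    if current is not None and current != job:
        return None  # the board has already moved on to another job

    device.status = idle_status
    device.current_job = None
    device.save()
    return device


# Assumed usage inside a script like cancel-job.py:
#   job = models.TestJob.objects.get(pk=jid)
#   cancel_and_release(job, models.Device.IDLE)
```

The check on current_job is there so that a board which has since been
given a different job is left alone.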

> I suspect when mwhudson logs in he may have a better answer.

HTH, a bit.

Cheers,
mwh

_______________________________________________
linaro-validation mailing list
[email protected]
http://lists.linaro.org/mailman/listinfo/linaro-validation
