Andy Doan <[email protected]> writes:

> On 09/27/2012 07:08 AM, Dave Pigott wrote:
>> I noticed in the reports view, several jobs which have been stuck for a
>> while:
>>
>> http://validation.linaro.org/lava-server/scheduler/job/33203
>> -------------------------------------------------------------------------------
>> origen02
>> ------------
>> A health check running for 4 days. Nothing in the log. I cancelled it,
>> but it was stuck in cancelling. So I went into admin, put it offline,
>> and then online to run a health check again. The job itself is still
>> showing as not finished. How do I track it down on control so that we
>> can kill it properly?

This looks like a job that failed to start properly. There should be
stuff in the scheduler log about this...

>> http://validation.linaro.org/lava-server/scheduler/job/33382
>> -------------------------------------------------------------------------------
>> origen04
>> ------------
>> A regular job that failed, pushed its result bundle and then never quite
>> stopped running. Same deal as 33203, but I can't get it to run its
>> health check. Any clues?

This is https://bugs.launchpad.net/lava-scheduler/+bug/1043059. I've
gradually been adding more log output to home in on the cause but it's
pretty mysterious.

For this case though, the clean up is dead easy: find the scheduler
monitor process (ps aux | grep origen04) and send SIGINT to it. This
seems to poke the monitor into noticing that the dispatcher has exited.

It seems someone has cleaned this one up in a more aggressive manner, so
we'll need to fix up the status in the admin panel.

I don't know why the health job isn't being run.

>> http://validation.linaro.org/lava-server/scheduler/job/33372
>> -------------------------------------------------------------------------------
>> panda09
>> ------------
>> Same as origen04 - can't get health check to run.

Same same.

> I don't have the best answer for this, but I'll share what I do.
>
> 1) run some "ps -ef| grep" type commands to see if a scheduler or
> dispatcher process is still running for that board. I then kill those.

Please try to kill them with SIGINT before the more violent signals. It
really seems to help for some reason.

> 2) usually the job and board get left a bit out of sync. So I run my
> "cancel-job.py" script on control:/home/doanac/lava-scripts. It looks like:
>
> #!/srv/lava/instances/production/bin/py
>
> import sys
> import lava_scheduler_app.models as models
>
> for jid in sys.argv[1:]:
>     jid = int(jid)
>     print "canceling: %d" % jid
>     job = models.TestJob.objects.get(pk=jid)
>     job.status = job.CANCELED
>     job.save()

You should also set the status of the device the job was running on to
IDLE.

> I suspect when mwhudson logs in he may have a better answer.

HTH, a bit.

Cheers,
mwh

_______________________________________________
linaro-validation mailing list
[email protected]
http://lists.linaro.org/mailman/listinfo/linaro-validation
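
The "SIGINT before the more violent signals" advice above could be scripted roughly as follows. This is only a sketch and is not part of lava-scheduler: the names stop_gently and is_alive, the grace period, and the SIGINT, SIGTERM, SIGKILL escalation order are all assumptions made for illustration. The pid is whatever the ps/grep step turned up for the stuck monitor or dispatcher.

```python
import os
import signal
import time


def is_alive(pid):
    """Return True if a process with this pid still exists (hypothetical helper)."""
    try:
        # If the process is our own child, reap it so a zombie is not
        # mistaken for a live process.
        reaped, _ = os.waitpid(pid, os.WNOHANG)
        if reaped == pid:
            return False
    except ChildProcessError:
        pass  # not our child; fall through to the signal-0 probe
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permissions
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # it exists, we just may not signal it
    return True


def stop_gently(pid, grace=5.0,
                signals=(signal.SIGINT, signal.SIGTERM, signal.SIGKILL)):
    """Send each signal in turn, waiting up to `grace` seconds after each.

    Returns the signal after which the process was found to be gone,
    or None if it survived every signal. The escalation order and the
    grace period are assumptions, not anything the scheduler mandates.
    """
    for sig in signals:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            return sig  # already gone by the time we tried this signal
        deadline = time.time() + grace
        while time.time() < deadline:
            if not is_alive(pid):
                return sig
            time.sleep(0.1)
    return None
```

Something like stop_gently(12345), with 12345 standing in for the pid from the ps output, would then try SIGINT first and only move on to SIGTERM and SIGKILL if the process is still around after the grace period.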
