Re: [Linaro-validation] Health checks

Andy Doan Tue, 13 Nov 2012 08:32:00 -0800

On 11/12/2012 02:19 PM, Michael Hudson-Doyle wrote:

Andy Doan <[email protected]> writes:

On 11/09/2012 07:37 AM, Dave Pigott wrote:

Ignoring the one failure while I was getting tc2 up and running, we have the 
following:

------------
panda06
------------
http://validation.linaro.org/lava-server/scheduler/job/38176

The key part is in downloading root.tgz. It gets part way through and then we get 
"connection reset by peer" on every single retry until we fail.

I've put it back online to retest.


I think this is now our #1 failure issue in LAVA. We've looked at this
in the past, added debugging, made hypotheses. However, we really
haven't gotten to the bottom of this.

One data point I can add. When this happens, I've logged onto control
and run wget on the failed URL and it works. So, this doesn't appear to
be related to Apache or server load. I *think* I've also done wget's
from another system in the lab. So, I don't think it's a network/router
thing either.


Thanks for checking this.  My gut already said "duff networking in the
master image" but nice to have some data.

We already have some retry logic there, but maybe we need something more
sophisticated? (my gut says "no")


Well.  Rebooting the master image would almost certainly fix it.  Don't
know how to detect when that is the thing to do though.  Interesting
that we didn't see this on staging at all -- is it concentrated on
particular boards?  It might have a hardware aspect.

It seems to have happened on panda02 a bit more recently. However, in
general I think it's distributed across everything. Just last night we
failed downloading system.tar.bz2 for origen01.
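
For reference, here is a rough sketch of what "something more sophisticated" than a fixed retry loop might look like: retries with exponential backoff. This is not the existing LAVA retry code. The function name, its parameters, and the zero-argument fetch callable are all hypothetical, chosen only for illustration. If the root cause really is broken networking in the master image, waiting longer between attempts may not help at all, so treat this as a sketch under that caveat.

```python
import time


def retry_with_backoff(fetch, attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is a hypothetical zero-argument callable that performs the
    download and raises OSError (for example ConnectionResetError) on
    failure. The wait doubles after each failure, capped at max_delay.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError:
            if attempt == attempts:
                # Out of attempts: let the caller decide what to do next
                # (for instance, reboot the master image).
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

The sleep function is passed in as a parameter so the backoff schedule can be checked without actually waiting. Re-raising the last error keeps the decision about rebooting the master image with the caller.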



_______________________________________________
linaro-validation mailing list
[email protected]
http://lists.linaro.org/mailman/listinfo/linaro-validation
