On 20.03.2012 20:48, Paul Larson wrote:
On Tue, Mar 20, 2012 at 12:39 PM, Zygmunt Krynicki
<[email protected]> wrote:

    I think that we have two general problems with timeouts:

    1) We've pulled most of the initial values out of a hat

A hat of trial and error to try to come up with reasonable defaults.  We
don't want to be waiting for 5 hours for a reboot to happen if it's
failed, nor do we want to only give it 3 seconds.

    2) Timeouts are expressions, not constants.

Parameters actually, with defaults

    I'm very glad that with our health jobs we're actually looking at
    the constants we're using. I'd like to see a more scientific and
    thorough approach to this problem:

    -> Keep a shared google doc spreadsheet with timeouts for various
    actions that we put in our health jobs
    -> Track that per board
    -> Track the age and cycle count for each SD we purchase and
    allocate in the lab
    -> Benchmark the SD periodically

    Given that data we could turn timeout constants into timeout
    expressions that can use the following variables:

    $normalized_cpu_time
    $average_sd_speed

Unfortunately, there are more variables than that. In an ideal world, I
would agree with you, but in the case of vexpress, we have an operation
that should normally take 30 minutes taking more like 5 hours!  In this
case, the ARM lt is looking into the performance angle to see if there's
something that can be improved there.  What Dave is trying to get at,
though, is that we support a timeout parameter for many other
operations, but not for deployment.

I don't see the problem. If it takes that long on vexpress, then
$average_sd_speed will be very low for that board. This will be
per-device, mind you.
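
A rough sketch of what such a per-device timeout expression could look
like, in Python. Everything here is an assumption made for illustration
and not existing LAVA code: the names DeviceProfile and deploy_timeout,
the units (MB/s for $average_sd_speed, a scale factor relative to a
reference board for $normalized_cpu_time), and the safety factor.

```python
from dataclasses import dataclass


@dataclass
class DeviceProfile:
    # Hypothetical per-device measurements; field names mirror the
    # variables proposed in the thread, units are assumed.
    normalized_cpu_time: float  # 1.0 = reference board, 2.0 = twice as slow
    average_sd_speed: float     # sustained SD write speed in MB/s


def deploy_timeout(profile, image_mb, cpu_seconds_on_reference,
                   safety_factor=2.0):
    """Return a timeout in seconds computed from measurements
    instead of a hard-coded constant."""
    if profile.average_sd_speed <= 0:
        raise ValueError("average_sd_speed must be positive")
    # Time to push the image through the SD card at the measured speed.
    write_seconds = image_mb / profile.average_sd_speed
    # CPU-bound work, scaled from the reference board to this one.
    cpu_seconds = cpu_seconds_on_reference * profile.normalized_cpu_time
    return safety_factor * (write_seconds + cpu_seconds)


# Two made-up boards that differ only in measured SD speed.
fast_board = DeviceProfile(normalized_cpu_time=1.0, average_sd_speed=10.0)
slow_board = DeviceProfile(normalized_cpu_time=1.0, average_sd_speed=0.5)
```

With these made-up numbers, a slow card simply yields a longer timeout
for that one device, without touching the defaults for the others.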

First off, the reason we *don't* have a timeout parameter for this
operation is that the meaning is a bit ambiguous.  Other operations
are simpler.  For instance, if I tell it the timeout for running
a test should be 3600 (seconds, i.e. 1 hour), it's clear that if the test
takes more than an hour from the time it does lava-test run... to
the time it gets a result back and lava-test exits, it should time out.
For deployment though, what does the timeout mean? The time to download
the images? The time to create the image? The time to extract the
rootfs/bootfs tarballs? The time to push the boot image to the board?
The rootfs? (Userdata too, for Android?)  I suppose one thing we could do
is make it a *total* timeout.  So if we call the deploy action on
vexpress and give it a timeout of 5 hours, it first downloads the image
with the total timeout, calculates the time used so far, then for the next
step subtracts that from the total timeout, and so on.  The problem
here is obvious, I think: it should never take 5 hours to download
the image.  Even if it's not cached, it shouldn't take that long.  So we
could still wind up doing something insanely stupid there.
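
For reference, the *total* timeout idea could be sketched roughly like
this in Python. TimeBudget and run_steps are hypothetical names, and each
step is assumed to be a callable that accepts its timeout in seconds;
none of this is taken from the LAVA code.

```python
import time


class TimeBudget:
    """Track how much of one total timeout is left across several steps."""

    def __init__(self, total_seconds, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + total_seconds

    def remaining(self):
        # Seconds left before the overall deadline; raises once it is gone.
        left = self._deadline - self._clock()
        if left <= 0:
            raise TimeoutError("total timeout exhausted")
        return left


def run_steps(steps, total_seconds, clock=time.monotonic):
    """Run each step with whatever is left of the total timeout."""
    budget = TimeBudget(total_seconds, clock)
    for step in steps:
        # Every step is handed the full remainder, so the first step
        # (e.g. the download) may legally consume almost all of it.
        step(budget.remaining())
```

The comment in run_steps is exactly the weakness described above: with a
5 hour total, nothing stops the download alone from waiting 5 hours.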

I think a better option is to apply it *just* to the portion
where we write the image to the card.  That's the only part that's done
through pexpect, I think, so it's the only one where we can easily apply the
timeout anyway.  All deployments of image components would share the
timeout parameter, so we would only subtract the time spent on each
preceding part (boot.tgz, etc.).  Timeouts are a pain, but unfortunately
we're always dealing with some operations that *could* hang at an
inopportune time rather than fail with a proper error.
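
A minimal sketch of that variant, in Python, where only the write steps
are charged against the shared timeout. The function write_components and
its prepare/write callables are hypothetical stand-ins; in a real setup
write() would presumably pass the remaining seconds on as the timeout of
a pexpect expect() call.

```python
import time


def write_components(components, write_timeout, prepare, write,
                     clock=time.monotonic):
    """Write each image component to the card.

    Only time spent inside write() counts against write_timeout;
    prepare() (download, extraction, ...) is deliberately not charged.
    Returns the number of seconds left over.
    """
    remaining = write_timeout
    for name in components:
        prepare(name)
        if remaining <= 0:
            raise TimeoutError("write timeout exhausted before " + name)
        started = clock()
        # The component gets whatever earlier components left over.
        write(name, remaining)
        remaining -= clock() - started
    return remaining
```

A call such as write_components(["boot.tgz", "rootfs"], 3600, prepare,
write) would then give boot.tgz the full hour and rootfs whatever
boot.tgz did not use, no matter how long the downloads took.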

Thanks,
Paul Larson


--
Zygmunt Krynicki
Linaro Validation Team
