Thanks John, this looks like it strikes a good balance between reducing tegra load and getting false-positives in try runs.

Just for anybody following along though, it looks like bug 915465 (which deployed this change) is still in flux, so things may not have completely settled yet. See also https://bugzilla.mozilla.org/show_bug.cgi?id=915465#c8 for the new trychooser syntax to enable testing on Tegras (although from comment 10 it looks like it may not be working properly yet).

Cheers,
kats

On 13-09-12 01:11, John O'Duinn wrote:
hi kats (cross-posting to dev-b2g);

tl;dr: we think all is ok again, details below. To avoid this happening
again this week, we're changing tryserver to reduce the number of
Android-tests-run-on-tegra-by-default. If you specifically want tegra
testing on tryserver, you will need to state that when pushing to try.


Yes, load on try was unusually heavy today (the b2g workweek is in full
force). All other platforms, including our pool of panda Android4 test
boards were handling this heavy load just fine. However, our pool of
tegra boards is small (hard to get boards), had a higher-than-usual
percentage offline, and was not able to keep up. We were distracted by
an unrelated 64bit windows problem, but that's no excuse, we should have
detected this earlier. After this is all calm, we'll postmortem.

1) As of earlier this afternoon, 976 of the pending 1078 tegra jobs are
from try. This was no single "abuse of try", this was simply an
accumulation of a lot of pushes-to-try spread across the day.

2) We manually repoked every one of the offline tegras, and most are now
working correctly again. As of now, our tegra pool is back to a much
healthier size, as-good-as-we-can-hope-for-with-poolsize, and is
quickly chewing through the remaining backlog. At this time, we are down
to 240 pending jobs, and dropping fast.
http://builddata.pub.build.mozilla.org/reports/pending/pending.html
Details in https://bugzilla.mozilla.org/show_bug.cgi?id=915457 "Triage
tegras with no completed jobs within last 24 hours"

3) We cancelled all pending tegra test jobs on try that were waiting
over 6 hours (the longest was 10 hours). Note: we did not cancel panda
test jobs, and we did not cancel tegra test jobs on other branches. If
we cancelled a tegra test job on try that you still need run, please
let us know in #releng, and we'll sort it out.
Details in https://bugzilla.mozilla.org/show_bug.cgi?id=915481 "Stop try
tegra jobs pending for >6 hours"

4) We are changing the default when you push to try. Until now, by
default Try has generated Android builds and then run *all*
unittests+talos for Android2.2 (tegras) *and* Android4 (pandas). We are
now testing a change to the default as follows:

4a) Android builds: no change, still built by default
4b) Android pandas tests: no change, still run unittest and talos by default
4c) Android tegras tests: unittests would not be run by default.
Because of details in how TryServer works, talos would still be run on
tegras by default, in order to keep scheduling talos on pandas also.
Details in
https://bugzilla.mozilla.org/show_bug.cgi?id=915465 "Pushes to try
should not trigger tegra test jobs by default". Again, this is just
changing the default: anyone who wants Android2.2-specific unittests
run on tegras on try can still get that by specifying it explicitly when
pushing to try.

Note: this change is for default on try *only*. There is no change to
what any other (non-try) branches do for android testing on tegras,
those remain as-is.



As mhommey and joel discussed earlier in this thread, changing try
default to test on pandas, but not also run all tests on tegras, does
introduce a slim-but-non-zero risk of missing a problem that only a
tegra would have caught with that try push. Note that we are still
running tegra testing on non-try branches, as usual, so even if a
problem like this is missed on try, it will be caught the first time it
lands on any other branch that has Android coverage. After this
workweek, we can revisit whether this default setting needs to be reverted.


Let me know if you want any further info, ok?
John.
======
On 9/11/13 2:31 PM, Kartikaya Gupta wrote:
Earlier today the backlog on Android build jobs was on the order of
1300. It seems to be coming down a little now but for a while there I
was worried it was going to grow unboundedly. Try jobs from over 10
hours ago still have pending jobs - as I'm sure you all know, having a
10-hour turnaround on try pushes is something of a productivity killer.

I brought this up in #releng and one of the proposed solutions was to
try to tweak the prioritization of jobs between Try and Inbound a little
bit. I personally do like that Inbound jobs are prioritized above Try,
but perhaps they don't need to be prioritized quite so much. However,
changing this will affect a number of people, so it was suggested I
bring the discussion here to get other people's comments.

So, anybody have thoughts on a good way to solve this problem?

Cheers,
kats
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform