Hello Celso,

Celso Providelo [2015-05-26 10:29 -0300]:
> On Tue, May 26, 2015 at 3:30 AM, Martin Pitt <[email protected]> wrote:
> > However, in terms of prioritization this is by far not urgent. The
> > current numerous problems that we have with our CI autopkgtest cloud
> > infrastructure are far more important/urgent: missing support for all
> > architectures except amd64, a *lot* slower, no daily base images, they
> > don't dynamically scale, inefficient controller vs. testbed
> > allocation, not using ScalingStack, the layout of results in swift got
> > totally broken, we still use Jenkins in between, frequent failures to
> > start tests.
>
> I've just split this thread so we can organise and discuss your
> concerns about adt-cloud separately from the internet-access-for-tests
> endless epic.
Sure, sounds good!

Related to that, I already asked twice on IRC, but let's ask here
again: where does one report bugs against adt-nova?
https://bugs.launchpad.net/adt-cloud-worker does not exist (can we
perhaps enable that?), and it seems that this is unrelated to
lp:uci-engine, so reporting bugs there wouldn't work either?

> I think we should clarify what are actual *problems* (read
> bugs/regressions) and what are missing features and future work. Those
> are very different things with different priorities and I am quite
> certain we agree on this.

Right, of course. I didn't say that all of the above were regressions,
but we should keep in mind that the move to nova was done so that we
can actually see some *improvements* over the rather busted situation
that we had (and still have) with the bunch of static and manually
maintained testbed machines and Jenkins. The main advantage is that we
now have roughly twice the x86 bandwidth (the static machines now only
run i386 while nova runs amd64), but the problems with manual
maintenance, Jenkins, etc. still remain.
Some expansion on my quick list above:

* missing support for all architectures except amd64
  → As long as we have that, maintenance has actually become worse, not
  better; I'd say the importance of this is rather high, so it's kind
  of a regression now in terms of manpower.

* a *lot* slower
  → Regression (see below), but not very urgent.

* no daily base images
  → Regression; this probably contributes most to the speed decrease,
  and it also potentially makes tests more unstable. Low urgency right
  now, but this can quickly become medium if the additional overhead of
  dist-upgrading from an ancient base image and then rebooting for each
  and every test becomes unbearable.

* they don't dynamically scale
  → This seems to be a design problem; testbeds are meant to be
  allocated and dropped as needed. We shouldn't have to pre-create n
  controllers/workers, let them do nothing most of the time, and have
  large queues whenever gcc/glibc/glib2.0/etc. hit. Urgency: medium;
  not a regression, but also not quite a credible/useful cloud story
  either :-)

* inefficient controller vs. testbed allocation
  → Probably part of the previous point -- currently (I think) we have
  20 controller nodes which just waste cloud resources and do mostly
  nothing; a controller can easily drive many dozen parallel adt-run
  runs, as it essentially just does a bunch of "nova boot" commands
  and shovels logs from testbeds into swift. Urgency: medium (see
  above).

* not using ScalingStack
  → This is probably the blocker for full arch support, so "high".

* the layout of results in swift got totally broken
  → Not sure where that came from; the data structure was designed
  carefully between the Debian CI team, Vincent Ladeuil and me in
  https://wiki.debian.org/debci/DistributedSpec so that we can drop
  all the hideous mechanics on snakefruit and tachash and make britney
  directly poll swift (or perhaps some kind of mirror of it) for
  incoming results.
  uci-engine got that right, and results were in a swift
  (pseudo-)directory like

      /trusty/amd64/libp/libpng/20140321_130412_adtminion7/log

  but now the results look like

      /adt-0daac672-9baa-4e1c-a4f3-509b1515c507/results.tgz

  which is totally unpredictable, not sortable, and useless for
  efficient evaluation due to the single .tgz. I guess this is because
  it got reimplemented from scratch without considering the spec, as
  Celso took this over from Vincent, and during that a lot of the
  current state/knowledge got lost? Also not a regression, but medium
  in the sense that it keeps blocking moving britney from Jenkins to
  swift.

* we still use Jenkins in between
  → Same issue -- we originally designed all this to drop Jenkins from
  the picture, and now we keep building even more jobs on it.

* frequent failures to start tests
  → These are usually of high urgency, but they are being dealt with
  as the daily "cihelp:" churn. Thanks to Siva for your timely help
  with those!

So, as you see, most of these aren't regressions, but from my POV they
are almost all necessary to actually improve upon the situation that
we had before adt-nova. Also, from my POV pretty much everything above
by far outranks "disable network access from tests", as that's a lot
of work for little benefit.

> I am particularly interested in your point about adt-cloud being a
> 'lot' slower that qemu-VMs, specially in backing it up with (rough)
> data we already collect in jenkins:
>
> http://d-jenkins.ubuntu-ci:8080/label/adt&&i386/load-statistics?type=hour
> http://d-jenkins.ubuntu-ci:8080/label/adt&&amd64/load-statistics?type=hour

Consider a recent example:

  http://d-jenkins.ubuntu-ci:8080/job/wily-adt-gem2deb/8/ARCH=amd64,label=adt/ (25 minutes)
  http://d-jenkins.ubuntu-ci:8080/job/wily-adt-gem2deb/8/ARCH=i386,label=adt/ (1:39 minutes)

The log doesn't contain the nova setup part. From my experience, "nova
boot" is a matter of < 1 minute (vs. starting a local VM, which takes
< 10 s), so that part can't explain most of the difference. I guess
that the extra 23 minutes are due to dist-upgrading a too old base
image, but this deserves more detailed logging. If we had an actually
elastic solution right now, this wouldn't matter that much, but with
the time of every test quadrupling (on average) together with a static
limit of 20 parallel tests, we get a noticeable throughput bottleneck.

> Can we schedule a hangout/meeting to discuss these in details and
> establish a common view about the current solution status ?

Sure! Just pick a time between 05:00 and 17:00 UTC, but preferably not
today any more, as my voice is still rather rough/weak from a cold.

Thanks,

Martin

-- 
Martin Pitt                        | http://www.piware.de
Ubuntu Developer (www.ubuntu.com)  | Debian Developer (www.debian.org)
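To make the swift layout point above concrete, here is a small sketch
of how the spec-style result paths compose and why they sort usefully.
The component rules are my reading of the single example path
/trusty/amd64/libp/libpng/20140321_130412_adtminion7/log, not an
authoritative rendering of the DistributedSpec, so treat the helper
names and the prefix rule as illustrative assumptions:

```python
# Sketch only: path components inferred from the example
# /trusty/amd64/libp/libpng/20140321_130412_adtminion7/log; see
# https://wiki.debian.org/debci/DistributedSpec for the real rules.

def pool_prefix(source):
    # Debian pool-style prefix: "lib*" packages get a four-character
    # directory ("libp" for libpng), everything else the first letter.
    return source[:4] if source.startswith("lib") else source[0]

def result_path(release, arch, source, stamp, worker, obj="log"):
    # Predictable layout: a consumer can construct this path without
    # ever having to list or download anything first.
    return "/%s/%s/%s/%s/%s_%s/%s" % (
        release, arch, pool_prefix(source), source, stamp, worker, obj)

path = result_path("trusty", "amd64", "libpng",
                   "20140321_130412", "adtminion7")
# -> /trusty/amd64/libp/libpng/20140321_130412_adtminion7/log

# Because run directories start with a timestamp, a plain string sort
# of a container listing yields chronological order, so a consumer
# (e.g. britney) can pick the newest run by name alone.
runs = ["20140322_083000_adtminion2", "20140321_130412_adtminion7"]
latest = sorted(runs)[-1]  # -> 20140322_083000_adtminion2
```

An opaque per-run UUID container with a single results.tgz offers
neither property: the name cannot be predicted and the contents cannot
be inspected without fetching and unpacking the whole archive.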
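For what it's worth, the throughput effect of "four times slower with a
fixed pool of 20 slots" is easy to put into numbers. Only the 4x factor
and the 20-slot limit come from the discussion above; the 5-minute
local-VM baseline is an assumed illustrative figure, not measured data:

```python
# Back-of-the-envelope throughput model; the 5-minute local average is
# an assumption for illustration, not a measurement.
WORKERS = 20                 # static limit of parallel tests
LOCAL_MIN = 5.0              # assumed average test time on a local VM
CLOUD_MIN = LOCAL_MIN * 4    # "quadrupling (on average)"

def tests_per_hour(workers, minutes_per_test):
    # Each worker completes 60/minutes_per_test tests per hour.
    return workers * 60 / minutes_per_test

before = tests_per_hour(WORKERS, LOCAL_MIN)   # 240.0 tests/hour
after = tests_per_hour(WORKERS, CLOUD_MIN)    #  60.0 tests/hour
```

With elastic allocation the slowdown would mostly cost latency, not
throughput, because the worker count could grow with the queue.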
-- 
Mailing list: https://launchpad.net/~canonical-ci-engineering
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~canonical-ci-engineering
More help   : https://help.launchpad.net/ListHelp

