On 08.12.2011 02:22, Michael Hudson-Doyle wrote:
On Wed, 7 Dec 2011 17:01:25 +0100, Zygmunt
Krynicki<zygmunt.kryni...@linaro.org> wrote:
Hi, sorry for the topic, I wanted to catch your attention.
This is a quick brain dump based on my own observations/battle with
master images last week.
2) Running code via serial on the master image is a mess. It is very
fragile.
Is it really? It's a bit of a pain, but it seems this part actually
works ok for us. It also has the advantage that all the logs are in one
place.
<<CONMUX DISCONNECTED>>
@#!@$ !@ serial output, without any sensible way to break it down (and I
don't count matching "# echo LAVA DISPATCHER: now doing foo" as sensible).
A few random reasons for not using serial the way we do it today:
1) Random console messages break our system of tracking state and
invoking commands.
2) We could put pppd on the serial line to get early networking for our
agent, we could assume we can download stuff in the master image without
ethernet (not that it would be very useful at that speed). We could use
TCP to have a networked API on the master image.
3) Serial console slows down stuff a LOT. Check how fast you can boot
without serial console (hint, much faster). We can still keep all the
logs around by other means.
Also, I don't see anything in your proposals that would get us away from
having to talk to the bootloader over the serial line. Also getting the
boot log for a failed boot seems somehow essential and I don't know how
we can do that without a serial connection (this is different from
running commands, though).
I'm not saying "we should not use the serial line". I'm saying "we
should not use the serial line for everything in the most crude form
possible".
For the time being (until I patch u-boot to talk to LAVA) the boot
loader will stay as is. For the vast amount of wall clock time spent
after the boot loader we can do smarter things without waiting for the
sun to eclipse and origen networking to work.
You can send a series of commands to a device and get return codes back
reliably, without parsing. You can do structured logging (where the
device keeps logs for each command it receives), and it will never be
confused by a funny output pattern. We can ask the device to reboot while
other tasks are hanging. We can download stuff without putting wget on
the board and piping it to tar, for crying out loud.
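
To make the "return codes without parsing" point concrete, here is a
minimal sketch of what the command-running part of an agent could look
like. It is not existing LAVA code: CommandResult, run_command and
run_series are names made up for illustration, and a real agent would
still need some transport (TCP, USB, ...) in front of this.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CommandResult:
    """Outcome of one command, kept separate from every other command."""
    argv: list
    returncode: int
    stdout: str
    stderr: str


def run_command(argv, timeout=60):
    """Run one command and capture its exit status and output.

    The return code comes from the OS, not from matching console text,
    so stray output cannot be mistaken for a prompt or a status marker.
    """
    proc = subprocess.run(argv, capture_output=True, text=True,
                          timeout=timeout)
    return CommandResult(list(argv), proc.returncode, proc.stdout,
                         proc.stderr)


def run_series(commands):
    """Run commands in order and return one CommandResult per command."""
    return [run_command(argv) for argv in commands]


if __name__ == "__main__":
    series = [[sys.executable, "-c", "print('hello')"],
              [sys.executable, "-c", "raise SystemExit(3)"]]
    for result in run_series(series):
        print(result.returncode, repr(result.stdout))
```

Each result carries its own output, so the log for one command never has
to be cut out of a shared console stream.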
We need an agent on the board instead of a random master image+serial
shell. The agent will expose board identity, capabilities and standard
APIs to LAVA (notably the dispatcher). The same API, if done
sensibly, will work for software emulators and hardware boards. The agent
API for a software emulator can do different things. The dispatcher should
be based on the agent API instead of ramming the serial line.
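
A rough sketch of what "the same API for emulators and hardware" might
mean in code. None of this exists in LAVA today: BoardAgent,
EmulatorAgent, describe and the method names are hypothetical, and the
real set of calls would still need to be worked out.

```python
from abc import ABC, abstractmethod


class BoardAgent(ABC):
    """Hypothetical API the dispatcher would talk to, whatever the target."""

    @abstractmethod
    def identity(self):
        """Return the unique board identity as a string."""

    @abstractmethod
    def capabilities(self):
        """Return a dict describing what this target can do."""

    @abstractmethod
    def reboot(self):
        """Restart the target."""


class EmulatorAgent(BoardAgent):
    """Software emulator: same calls, different things behind them."""

    def __init__(self, name):
        self.name = name
        self.reboots = 0

    def identity(self):
        return self.name

    def capabilities(self):
        return {"kind": "emulator", "power_control": False}

    def reboot(self):
        # A real implementation would restart the emulator process;
        # the sketch only counts the requests.
        self.reboots += 1


def describe(agent):
    """Dispatcher-side code that only relies on the shared API."""
    return "%s (%s)" % (agent.identity(), agent.capabilities()["kind"])
```

A hardware-backed class would implement the same three methods, and
describe() would not need to change.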
Well, I just rewrote chunks of the dispatcher to work for software
emulators, albeit taking a different approach. Not sure the approach
you propose is really any different, although perhaps it would be easier
to distribute to different machines.
I don't want to belittle your work. What I'm doing here (apart from
hand waving and shouting) is discussing how it should work to be more
reliable and future proof. I'm sure that implementing this will take a
lot of time in practice and that dispatcher maintenance is as relevant
as it was yesterday. I need to dig deeper into the current dispatcher code
to be able to judge this. Still, I think that the dispatcher is orthogonal.
You can build the dispatcher on top of what it currently does or on top
of a board API object. Both code variants can coexist for a long while.
3) The master image, as we know it today, should be booted remotely.
The boot loader can stay on the board until we can push it over USB.
The only thing that absolutely has to stay on the card is the lava
board identity file which would be generated from the web UI. There is
no reason to keep rootfs/kernel/initrd there. This means that a single
small card can fit all tests as well. It also means we can reset the
master image (as currently it is writeable by the board and can be
corrupted) before booting to ensure consistent behaviour. I did some
work on that and I managed to boot panda over NFS. Ideally I want to
boot over nbd (network block device), which is much faster, and with a
proper "master image" init script we can expose a single read only
network block device to _all_ the boards.
This sounds good.
4) With an agent on each board and an identity file on the SD card, LAVA
will know if cloning happened. We could do dynamic board detection (unplug
the board -> it goes away, plug it back -> it shows up). We could move
a board from system to system and have 0config transitions.
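
One possible way to spot cloning, sketched under the assumption that
every agent reports the identity from its card when it connects. The
IdentityRegistry class and its methods are invented for this example;
the rule it applies (one identity, one live connection) is only one of
several policies that could work.

```python
class IdentityRegistry:
    """Track which board identities are currently connected."""

    def __init__(self):
        self._online = {}  # identity -> connection id

    def connect(self, identity, connection):
        """Register an agent.

        Returns False when the identity is already online on another
        connection, which suggests the identity file was cloned.
        """
        current = self._online.get(identity)
        if current is not None and current != connection:
            return False
        self._online[identity] = connection
        return True

    def disconnect(self, identity, connection):
        """Forget a board, e.g. after it was unplugged."""
        if self._online.get(identity) == connection:
            del self._online[identity]

    def online(self):
        """Return the identities that are currently connected."""
        return sorted(self._online)
```

Unplugging maps to disconnect() and plugging back in maps to connect(),
so a board that moves to another port simply shows up again under the
same identity.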
I'm not sure about this though. How do you tell the difference between
the agent going away because booting into the test image failed and it
being unplugged at a particular time?
Good point. The state of a device is a little bit more complicated than
I presented. I wanted to point out that we could do discovery in a
reliable way, something that we currently cannot do (and this prevents
us from having foolproof provisioning of additional (or very first)
devices).
For actual state we'd still have a few "in flux" moments like when doing
a power cycle, transitioning from boot loader to kernel+userspace
context etc.
As for totally unplugging devices: if you require a USB connection then
you know your device went away ;-) That's what most people will do (one
device + laptop) and that's what we'll eventually have to do (no
dedicated serial / ethernet on devices, everything muxed through USB).
Snowball is just a very simple example of that.
5) The dispatcher should drop all configuration files. Sure it made sense
12 months ago when the idea was to run it standalone. Now all of that
configuration should be in the database and should be provided by the
scheduler to the dispatcher as a big serialized argument (or a file
descriptor or a temporary file on disk). Setting up the dispatcher for
a new instance is a pain and unless you can copy stuff from the
validation server and ask everyone around for help it's very hard to
get right.
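
As an illustration of the "temporary file on disk" variant, here is a
small sketch of the hand-off. The function names and the keys in the
example config are hypothetical; the point is only that the dispatcher
reads one serialized blob instead of its own config files.

```python
import json
import os
import tempfile


def write_job_config(config):
    """Scheduler side: serialize the config to a temporary file.

    Returns the path, which would be passed to the dispatcher as an
    argument. The caller is responsible for removing the file.
    """
    fd, path = tempfile.mkstemp(prefix="lava-job-", suffix=".json")
    with os.fdopen(fd, "w") as stream:
        json.dump(config, stream)
    return path


def load_job_config(path):
    """Dispatcher side: read everything it needs from that one file."""
    with open(path) as stream:
        return json.load(stream)


if __name__ == "__main__":
    # Example values only; the real schema would come from the database.
    path = write_job_config({"device_type": "panda", "hostname": "panda01"})
    try:
        print(load_job_config(path))
    finally:
        os.remove(path)
```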
If you're using a type of board that has support 'upstream' it's
actually pretty easy, you basically just need to create a file per
device that indicates which type it is.
That's good.
Apart from the fact that it's all a bit all over the place, I don't see
how setting up things in the django admin interface is actually easier
than setting it up in the filesystem.
It is not easier, except that you can do the UI in Django, and then
touching the filesystem directly is not an option. I want to get to a point
where I can click through some wizards to get my panda working without
having to open a console. With a few extra services the system will even
_tell_ you that you've got a panda plugged in that needs provisioning.
Having said all of that, I agree with this goal :)
If master images could be constructed programmatically and with an
agent on each "master image" LAVA would just get that configuration
for free.
6) We should drop conmux. As in the lab we already have TCP/IP sockets
for the serial lines, we could just provide my example serial->tcp
script as a lava-serial service that people with directly attached
boards would use. We could get a similar lava-power service if that
would make sense. The lava-serial service could be started as an
instance for all USB/SERIAL adapters plugged in if we really wanted
(hello upstart!). The lava-power service would be custom and would
require some config, but it is very rarely needed. Only the lab and I have
something like that. Again it should be instance based IMHO so I can
say: 'start lava-power CONF=/etc/lava-power/magic-hack.conf' and see
LAVA know about a power service. One could then say that a particular
board uses particular serial and power services.
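
A serial->tcp bridge of this kind can be quite small. The following is a
stdlib-only sketch of the idea, not the example script referred to
above; the function names are made up. It assumes a Unix system, and it
does not set the baud rate or other line settings (that would need
termios or pyserial).

```python
import os
import select
import socket


def write_all(fd, data):
    """Write all of data to fd, coping with partial writes."""
    while data:
        data = data[os.write(fd, data):]


def bridge(fd, conn, bufsize=4096):
    """Copy bytes both ways between fd and a socket until either closes."""
    while True:
        readable, _, _ = select.select([fd, conn], [], [])
        if fd in readable:
            data = os.read(fd, bufsize)
            if not data:
                return
            conn.sendall(data)
        if conn in readable:
            data = conn.recv(bufsize)
            if not data:
                return
            write_all(fd, data)


def serve(device, port):
    """Expose one serial device on a TCP port, one client at a time."""
    fd = os.open(device, os.O_RDWR | os.O_NOCTTY)
    try:
        with socket.create_server(("", port)) as server:
            while True:
                conn, _ = server.accept()
                with conn:
                    bridge(fd, conn)
    finally:
        os.close(fd)
```

Something like serve("/dev/ttyUSB0", 4000) would be started once per
adapter (device path and port are example values), which fits the "one
instance per adapter" idea.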
I agree here. conmux is useful, but we don't need the 'mux' part at
all, and I find myself restarting the daemon all the damn time just to
get it working again.
I had the same experience during my (very brief) contact with this.
Thanks
ZK