Re: Master images are a mess

Zygmunt Krynicki Wed, 07 Dec 2011 19:08:22 -0800

On 08.12.2011 02:22, Michael Hudson-Doyle wrote:

On Wed, 7 Dec 2011 17:01:25 +0100, Zygmunt Krynicki <zygmunt.kryni...@linaro.org> wrote:

Hi, sorry for the topic, I wanted to catch your attention.


This is a quick brain dump based on my own observations/battle with
master images last week.

2) Running code via serial on the master image is a mess. It is very
fragile.


Is it really?  It's a bit of a pain, but it seems this part actually
works ok for us.  It also has the advantage that all the logs are in one
place.


<<CONMUX DISCONNECTED>>

@#!@$ !@ serial output, without any sensible way to break it down (and I
don't count matching "# echo LAVA DISPATCHER: now doing foo" as
sensible).


A few random reasons for not using serial the way we do it today:

1) Random console messages break our system of tracking state and
invoking commands.

2) We could put pppd on the serial line to get early networking for our
agent, so we could assume we can download stuff in the master image
without ethernet (not that it would be much use at that speed). We
could use TCP to have a networked API on the master image.

3) Serial console slows down stuff a LOT. Check how fast you can boot
without serial console (hint: much faster). We can still keep all the
logs around by other means.

Also, I don't see anything in your proposals that would get us away from
having to talk to the bootloader over the serial line.  Also getting the
boot log for a failed boot seems somehow essential and I don't know how
we can do that without a serial connection (this is different from
running commands, though).

I'm not saying "we should not use the serial line". I'm saying "we
should not use the serial line for everything in the most crude form
possible".

For the time being (until I patch u-boot to talk to LAVA) the
bootloader will stay as is. For the vast amount of wall clock time spent
after the boot loader we can do smarter things without waiting for the
sun to eclipse and origen networking to work.

You can send a series of commands to a device. Get return codes back,
without parsing, reliably. You can do structured logging (where the
device keeps logs for each command it receives), and it will never be
confused by a funny output pattern. We can ask the device to reboot while
other tasks are hanging. We can download stuff without putting wget on
the board and piping it to tar for crying out loud.
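To make the "return codes without parsing" and "structured logging"
points concrete, here is a minimal sketch of what the board-side half of
such an agent might do. The names run_command and run_series and the
record layout are made up for illustration; no such API exists in LAVA
today, and the transport between dispatcher and agent is left out
entirely.

```python
import json
import subprocess
import sys
import time


def run_command(argv, log):
    """Run one command and append a structured record to `log`.

    Hypothetical helper: the record fields are an assumption, not an
    existing LAVA format.
    """
    started = time.time()
    proc = subprocess.run(argv, capture_output=True, text=True)
    record = {
        "argv": list(argv),
        "returncode": proc.returncode,  # no console scraping needed
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "duration": time.time() - started,
    }
    log.append(record)
    return record


def run_series(commands):
    """Run a series of commands, keeping one log record per command."""
    log = []
    for argv in commands:
        run_command(argv, log)
    return log


if __name__ == "__main__":
    # Two stand-in commands: one succeeds, one exits with status 3.
    results = run_series([
        [sys.executable, "-c", "print('hello from the board')"],
        [sys.executable, "-c", "import sys; sys.exit(3)"],
    ])
    # The whole log is plain data, so it can be shipped back as JSON.
    print(json.dumps(results, indent=2))
```

Each command keeps its own output and exit status, so a stray kernel
message on the console cannot be mistaken for a command result.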

We need an agent on the board instead of a random master image+serial
shell. The agent will expose board identity, capabilities and standard
APIs to LAVA (notably the dispatcher).  The same API, if done
sensibly, will work for software emulators and hardware boards. Agent
API for a software emulator can do different things. Dispatcher should
be based on agent API instead of ramming the serial line.


Well, I just rewrote chunks of the dispatcher to work for software
emulators, albeit taking a different approach.  Not sure the approach
you propose is really any different, although perhaps it would be easier
to distribute to different machines.

I don't want to deprecate your work. What I'm doing here (apart from
hand waving and shouting) is discussing how it should work to be more
reliable and future proof. I'm sure that implementing this will take a
lot of time in practice and that dispatcher maintenance is as relevant
as it was yesterday. I need to dig deeper into the current dispatcher
code to be able to judge this. Still, I think that the dispatcher is
orthogonal. You can build the dispatcher on top of what it currently
does or on top of a board API object. Both code variants can coexist for
a long while.

3) The master image, as we know it today, should be booting remotely.
The boot loader can stay on the board until we can push it over USB.
The only thing that absolutely has to stay in the card is the lava
board identity file which would be generated from the web UI. There is
no reason to keep rootfs/kernel/initrd there. This means that a single
small card can fit all tests as well. It also means we can reset the
master image (as currently it is writeable by the board and can be
corrupted) before booting to ensure consistent behaviour. I did some
work on that and I managed to boot panda over NFS. Ideally I want to
boot over nbd (netblock device) which is much faster and with proper
"master image" init script we can expose a single read only net block
device to _all_ the boards.


This sounds good.

4) With agent on each board, identity file on the SD card LAVA will
know if cloning happened. We could do dynamic board detection (unplug
the board ->  it goes away, plug it back ->  it shows up). We could move
a board from system to system and have 0config transitions.
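One way the clone detection in point 4 could work, as a rough sketch:
the agent reports the identity from the SD card together with something
tied to the physical board. The hardware serial used here is purely an
assumption (the mail does not say what the agent would report), and
BoardRegistry is a hypothetical name.

```python
class BoardRegistry:
    """Track which physical board each identity file was last seen on.

    Hypothetical sketch: assumes the agent can report a hardware serial
    alongside the identity read from the SD card.
    """

    def __init__(self):
        self._seen = {}  # identity -> hardware serial

    def announce(self, identity, hardware_serial):
        """Classify an agent announcement as new, known or cloned."""
        known = self._seen.get(identity)
        if known is None:
            # First time this identity shows up: remember the board.
            self._seen[identity] = hardware_serial
            return "new"
        if known == hardware_serial:
            return "known"
        # Same identity file, different board: the card was copied
        # (or moved, which an operator would have to confirm).
        return "cloned"


if __name__ == "__main__":
    registry = BoardRegistry()
    print(registry.announce("panda01", "serial-A"))
    print(registry.announce("panda01", "serial-A"))
    print(registry.announce("panda01", "serial-B"))
```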


I'm not sure about this though.  How do you tell the difference between
the agent going away because booting into the test image failed and it
being unplugged at a particular time?

Good point. The state of a device is a little bit more complicated than
I presented. I wanted to point out that we could do discovery in a
reliable way, something that we currently cannot do (and this prevents
us from having foolproof provisioning of additional (or very first)
devices).

For actual state we'd still have a few "in flux" moments like when doing
a power cycle, transitioning from boot loader to kernel+userspace
context etc.

As for totally unplugging devices: if you require a USB connection then
you know your device went away ;-) That's what most people will do (one
device + laptop) and that's what we'll eventually have to do (no
dedicated serial / ethernet on devices, everything muxed through USB).
Snowball is just a very simple example of that.

5) Dispatcher should drop all configuration files. Sure it made sense
12 months ago when the idea was to run it standalone. Now all of that
configuration should be in the database and should be provided by the
scheduler to the dispatcher as a big serialized argument (or a file
descriptor or a temporary file on disk). Setting up the dispatcher for
a new instance is a pain and unless you can copy stuff from the
validation server and ask everyone around for help it's very hard to
get right.
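As an illustration of point 5, a minimal sketch of the "temporary file
on disk" variant: the scheduler serializes the configuration and the
dispatcher loads it, with no config files of its own. The function names
and the "device" / "device_type" keys are invented for the example; the
real set of settings would come from whatever the database holds.

```python
import json
import os
import tempfile

# Hypothetical minimum the dispatcher needs; not an existing LAVA schema.
REQUIRED_KEYS = {"device", "device_type"}


def write_job_config(config):
    """Scheduler side: serialize the config to a temporary file."""
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as stream:
        json.dump(config, stream)
    return path


def load_job_config(path):
    """Dispatcher side: read the config and fail early if incomplete."""
    with open(path) as stream:
        config = json.load(stream)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(
            "missing config keys: %s" % ", ".join(sorted(missing)))
    return config


if __name__ == "__main__":
    path = write_job_config({"device": "panda01", "device_type": "panda"})
    try:
        print(load_job_config(path))
    finally:
        os.remove(path)
```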


If you're using a type of board that has support 'upstream' it's
actually pretty easy, you basically just need to create a file per
device that indicates which type it is.


That's good.

Apart from the fact that it's all a bit all over the place, I don't see
how setting up things in the django admin interface is actually easier
than setting it up in the filesystem.

It is not easier, except that you can do the UI in Django, and then
touching the filesystem directly is not an option. I want to get to a
point where I can click through some wizards to get my panda working
without having to open a console. With a few extra services the system
will even _tell_ you that you've got a panda plugged in that needs
provisioning.

Having said all of that, I agree with this goal :)

If master images could be constructed programmatically and with an
agent on each "master image" lava would just get that configuration
for free.

6) We should drop conmux. As in the lab we already have TCP/IP sockets
for the serial lines we could just provide my example serial->tcp
script as lava-serial service that people with directly attached
boards would use. We could get a similar lava-power service if that
would make sense. The lava-serial service could be started as an
instance for all USB/SERIAL adapters plugged in if we really wanted
(hello upstart!). The lava-power service would be custom and would
require some config but it is very rare. Only lab and me have
something like that. Again it should be instance based IMHO so I can
say: 'start lava-power CONF=/etc/lava-power/magic-hack.conf' and see
LAVA know about a power service. One could then say that a particular
board uses a particular serial and power services.


I agree here.  conmux is useful, but we don't need the 'mux' part at
all, and I find myself restarting the daemon all the damn time just to
get it working again.


I had the same experience during my (very brief) contact with this.

Thanks
ZK

_______________________________________________
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev
