Antonio Terceiro <[email protected]> writes:

> Hello all,
>
> On Fri, Apr 05, 2013 at 03:40:03PM +1300, Michael Hudson-Doyle wrote:
>> [beware of the x-post!]
>> [resend from correct email address...]
>> 
>> Hi all,
>> 
>> As discussed briefly with some of you, I've been hacking on some scripts
>> to allow us to run some tests / benchmarks that make use of more than
>> one calxeda node before we get proper support in LAVA.  The script is
>> here:
>> 
>> http://bazaar.launchpad.net/~mwhudson/+junk/highbank-bench-scripts/view/head:/fakedispatcher.py
>> 
>> but it's pretty terrible code, you probably don't want to look at it.
>
> terrible code, but still pretty cool :)

It's a bit less grotty now than when I wrote the email :-)

>> More interesting might be the test code branch:
>> 
>> http://bazaar.launchpad.net/~mwhudson/+junk/highbank-bench-1/files
>> 
>> If it's not clear, the idea here is that:
>> 
>>  1) devices.yaml file defines roles for nodes and says how many of each
>>     role you want,
>>  2) the test code branch is copied to each node,
>>  3) the run-$rolename.sh script is run on each node,
>>  4) finally the contents of the /lava/results directory on each node are
>>     copied back to the system the tests were run from.
>> 
>> Coordination between nodes is done via lava-wait-for and lava-send shell
>> scripts as proposed in the connect session.
>> 
>> fakedispatcher is invoked with the URL of the test code branch, e.g.:
>> 
>> python fakedispatcher.py lp:~mwhudson/+junk/highbank-bench-1
>
> Great job! It's very cool that we have an interim solution for you guys
> to go ahead with your tests while the LAVA implementation is not ready.
>
>> Some notes:
>> 
>> 1) I hope that using an "API" like that proposed in the connect session
>>    will let us figure out if it's actually a useful api.
>
> It would indeed be really useful to have feedback on that API, the
> sooner the better.

So this might be an interesting story already.  For the 'finding other
hosts' stuff, I've been using the "lava-group >> /etc/hosts" idea and
it's been working great.
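
In case it's not obvious why that works: lava-group just prints
hosts-file-style "address name" lines, one per node, so appending its
output to /etc/hosts is all you need.  The addresses and names below are
made up, but that's the shape:

    $ lava-group        # output invented for illustration
    10.1.1.101 client01
    10.1.1.102 client02
    10.1.1.103 server01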

I've written and used three versions of the API now, each layer using
the previous one.

The first was more-or-less what we came up with for the connect
session (although I hardcoded that run:steps: would be
./run-`lava-role`.sh) -- create scripts on the target called lava-send
and lava-wait-for that send and wait for events (globally named across
the whole test).  This worked, but it gets awkward when you want to
change, say, the number of client nodes you have running httperf or
whatever.  I had things like this at one point:

    lava-send client-ready-$(lava-self)
    lava-wait-for server-ready
    lava-wait-for client-ready-$(lava-self | tr 12 21)

(from
http://bazaar.launchpad.net/~mwhudson/+junk/highbank-bench-1/view/4/run-client.sh)
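
(the `tr 12 21` bit is each client waiting for the *other* client's
event).  The same per-node naming infects the server side too --
something along these lines, which is a reconstruction rather than the
actual run-server.sh:

    # sketch only, not the real run-server.sh: every client node and its
    # events have to be listed by name
    apt-get install -y apache2
    lava-send server-ready
    lava-wait-for client-done-client01
    lava-wait-for client-done-client02

so adding a third client means editing scripts on both sides.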

So the next idea was to realize that mostly what you want is global
synchronization between all the nodes -- there is a setup phase, where
you install and configure packages, then all nodes synchronize and
enter the exercise phase[1], then synchronize again.  So I wrote a
lava-sync script that implements such a global barrier (each node
calls lava-send ${EVENT_NAME}-$(lava-self), and server side code waits
for ${EVENT_NAME}-${NODE_NAME} from all nodes).  This results in code
like:

    apt-get install -y apache2-utils
    lava-group >> /etc/hosts
    lava-sync ready
    ab -n 1000 http://server01/image.jpg 2>&1 | tee /lava/results/ab-output.txt
    lava-sync done

which is much nicer, and doesn't have to change as you dial the number
of load generation nodes up and down.
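
For reference, the node side of lava-sync is basically trivial;
modulo the exact name of the completion event (a detail I'm glossing
over here) it's just:

    #!/bin/sh
    # lava-sync <name>, sketch only: announce that this node reached the
    # barrier, then block until the dispatcher says everyone has
    lava-send "$1-$(lava-self)"
    lava-wait-for "$1-complete"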

As you increase the number of synchronization points though (for example
if you want to run several test cases, each with their own setup code),
a limitation of this approach becomes clear: you have to be consistent
in how you name the events!  If you write lava-sync test-1-setup-done
in one file and lava-sync setup-test-1-done in another, the test just
hangs[2].

So the next thing I did was to implement a way to write the setup,
execution and cleanup steps per test case and role in the YAML and
"compile" it into scripts with appropriate synchronization events
inserted (and also some other bits to allow the test results to be
turned into a LAVA bundle).  The examples get a bit lengthy, but as an
illustration, this:

    http://paste.ubuntu.com/5696781/

gets turned into this:

    http://paste.ubuntu.com/5696782/
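
To give a feel for it without clicking through: the generated
run-$role.sh ends up shaped something like this (names and commands
invented here, not copied from the paste):

    # fragment of a generated run-client.sh -- the "compiler" wraps each
    # phase of each test case in a sync event (all names invented)
    lava-sync setup-httperf-test-start
    apt-get install -y httperf
    lava-sync setup-httperf-test-done
    httperf --server server01 --uri /image.jpg --num-conns 1000 \
        > /lava/results/httperf-client.txt 2>&1
    lava-sync run-httperf-test-done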

The problem *this* has is that you have to be super explicit in the
YAML: it would be nice to be able to programmatically work through the
perf-client/no-perf-client x perf-server/no-perf-server (x
single-threaded-client/multi-threaded-client) combinations somehow.  I
don't have any good ideas what to do about this though (I guess we could
allow the testdef to provide code to produce the testcases data
structure...).
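
What I mean by "provide code" is essentially just expanding the cross
product rather than spelling each test case out by hand -- purely
illustrative:

    # purely illustrative: generate one test case name per combination
    for client in perf-client no-perf-client; do
        for server in perf-server no-perf-server; do
            echo "testcase: httperf-${client}-${server}"
        done
    done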

Some other API observations:

1) There is definite tension between defining the roles and devices
   in the job file vs the testdef.  There is a tight binding between the
   role names and the test code for sure, but maaaaayyyybeee being able
   to vary the device counts in the job file might be useful, rather
   than having to commit a change to the testdef repo (otherwise you
   can't really run tests with different device counts in parallel...).
   Not sure what to do about this.

2) Putting the node names in /etc/hosts works well when the number of
   nodes issuing connections varies (e.g. varying the number of load
   generation nodes) but we're probably going to want something
   different when varying the number of nodes being connected to
   (e.g. varying the number of origin servers in a haproxy test).
   Something as simple as lava-nodes --role=origin is probably enough
   though.
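
I'm imagining something like this in, say, the haproxy role's run
script (lava-nodes doesn't exist yet, this is just how I'd picture
using it):

    # imagined usage of the proposed lava-nodes -- not a real command yet
    for origin in $(lava-nodes --role=origin); do
        echo "    server ${origin} ${origin}:80 check" >> /etc/haproxy/haproxy.cfg
    done
    service haproxy restart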

>> 2) fakedispatcher is still pretty terrible in many ways (e.g. it has a
>>    hardcoded list of (currently just 2) calxeda nodes to use), and
>>    either gives obscure errors or just hangs when things go wrong.
>> 
>> 3) Despite 2), it does actually work.  So I'm pretty happy about that,
>>    and would like to talk to all of you about how to write your test
>>    cases in a form that works with fakedispatcher :)
>
> It's worth mentioning that:
>
> - porting test cases written for fakedispatcher to LAVA proper should
>   require very low effort.

Yeah, I hope so.

> - having such test cases already written is going to help the LAVA
>   implementation a lot in terms of both delivery time and quality.

I wonder if you want to rewrite the dispatcher to use Twisted throughout
:-)

>   (and we, the LAVA team, might do the necessary porting ourselves)
>   hint, hint ;-)

Heh, wouldn't complain!

Cheers,
mwh

[1] Oh look: http://xunitpatterns.com/Four%20Phase%20Test.html -- in our
    case verification happens on the host and teardown is basically
    "throw the OS image away"

[2] I guess one could detect this sort of impossible situation and crap
    out rather than hang, but still.
