Trying again... hopefully this time NOT hitting this nasty Chrome bug:
http://code.google.com/p/chromium/issues/detail?id=102407

On Wed, Jan 4, 2012 at 5:11 PM, Dawid Weiss <[email protected]> wrote:

> I forgot how nasty your beast computer is... 20 slaves?! Remind me how
> many actual (real) cores do you have?

Beast has two 6-core CPUs (x5680 Xeons), so 12 real cores (24 with
hyperthreading).

> Did you experiment with different slave numbers? I ask because I
> noticed that:
>
> 1) it makes little sense to run cpu-intense tests on hyper-cores,
> doesn't yield much if anything,
> 2) you should leave some room for system vm threads (GC, compilers);
> the more VMs, the more room you'll need.

In the past I found somewhere around 20 was good w/ the Python
runner... but I went and tried again!

With the Python runner I see these run times on just lucene core tests:

2 cpus: 72.2 sec
5 cpus: 35.0 sec
10 cpus: 28.1 sec
15 cpus: 26.2 sec
20 cpus: 26.0 sec
25 cpus: 27.5 sec

So it seems like after 15 cores it's not helping much... but then I
ran all tests (well, minus a few intermittently failing tests):

10 cpus: 88.3 sec
15 cpus: 80.2 sec
20 cpus: 77.4 sec
25 cpus: 76.7 sec

The above were just running on beast, but the Python runner can send
jobs (hacked up, just using ssh) to other machines... I have two other
non-beasts, on which I ran 3 JVMs each:

25 + 3 + 3 cpus: 64.7 sec
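FWIW, here's a quick throwaway script (Python, just re-deriving from
the core-test table above) to make the flattening explicit -- speedup
goes from ~2.1x at 5 JVMs to ~2.8x at 15, and is flat after that:

# Throwaway: speedup/efficiency from the Python-runner core-test times
# above, using the 2-JVM run as the baseline.
times = {2: 72.2, 5: 35.0, 10: 28.1, 15: 26.2, 20: 26.0, 25: 27.5}
base_cpus, base_time = 2, times[2]
for cpus in sorted(times):
    speedup = base_time / times[cpus]
    # Efficiency divides out the extra JVM count, so falling efficiency
    # shows where the curve flattens.
    efficiency = speedup / (float(cpus) / base_cpus)
    print('%2d jvms: %.2fx speedup, %3.0f%% efficiency' % (cpus, speedup, 100 * efficiency))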
With the new ant runner:

2 cpus:

[junit4] Slave 0: 0.16 .. 50.68 = 50.52s
[junit4] Slave 1: 0.16 .. 49.58 = 49.42s
[junit4] Execution time total: 50.73s
[junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

5 cpus:

[junit4] Slave 0: 0.19 .. 21.87 = 21.68s
[junit4] Slave 1: 0.16 .. 21.86 = 21.70s
[junit4] Slave 2: 0.16 .. 29.31 = 29.15s
[junit4] Slave 3: 0.16 .. 26.64 = 26.48s
[junit4] Slave 4: 0.19 .. 29.82 = 29.63s
[junit4] Execution time total: 29.89s
[junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

10 cpus:

[junit4] Slave 0: 0.21 .. 14.62 = 14.41s
[junit4] Slave 1: 0.22 .. 17.21 = 16.99s
[junit4] Slave 2: 0.23 .. 18.79 = 18.56s
[junit4] Slave 3: 0.23 .. 22.99 = 22.76s
[junit4] Slave 4: 0.20 .. 27.39 = 27.19s
[junit4] Slave 5: 0.19 .. 27.23 = 27.04s
[junit4] Slave 6: 0.23 .. 20.40 = 20.17s
[junit4] Slave 7: 0.19 .. 26.52 = 26.33s
[junit4] Slave 8: 0.24 .. 26.42 = 26.18s
[junit4] Slave 9: 0.22 .. 23.57 = 23.35s
[junit4] Execution time total: 27.52s
[junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

15 cpus:

[junit4] Slave 0: 0.29 .. 5.16 = 4.87s
[junit4] Slave 1: 0.26 .. 15.36 = 15.10s
[junit4] Slave 2: 0.26 .. 12.99 = 12.73s
[junit4] Slave 3: 0.29 .. 24.20 = 23.92s
[junit4] Slave 4: 0.26 .. 27.00 = 26.74s
[junit4] Slave 5: 0.33 .. 19.97 = 19.63s
[junit4] Slave 6: 0.31 .. 25.29 = 24.98s
[junit4] Slave 7: 0.24 .. 28.92 = 28.68s
[junit4] Slave 8: 0.33 .. 23.67 = 23.34s
[junit4] Slave 9: 0.43 .. 24.43 = 24.00s
[junit4] Slave 10: 0.40 .. 27.61 = 27.21s
[junit4] Slave 11: 0.22 .. 21.77 = 21.56s
[junit4] Slave 12: 0.22 .. 26.78 = 26.56s
[junit4] Slave 13: 0.26 .. 25.92 = 25.66s
[junit4] Slave 14: 0.35 .. 27.77 = 27.42s
[junit4] Execution time total: 28.98s
[junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

20 cpus:

[junit4] Slave 0: 0.35 .. 23.32 = 22.97s
[junit4] Slave 1: 0.30 .. 24.32 = 24.02s
[junit4] Slave 2: 0.35 .. 21.35 = 21.00s
[junit4] Slave 3: 0.37 .. 23.63 = 23.26s
[junit4] Slave 4: 0.38 .. 20.74 = 20.35s
[junit4] Slave 5: 0.30 .. 19.74 = 19.44s
[junit4] Slave 6: 0.36 .. 26.39 = 26.03s
[junit4] Slave 7: 0.46 .. 23.64 = 23.18s
[junit4] Slave 8: 0.43 .. 22.44 = 22.02s
[junit4] Slave 9: 0.30 .. 24.05 = 23.76s
[junit4] Slave 10: 0.41 .. 24.75 = 24.33s
[junit4] Slave 11: 0.30 .. 22.66 = 22.36s
[junit4] Slave 12: 0.30 .. 24.93 = 24.62s
[junit4] Slave 13: 0.40 .. 24.39 = 24.00s
[junit4] Slave 14: 0.24 .. 24.47 = 24.23s
[junit4] Slave 15: 0.45 .. 25.23 = 24.78s
[junit4] Slave 16: 0.34 .. 23.06 = 22.72s
[junit4] Slave 17: 0.23 .. 23.50 = 23.28s
[junit4] Slave 18: 0.30 .. 24.27 = 23.97s
[junit4] Slave 19: 0.30 .. 24.91 = 24.61s
[junit4] Execution time total: 26.52s
[junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

I only ran each once and the results are likely noisy... so it's hard
to pick a best CPU count...

>> Does the "Execution time total" include compilation, or is it just
>> the actual test runtime?
>
> The total is calculated before slave VMs are launched and after they
> complete, so even launch time is included. It's here:
>
> https://github.com/carrotsearch/randomizedtesting/blob/master/ant-junit4/src/main/java/com/carrotsearch/ant/tasks/junit4/JUnit4.java

Hmm, so does that include compile time (my numbers don't)? Sounds like
no? I'm also measuring from first launch to last finish.

>> Can this change run "across" the different groups of tests we have
>> (core, modules/*, contrib/*, solr/*, etc.)? I found that to be a
>> major bottleneck in the current "ant test"'s concurrency, ie we have
>> a pinch point after each group of tests (must wait for all JVMs to
>> finish before moving on to the next group...), but I think fixing
>> that in ant is going to be hard?
>
> If I understand you correctly the problem is that ANT in Lucene/Solr
> is calling to sub-module ANT scripts and these in turn invoke the
> test macro. So running everything from a single test task would be
> possible if we had a master-level test script; it's not directly
> related to how the tests are actually executed.

Yes, I think that's the problem!

Ideally ant would just gather up all "jobs" to run and then we'd
aggregate/distribute them across JVMs.
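Something like this toy sketch (hypothetical Python, definitely not
the real runner code; classpath/JVM flags omitted): every module's
suites go into one list up front, and N workers just pull from the
shared queue, so there's no per-module barrier:

# Toy sketch (hypothetical): gather every suite across all modules up
# front, then let N worker processes pull from one shared queue.
import multiprocessing
import subprocess

def run_suite(suite):
    # One job = one suite in its own JVM; classpath/flags omitted here.
    return suite, subprocess.call(['java', 'org.junit.runner.JUnitCore', suite])

if __name__ == '__main__':
    # In reality this list would be gathered from core, modules/*, solr/*, ...
    suites = ['org.apache.lucene.index.TestIndexWriter',
              'org.apache.lucene.search.TestBoolean2']
    pool = multiprocessing.Pool(processes=20)
    for suite, rc in pool.imap_unordered(run_suite, suites):
        print('%s: %s' % (suite, 'OK' if rc == 0 else 'FAILED'))
    pool.close()
    pool.join()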
> That JUnit4 task supports globbing in suite selectors so it could be
> executed with, say, -Dtests.class=org.apache.lucene.blah.* to restrict
> tests to run just a certain section of all tests, but include
> everything by default.

Cool.

> Don't know how it affects modularization though -- the tests will run
> faster but they'll be more difficult to maintain I guess.

Hmm... can we somehow keep today's directory structure but have ant
treat it as a single "module"? Or is the problem that we need to
change the JVM settings (eg CLASSPATH) per test module we have today,
so we must make separate modules for that...?

>> When I use the hacked up Python test runner (runAllTests.py in
>> luceneutil),
>
> This was my inspiration -- Robert pointed me at that, very helpful
> although you need your kind of machine to run so many SSH sessions :D

OK cool :) Actually it doesn't open any SSH sessions unless you give
it remote machines to use -- for the "local" JVMs it just forks.
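The dispatch is roughly this (a simplified sketch, not the actual
luceneutil code): a local slot forks the JVM directly, a remote slot
wraps the same command in ssh.

import subprocess

def launch(job_cmd, host=None):
    # job_cmd is the full java command line for one test job.
    if host is None:
        return subprocess.Popen(job_cmd)              # local: plain fork/exec
    return subprocess.Popen(['ssh', host] + job_cmd)  # remote: same cmd via ssh

# The "25 + 3 + 3" run above is just 25 local slots plus 3 slots each
# on two remote hosts.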
>> change (balancing the tests across JVMs). BUT: that's on current
>> trunk, vs your git clone which is somewhat old by now... so it's an
>> apples/pears comparison ;)
>
> Oh, come on, my fork is only a few days behind! :) I've pulled the
> current trunk and merged. I'd appreciate if you could re-run again,
> this time with, say, 5, 10, 15 and 20 threads. I wonder what the
> speedup/overhead is. Thanks.

I re-ran above -- looks like the times came down some, so the new ant
runner is basically the same as the Python runner (on core tests):
great!

Mike McCandless

http://blog.mikemccandless.com