Trying again... hopefully this time NOT hitting this nasty Chrome bug:
http://code.google.com/p/chromium/issues/detail?id=102407

On Wed, Jan 4, 2012 at 5:11 PM, Dawid Weiss <[email protected]> wrote:

> I forgot how nasty your beast computer is... 20 slaves?! Remind me how
> many actual (real) cores do you have?

Beast has two 6-core CPUs (x5680 Xeons), so 12 real cores (24 with
hyperthreading).

> Did you experiment with different slave numbers? I ask because I
> noticed that:
>
> 1) it makes little sense to run cpu-intense tests on hyper-cores,
> doesn't yield much if anything,
> 2) you should leave some room for system vm threads (GC, compilers);
> the more VMs, the more room you'll need.

In the past I found somewhere around 20 was good w/ the Python
runner... but I went and tried again!

With the Python runner I see these run times on just lucene core tests:

2 cpus: 72.2 sec
5 cpus: 35.0 sec
10 cpus: 28.1 sec
15 cpus: 26.2 sec
20 cpus: 26.0 sec
25 cpus: 27.5 sec

So it seems like after 15 cores it's not helping much... but then I
ran all tests (well, minus a few intermittently failing tests):

10 cpus: 88.3 sec
15 cpus: 80.2 sec
20 cpus: 77.4 sec
25 cpus: 76.7 sec

The above were just running on beast, but the Python runner can send
jobs (hacked up, just using ssh) to other machines... I have two other
non-beasts, on which I ran 3 JVMs each:

25 + 3 + 3 cpus: 64.7 sec
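FWIW, here's a quick throwaway script (Python, just re-deriving from
the core-test table above) to make the flattening explicit -- speedup
goes from ~2.1x at 5 JVMs to ~2.8x at 15, and is flat after that:

# Throwaway: speedup/efficiency from the Python-runner core-test times
# above, using the 2-JVM run as the baseline.
times = {2: 72.2, 5: 35.0, 10: 28.1, 15: 26.2, 20: 26.0, 25: 27.5}
base_cpus, base_time = 2, times[2]
for cpus in sorted(times):
    speedup = base_time / times[cpus]
    # Efficiency divides out the extra JVM count, so falling efficiency
    # shows where the curve flattens.
    efficiency = speedup / (float(cpus) / base_cpus)
    print('%2d jvms: %.2fx speedup, %3.0f%% efficiency' % (cpus, speedup, 100 * efficiency))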
With the new ant runner:

2 cpus:

[junit4] Slave 0: 0.16 .. 50.68 = 50.52s
[junit4] Slave 1: 0.16 .. 49.58 = 49.42s
[junit4] Execution time total: 50.73s
[junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

5 cpus:

[junit4] Slave 0: 0.19 .. 21.87 = 21.68s
[junit4] Slave 1: 0.16 .. 21.86 = 21.70s
[junit4] Slave 2: 0.16 .. 29.31 = 29.15s
[junit4] Slave 3: 0.16 .. 26.64 = 26.48s
[junit4] Slave 4: 0.19 .. 29.82 = 29.63s
[junit4] Execution time total: 29.89s
[junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

10 cpus:

[junit4] Slave 0: 0.21 .. 14.62 = 14.41s
[junit4] Slave 1: 0.22 .. 17.21 = 16.99s
[junit4] Slave 2: 0.23 .. 18.79 = 18.56s
[junit4] Slave 3: 0.23 .. 22.99 = 22.76s
[junit4] Slave 4: 0.20 .. 27.39 = 27.19s
[junit4] Slave 5: 0.19 .. 27.23 = 27.04s
[junit4] Slave 6: 0.23 .. 20.40 = 20.17s
[junit4] Slave 7: 0.19 .. 26.52 = 26.33s
[junit4] Slave 8: 0.24 .. 26.42 = 26.18s
[junit4] Slave 9: 0.22 .. 23.57 = 23.35s
[junit4] Execution time total: 27.52s
[junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

15 cpus:

[junit4] Slave 0: 0.29 .. 5.16 = 4.87s
[junit4] Slave 1: 0.26 .. 15.36 = 15.10s
[junit4] Slave 2: 0.26 .. 12.99 = 12.73s
[junit4] Slave 3: 0.29 .. 24.20 = 23.92s
[junit4] Slave 4: 0.26 .. 27.00 = 26.74s
[junit4] Slave 5: 0.33 .. 19.97 = 19.63s
[junit4] Slave 6: 0.31 .. 25.29 = 24.98s
[junit4] Slave 7: 0.24 .. 28.92 = 28.68s
[junit4] Slave 8: 0.33 .. 23.67 = 23.34s
[junit4] Slave 9: 0.43 .. 24.43 = 24.00s
[junit4] Slave 10: 0.40 .. 27.61 = 27.21s
[junit4] Slave 11: 0.22 .. 21.77 = 21.56s
[junit4] Slave 12: 0.22 .. 26.78 = 26.56s
[junit4] Slave 13: 0.26 .. 25.92 = 25.66s
[junit4] Slave 14: 0.35 .. 27.77 = 27.42s
[junit4] Execution time total: 28.98s
[junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

20 cpus:

[junit4] Slave 0: 0.35 .. 23.32 = 22.97s
[junit4] Slave 1: 0.30 .. 24.32 = 24.02s
[junit4] Slave 2: 0.35 .. 21.35 = 21.00s
[junit4] Slave 3: 0.37 .. 23.63 = 23.26s
[junit4] Slave 4: 0.38 .. 20.74 = 20.35s
[junit4] Slave 5: 0.30 .. 19.74 = 19.44s
[junit4] Slave 6: 0.36 .. 26.39 = 26.03s
[junit4] Slave 7: 0.46 .. 23.64 = 23.18s
[junit4] Slave 8: 0.43 .. 22.44 = 22.02s
[junit4] Slave 9: 0.30 .. 24.05 = 23.76s
[junit4] Slave 10: 0.41 .. 24.75 = 24.33s
[junit4] Slave 11: 0.30 .. 22.66 = 22.36s
[junit4] Slave 12: 0.30 .. 24.93 = 24.62s
[junit4] Slave 13: 0.40 .. 24.39 = 24.00s
[junit4] Slave 14: 0.24 .. 24.47 = 24.23s
[junit4] Slave 15: 0.45 .. 25.23 = 24.78s
[junit4] Slave 16: 0.34 .. 23.06 = 22.72s
[junit4] Slave 17: 0.23 .. 23.50 = 23.28s
[junit4] Slave 18: 0.30 .. 24.27 = 23.97s
[junit4] Slave 19: 0.30 .. 24.91 = 24.61s
[junit4] Execution time total: 26.52s
[junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

I only ran each once and the results are likely noisy... so it's hard
to pick a best CPU count...

>> Does the "Execution time total" include compilation, or is it just
>> the actual test runtime?
>
> The total is calculated before slave VMs are launched and after they
> complete, so even launch time is included. It's here:
>
> https://github.com/carrotsearch/randomizedtesting/blob/master/ant-junit4/src/main/java/com/carrotsearch/ant/tasks/junit4/JUnit4.java

Hmm, so does that include compile time (my numbers don't)? Sounds like
no? I'm also measuring from first launch to last finish.

>> Can this change run "across" the different groups of tests we have
>> (core, modules/*, contrib/*, solr/*, etc.)? I found that to be a
>> major bottleneck in the current "ant test"'s concurrency, ie we have
>> a pinch point after each group of tests (must wait for all JVMs to
>> finish before moving on to the next group...), but I think fixing
>> that in ant is going to be hard?
>
> If I understand you correctly the problem is that ANT in Lucene/Solr
> is calling to sub-module ANT scripts and these in turn invoke the
> test macro. So running everything from a single test task would be
> possible if we had a master-level test script; it's not directly
> related to how the tests are actually executed.

Yes, I think that's the problem!

Ideally ant would just gather up all "jobs" to run and then we'd
aggregate/distribute them across JVMs.
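Something like this toy sketch (hypothetical Python, definitely not
the real runner code; classpath/JVM flags omitted): every module's
suites go into one list up front, and N workers just pull from the
shared queue, so there's no per-module barrier:

# Toy sketch (hypothetical): gather every suite across all modules up
# front, then let N worker processes pull from one shared queue.
import multiprocessing
import subprocess

def run_suite(suite):
    # One job = one suite in its own JVM; classpath/flags omitted here.
    return suite, subprocess.call(['java', 'org.junit.runner.JUnitCore', suite])

if __name__ == '__main__':
    # In reality this list would be gathered from core, modules/*, solr/*, ...
    suites = ['org.apache.lucene.index.TestIndexWriter',
              'org.apache.lucene.search.TestBoolean2']
    pool = multiprocessing.Pool(processes=20)
    for suite, rc in pool.imap_unordered(run_suite, suites):
        print('%s: %s' % (suite, 'OK' if rc == 0 else 'FAILED'))
    pool.close()
    pool.join()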
> That JUnit4 task supports globbing in suite selectors so it could be
> executed with, say, -Dtests.class=org.apache.lucene.blah.* to restrict
> tests to run just a certain section of all tests, but include
> everything by default.

Cool.

> Don't know how it affects modularization though -- the tests will run
> faster but they'll be more difficult to maintain I guess.

Hmm... can we somehow keep today's directory structure but have ant
treat it as a single "module"? Or is the problem that we need to
change the JVM settings (eg CLASSPATH) per test module we have today,
so we must make separate modules for that...?

>> When I use the hacked up Python test runner (runAllTests.py in
>> luceneutil),
>
> This was my inspiration -- Robert pointed me at that, very helpful
> although you need your kind of machine to run so many SSH sessions :D

OK cool :) Actually it doesn't open any SSH sessions unless you give
it remote machines to use -- for the "local" JVMs it just forks.
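The dispatch is roughly this (a simplified sketch, not the actual
luceneutil code): a local slot forks the JVM directly, a remote slot
wraps the same command in ssh.

import subprocess

def launch(job_cmd, host=None):
    # job_cmd is the full java command line for one test job.
    if host is None:
        return subprocess.Popen(job_cmd)              # local: plain fork/exec
    return subprocess.Popen(['ssh', host] + job_cmd)  # remote: same cmd via ssh

# The "25 + 3 + 3" run above is just 25 local slots plus 3 slots each
# on two remote hosts.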
>> change (balancing the tests across JVMs). BUT: that's on current
>> trunk, vs your git clone which is somewhat old by now... so it's an
>> apples/pears comparison ;)
>
> Oh, come on, my fork is only a few days behind! :) I've pulled the
> current trunk and merged. I'd appreciate if you could re-run again,
> this time with, say, 5, 10, 15 and 20 threads. I wonder what the
> speedup/overhead is. Thanks.

I re-ran above -- looks like the times came down some, so the new ant
runner is basically the same as the Python runner (on core tests):
great!

Mike McCandless

http://blog.mikemccandless.com