On Sat, Jul 4, 2009 at 3:27 PM, Maciej Stachowiak <m...@apple.com> wrote:
> On Jul 4, 2009, at 11:47 AM, Mike Belshe wrote:
>
>> I'd like to understand what's going to happen with SunSpider in the
>> future. Here is a set of questions and criticisms. I'm interested in
>> how these can be addressed.
>>
>> There are 3 areas I'd like to see improved in SunSpider, some of which
>> we've discussed before:
>>
>> #1: SunSpider is currently version 0.9. Will SunSpider ever change? Or
>> is it static?
>>
>> I believe that benchmarks need to be able to move with the times. As JS
>> engines change and improve, and as new areas need to be benchmarked, we
>> need to be able to roll the version, fix bugs, and benchmark new
>> features. The SunSpider version has not changed for ~2 years. How can
>> we change this situation? Are there plans for a new version already
>> underway?
>
> I've been thinking about updating SunSpider for some time. There are two
> categories of changes I've thought about:
>
> 1) Quality-of-implementation changes to the harness. Among these might
> be the ability to use the harness with multiple test sets. That would be
> 1.0.
>
> 2) An updated set of tests - the current tests are too short, and don't
> adequately cover some areas of the language. I'd like to make the tests
> take at least 100ms each on modern browsers on recent hardware. I'd also
> be interested in incorporating some of the tests from the v8 benchmark
> suite, if the v8 developers were ok with this. That would be SunSpider
> 2.0.
>
> The reason I've been hesitant to make any changes is that the press and
> independent analysts latched on to SunSpider as a way of comparing
> JavaScript implementations. Originally, it was primarily intended to be
> a tool for the WebKit team to help us make our JavaScript faster.
> However, now that third parties are relying on it, there are two things
> I want to be really careful about:
>
> a) I don't want to invalidate people's published data, so significant
> changes to the test content would need to be published as a clearly
> separate version.
>
> b) I want to avoid accidentally or intentionally making changes that are
> biased in favor of Safari or WebKit-based browsers in general, or that
> even give that impression. That would hurt the test's credibility. When
> we first made SunSpider, Safari actually didn't do that great on it,
> which I think helped people believe that the test wasn't designed to
> make us look good; it was designed to be a relatively unbiased
> comparison.
>
> Thus, any change to the content would need to be scrutinized in some
> way. I'm not sure what it would take to get widespread agreement that a
> 2.0 content set is fair, but I agree it's time to make one soonish
> (probably before the end of the year). Thoughts on this are welcome.
>
>> #2: Use of summing as a scoring mechanism is problematic
>>
>> Unfortunately, sum-based scoring does not withstand the test of time as
>> browsers improve. When the benchmark was first introduced, each test
>> was equally weighted and reasonably large. Over time, however, the test
>> becomes dominated by the slowest tests - effectively, the weighting of
>> the individual tests varies with the performance of the JS engine under
>> test. Today's engines spend ~50% of their time on just the string and
>> date tests. The other tests are largely irrelevant at this point, and
>> are becoming less relevant every day. Eventually many of the tests will
>> take near-zero time, and the benchmark will have to be scrapped unless
>> we figure out a better way to score it.
>> Benchmarking research which long pre-dates SunSpider confirms that
>> geometric means provide a better basis for comparison:
>> http://portal.acm.org/citation.cfm?id=5673
>>
>> Can future versions of the SunSpider driver be made so that they won't
>> become irrelevant over time?
>
> Use of summation instead of the geometric mean was a considered choice.
> The intent is that engines should focus on whatever is slowest. A
> simplified example: let's say it's estimated that the likely workload in
> the field will consist of 50% Operation A and 50% Operation B, and I can
> benchmark them in isolation. Now let's say that in implementation Foo
> these operations are equally fast, while in implementation Bar,
> Operation A is 4x as fast as in Foo and Operation B is 4x as slow as in
> Foo. A comparison by geometric means would imply that Foo and Bar are
> equally good, but Bar would actually be twice as slow on the intended
> workload.

BTW - the way to work around this is to have enough sub-benchmarks that
this just doesn't happen. If we have the right test coverage, it seems
unlikely to me that a code change would dramatically improve exactly one
test at an equally dramatic expense to exactly one other test. I'm not
saying it is impossible - just that code changes don't generally cause
that behavior. To combat this we can implement a broader base of
benchmarks as well as longer-running tests that are not "too micro".

This brings up another problem with summation. The only case where
summation 'works' is if the benchmark workload is *the right workload* to
measure what browsers do. In that case, your argument that slowing down
one portion of the benchmark at the expense of another should be measured
is reasonable. But I think the benchmark should be capable of adding more
benchmarks over time - potentially even covering corner cases and
less-frequented code. These kinds of microbenchmarks have no place in a
summation-based scoring model, because we can't weight them accurately.
Using geometric means, I could still weight the low-priority benchmark at
1/2 (or whatever) the weight of other benchmarks and have meaningful
overall scores. However, in a sum-based model, a browser which does
really badly on one low-priority test can get a horrible score even if it
is better than all the other browsers on every other benchmark.
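To make the weighting point concrete, here is a rough sketch; the two
engines, the per-test times, and the weights below are invented purely
for illustration (print is the command-line shell function - substitute
console.log in a browser):

    // Hypothetical per-test times in ms; lower is better. "lowpri"
    // stands in for a corner-case microbenchmark we care less about.
    var engineFoo = { strings: 100, dates: 100, regexp: 100, lowpri: 100 };
    var engineBar = { strings:  25, dates: 100, regexp: 100, lowpri: 400 };

    // Per-test weights for the geometric mean; unlisted tests get 1.
    var weights = { lowpri: 0.5 };

    function sumScore(times) {
        var total = 0;
        for (var test in times)
            total += times[test];
        return total;
    }

    function weightedGeometricMean(times) {
        var logSum = 0, weightSum = 0;
        for (var test in times) {
            var w = (test in weights) ? weights[test] : 1;
            logSum += w * Math.log(times[test]);
            weightSum += w;
        }
        return Math.exp(logSum / weightSum);
    }

    // Summation: Foo=400, Bar=625 - Bar's one bad low-priority test
    // swamps its 4x win on strings. Weighted geometric mean: Foo=100,
    // Bar=~82 - Bar gets credit for the tests we said matter most.
    // (With all weights equal to 1, both engines score exactly 100,
    // which is the tie your Foo/Bar example points out.)
    print("sum: Foo=" + sumScore(engineFoo) + ", Bar=" + sumScore(engineBar));
    print("geo: Foo=" + weightedGeometricMean(engineFoo).toFixed(1) +
          ", Bar=" + weightedGeometricMean(engineBar).toFixed(1));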
Mike

> Of course, doing this requires a judgment call on the reasonable
> balance of different kinds of code, and that balance needs to be
> re-evaluated periodically. But tests based on geometric means also make
> an implied judgment call. The operations comprising each individual
> test are added linearly. The test then judges that these particular
> combinations are each equally important.

>> #3: The SunSpider harness has a variance problem due to CPU power
>> savings modes.
>>
>> Because the test runs a tiny amount of JavaScript (often under 10ms)
>> followed by a 500ms sleep, CPUs will go into power savings modes
>> between test runs. This radically changes the performance measurements
>> and makes comparison between two runs dependent on the user's power
>> savings mode. To demonstrate this, run SunSpider twice: once with the
>> Windows "balanced" (default) power setting, and again with "high
>> performance". It's easy to see skews of 30% between these two modes. I
>> think we should change the test harness to avoid such accidental
>> effects.
>>
>> (BTW - if you change SunSpider's sleep from 500ms to 10ms, the test
>> runs in just a few seconds. It is unclear to me why the pauses are so
>> large. My browser gets a 650ms score, so five runs should take
>> ~3000ms. But due to the pauses, it takes over a minute to run the
>> test, leaving the CPU ~96% idle.)
>
> I think the pauses were made large in an attempt to get stable,
> repeatable results, but they are probably longer than necessary to
> achieve this. I agree with you that the artifacts in "balanced" power
> mode are a problem. Do you know what timer thresholds avoid the effect?
> I think this would be a reasonable "1.0" kind of change.
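For anyone who wants to experiment with the thresholds, the shape of the
driver loop is roughly the following. This is a simplified stand-in, not
the actual SunSpider harness code, and the busy-loop "test" is just a
placeholder for real test work:

    // Busy-loop for roughly `ms` milliseconds, standing in for one
    // SunSpider-sized chunk of test work.
    function fakeTest(ms) {
        var end = new Date().getTime() + ms;
        while (new Date().getTime() < end) {}
    }

    // The pause in question: at 500ms the CPU sits mostly idle, long
    // enough for a "balanced" power plan to clock down between tests;
    // at 10ms the run finishes in a few seconds at full clock speed.
    var PAUSE_MS = 500;  // try 10 and compare scores across power plans

    var tests = [ function () { fakeTest(10); },
                  function () { fakeTest(10); },
                  function () { fakeTest(10); } ];
    var results = [];
    var current = 0;

    function runNextTest() {
        if (current >= tests.length) {
            alert("times (ms): " + results.join(", "));
            return;
        }
        var start = new Date();
        tests[current]();
        results.push(new Date() - start);
        current++;
        setTimeout(runNextTest, PAUSE_MS);
    }

    runNextTest();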
>> Possible solution:
>>
>> The Dromaeo test suite already incorporates the SunSpider individual
>> tests under a new benchmark harness which fixes all 3 of the above
>> issues. Thus, one approach would be to retire SunSpider 0.9 in favor
>> of Dromaeo: http://dromaeo.com/?sunspider
>>
>> Dromaeo has also done a lot of good work to ensure statistical
>> significance of the results. Once we have a better benchmarking
>> framework, it would be great to build a new microbenchmark mix which
>> more realistically exercises today's JavaScript.
>
> In my experience, Dromaeo gives significantly more variable results
> than SunSpider. I don't entirely trust the test to avoid interference
> from non-JS browser code executing, and I am not sure their statistical
> analysis is sound. In addition, using sum instead of geometric mean was
> a considered choice. It would be easy to change in SunSpider if we
> wanted to, but I don't think we should. Also, I don't think Dromaeo has
> a pure command-line harness, and it depends on the server, so it can't
> easily be used offline or with the network disabled.
>
> Many things about the way the SunSpider harness works are designed to
> give precise and repeatable results. That's very important to us,
> because when doing performance work we often want to gauge the impact
> of changes that have a small performance effect. With Dromaeo there is
> too much noise to do this effectively, at least in my past experience.
>
> Regards,
> Maciej