On 08/31/2015 11:25 PM, Ilia Mirkin wrote:
> On Tue, Sep 1, 2015 at 1:48 AM, Eirik Byrkjeflot Anonsen
> <ei...@eirikba.org> wrote:
>> Ian Romanick <i...@freedesktop.org> writes:
>>
>>> ping. :)
>>>
>>> On 08/10/2015 11:48 AM, Matt Turner wrote:
>>>> On Mon, Aug 10, 2015 at 10:12 AM, Ian Romanick <i...@freedesktop.org> wrote:
>>>>> From: Ian Romanick <ian.d.roman...@intel.com>
>>>>>
>>>>> On many CPU-limited applications, this is *the* hot path.  The idea
>>>>> is to generate per-API versions of brw_draw_prims that elide some
>>>>> checks.  This patch removes render-mode and "is everything in VBOs"
>>>>> checks from core-profile contexts.
>>>>>
>>>>> On my IVB laptop (which may have experienced thermal throttling):
>>>>>
>>>>> Gl32Batch7: 3.70955% +/- 1.11344%
>>>>
>>>> I'm getting 3.18414% +/- 0.587956% (n=113) on my IVB, which probably
>>>> matches your numbers depending on your value of n.
>>>>
>>>>> OglBatch7: 1.04398% +/- 0.772788%
>>>>
>>>> I'm getting 1.15377% +/- 1.05898% (n=34) on my IVB, which probably
>>>> matches your numbers depending on your value of n.
>>>
>>> This is another thing that makes me feel a little uncomfortable with
>>> the way we've done performance measurements in the past.  If I run my
>>> test before and after this patch for 121 iterations, which I have
>>> done, I can cut the data at any point and oscillate between "no
>>> difference" and X% +/- some-large-fraction-of-X%.  Since the before
>>> and after code for the compatibility profile path should be
>>> identical, "no difference" is the only believable result.
>>
>> That's pretty much expected, I believe.  In essence, you are running
>> 121 tests, each with a 95% confidence interval, and so should expect
>> somewhere around 5 "significant difference" results.  That's not
>> entirely true of course, since these are not 121 *independent* tests,
>> but the basic problem remains.
>
> (more stats rants follow)
>
> While my job title has never been 'statistician', I've been around a
> bunch of them.  Just want to correct this... let's forget about these
> tests, but instead think about coin flips (of a potentially unfair
> coin).  What you're doing is flipping the coin 100 times, and then
> looking at the number of times it came up heads and tails.  From that
> you're inferring the mean of the distribution.  Obviously, the more
> times you do the flip, the more sure you can be of your result.  The
> "suredness" is expressed as a confidence interval.  A 95% CI means
> that for 95% of such experiments (i.e. "flip a coin 100 times to
> determine its true heads:tails ratio"), the *true* mean of the
> distribution will lie within the confidence interval (and conversely,
> for 5% of such experiments, the true mean will be outside of the
> interval).  Note how this is _not_ "the mean has a 95% chance of lying
> in the interval" or anything like that.  One of these runs of 121
> iterations is a single "experiment".
>
> Bringing this back to what you guys are doing, which is measuring some
> metric (say, time), which is hardly binomial, but one might hope that
For the particular test I'm looking at here, I think it should be
reasonably close.  The test itself runs a small set of frames a few
times (3 or 4) and logs the average FPS for the whole run.  It seems
like the "distribution of means is Gaussian" should apply, yeah?

> the amount of time that a particular run takes on a particular machine
> at a particular commit is normal.  Given that, after 100 runs, you can
> estimate that the "true" mean runtime is within a CI.  You're then
> comparing 2 CIs to determine the % change between the two
> distributions, and trying to ascertain whether they are different and
> by how much.
>
> Now, no (finite) amount of experimentation will bring you a CI of 0.
> So setting out to *measure* the impact of a change is meaningless
> unless you have some precise form of measurement (e.g. lines of code).
> All you can do is ask the question "is the change > X".  And for any
> such X, you can compute the number of runs that you'd need in order to
> get a CI bound that is "that tight".  You could work this out
> mathematically, and it depends on some of the absolute values in
> question, but empirically it seems like for 50 runs, you get a CI
> width of ~1%.  If you're trying to demonstrate changes that are less
> than 1%, or demonstrate that the change is no more than 1%, then this
> is fine.  If you want to demonstrate that the change is no more than
> some smaller change, well, these things go like N^2, i.e. if it's 50
> runs for 1%, it's 200 runs for 0.5%, etc.

That sounds familiar... the size of the expected difference determines
the lower bound on the number of required experiments.  I did take
"Statistics for Engineers" not that long ago.  Lol.  I think I still
have my textbook.  I'll dig around in it.

For a bunch of the small changes, I don't care too much what the
difference is.  I just want to know whether after is better than
before.

> This is all still subject to the normal distribution assumption as I
> mentioned earlier.  You could do some empirical tests and figure out
> what the "not-a-normal-distribution" factor is for you, might be as
> high as 1.5.
>
> -ilia
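To make the CI arithmetic above concrete, here is a rough sketch of the
two calculations being discussed: a ~95% confidence interval for the
mean FPS of a set of runs, and an estimate of how many runs are needed
to get the interval down to a target half-width.  This is only an
illustration, not anything from our existing scripts: the function
names and the FPS numbers are made up, it assumes the run-to-run
variation really is roughly normal, and it uses the z = 1.96 normal
quantile instead of a proper t quantile (close enough for the n = 30
to 120 run counts quoted in this thread).

    import math
    import statistics

    def mean_ci_95(samples):
        # ~95% confidence interval for the mean of `samples`, using the
        # normal approximation (z = 1.96).
        n = len(samples)
        mean = statistics.mean(samples)
        sem = statistics.stdev(samples) / math.sqrt(n)  # standard error of the mean
        return mean, 1.96 * sem

    def runs_needed(stdev, half_width):
        # Runs needed so the 95% CI is no wider than +/- half_width.
        # Halving the target width quadruples the run count -- the
        # "50 runs for 1%, 200 runs for 0.5%" behaviour described above.
        return math.ceil((1.96 * stdev / half_width) ** 2)

    # Hypothetical FPS numbers from repeated runs of one benchmark:
    runs = [102.1, 101.7, 103.0, 102.4, 101.9, 102.6]
    mean, hw = mean_ci_95(runs)
    print("mean = %.2f FPS, 95%% CI = +/- %.2f" % (mean, hw))
    print("runs for +/- 0.5%% of the mean: %d"
          % runs_needed(statistics.stdev(runs), 0.005 * mean))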
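And since the question I actually care about is just "is after better
than before", the comparison itself can be sketched the same way: put a
confidence interval around the *difference* of the two means and see
whether it excludes zero.  Again this is only an assumed illustration
with made-up numbers, using the normal approximation rather than a
proper two-sample t-test.

    import math
    import statistics

    def diff_ci_95(before, after):
        # ~95% CI for mean(after) - mean(before), normal approximation.
        # If the whole interval is above zero, "after" is faster (higher
        # FPS); if it straddles zero, the data only supports "no
        # measurable difference" at this number of runs.
        diff = statistics.mean(after) - statistics.mean(before)
        var = (statistics.variance(before) / len(before) +
               statistics.variance(after) / len(after))
        hw = 1.96 * math.sqrt(var)
        return diff - hw, diff + hw

    # Hypothetical before/after FPS runs:
    lo, hi = diff_ci_95([101.8, 102.3, 101.5, 102.0],
                        [103.1, 102.7, 103.4, 102.9])
    print("difference is in [%.2f, %.2f] FPS at ~95%%" % (lo, hi))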