Over the next couple of days I'll work on automating the production of a graph of this "magic number".
Also, I'm doing repeat testing now to get a feel for the spread in the numbers that cachegrind produces.

Stephan

2012/12/5 Michael Meeks <michael.me...@suse.com>
> Hi guys,
>
> Forwarding Josef's rather interesting mail (with permission) to the dev
> list - since it's generally useful :-) Stephan is working on plugging a
> nice callgrinding script into tinderbox building so we can build an
> automated performance tracker that will catch even small - 0.5% type -
> performance regressions which can accumulate over time - and do it in a
> reproducible way, independent of the underlying hardware.
>
> Thanks so much for working on this Stephan; it seems the punch-line
> formula for the magic number should be:
>
> > Actually, what I have in recent versions of {Q,K}Cachegrind is this:
> > CEst = Ir + 10 Bm + 10 L1m + 20 Ge + 100 L2m + 100 LLm
>
> ATB!
>
> Michael.
>
> -------- Forwarded Message --------
>
> Hi,
>
> On 04.12.2012 15:07, Julian Seward wrote:
> >
> > Josef, greetings. See below ..
> >
> > Michael: Josef is the expert on this, so forwarding it to him.
> >
> > It used to be "1 * #insns + 10 * #L1 misses + 100 * #L2 misses",
> > but I wouldn't put too much reliance on it. It was just something
> > that (IIRC!) NickN and I kludged up during a coffee break probably
> > about a decade ago at cl.cam.ac.uk. Like so many other bits of
> > Valgrind :-) (*)
>
> AFAIK, Cachegrind always just printed out the raw event numbers.
> Also, nowhere in Callgrind will you find that magic formula ;-)
> I think I was actually the one who came up with that formula as
> a rough "cycle estimation" in KCachegrind ;-)
>
> Actually, doing a somewhat accurate time estimation is quite tricky.
> The rule of thumb should be: if you come up with an optimization
> reducing miss counts (and thus the time estimation formula above),
> always also check it against real time.
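The old rule-of-thumb estimate Julian mentions maps directly onto the event counters cachegrind prints. As a hedged illustration (the event names follow cachegrind's output format, but the counts below are invented for the example):

```python
# Sketch of the old "coffee-break" estimate:
#   cycles ~= 1 * #insns + 10 * #L1_misses + 100 * #L2_misses
# Event names (Ir, I1mr, D1mr, ...) follow cachegrind's summary output;
# the counts in `events` are made up for illustration.

def old_cycle_estimate(ev):
    l1_misses = ev["I1mr"] + ev["D1mr"] + ev["D1mw"]   # all first-level misses
    l2_misses = ev["ILmr"] + ev["DLmr"] + ev["DLmw"]   # all last-level misses
    return ev["Ir"] + 10 * l1_misses + 100 * l2_misses

events = {"Ir": 1_000_000, "I1mr": 500, "D1mr": 8_000, "D1mw": 1_500,
          "ILmr": 50, "DLmr": 900, "DLmw": 50}
print(old_cycle_estimate(events))  # 1200000
```

Note how even 1,000 last-level misses (0.1% of instructions) contribute as much to the estimate as 100,000 L1 misses, which is exactly why the weighting is so sensitive to the penalty constants chosen.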
>
> In general, any estimation using cache metrics gets more accurate the
> worse the cache behavior of the application is, as the time needed for
> memory accesses is more predictable and hides any smaller latencies
> within a core, especially with out-of-order execution - and even ARM
> application cores are out-of-order nowadays (Cortex A9/A15).
>
> The most accurate thing next to real measurement would be cycle-accurate
> simulation, which of course is out of scope. There is some nice work on
> a simpler, yet quite accurate processor timing model using so-called
> "interval simulation", as implemented in the Sniper simulator (see
> http://snipersim.org). However, it's roughly an order of magnitude
> slower than Cachegrind/Callgrind.
>
> The biggest issues with the simple formulas above are:
> (1) Hardware prefetching is not taken into account, which often allows
> streaming of data at maximum bandwidth from memory.
> (2) Latency-hiding effects are not included; e.g. it can easily happen
> that, with lots of memory accesses, it does not matter whether the code
> was actually compiled with -O3 or -O0. Further, it can make a huge
> difference whether two consecutive memory accesses depend on each other
> or not (i.e. one hiding the latency of the other or not).
> (3) Saturation effects are not taken into account. E.g. if all cores on
> a multicore chip do memory accesses at the same time, the limited
> bandwidth going off-chip will slow down each of them.
>
> Thus, the accuracy of the simple formula very much depends on "average"
> application behavior, in the hope that the difficult effects mentioned
> above do not each play too much of a role.
>
> BTW, one can argue whether L1/LL misses on write should appear in the
> formula at all. Writes are usually done by the hardware in the
> background. However, if there is a memcpy operation doing a lot of
> consecutive writes, write buffers will fill up and throttle execution
> in the end.
>
> > That said, I am sure it got checked by Josef :-)
>
> Despite the above issues, I still think that it is useful to combine
> the cache simulation results into one value. It makes it easier to get
> an overview of the cache behavior.
>
> The thing is: to make time estimation more accurate, you need to extend
> the model, which results in slower simulation. I have some ideas to
> make the cache model more accurate for time estimation, but I am not
> sure it is worth it (real measurements are faster anyway).
>
> E.g. there is an optional hardware prefetch simulation in Callgrind.
> We can introduce a new event "LL miss detected by prefetcher" and give
> it e.g. a much lower penalty, depending on the bandwidth your hardware
> can do: e.g. if you can do 1 GB/s on a 1 GHz processor, this means 1
> byte on average per clock, thus with a 64-byte line size one line can
> be fetched every 64 cycles. As every LL miss will also be an L1 miss
> (with a penalty of 10), you would go for around 50 as the penalty for
> the new prefetchable memory access event.
>
> For latency effects, it gets more complicated, as you want to check
> whether there is a dependency between memory accesses, and do something
> similar to the above-mentioned interval simulation.
>
> For the saturation effects, one needs to take simultaneous execution of
> multiple threads into account.
>
> > Josef: is there a more recent version of the pseudo-cycles formula?
> > Also, maybe we can incorporate the number of global bus locks into
> > the formula? Or does the cost of a bus lock depend also on which
> > cache the selected line is in?
>
> Actually, what I have in recent versions of {Q,K}Cachegrind is this:
>
> CEst = Ir + 10 Bm + 10 L1m + 20 Ge + 100 L2m + 100 LLm
>
> L2m is there for backward compatibility with old VG versions. If a
> given event type is not measured, the formula gets truncated
> accordingly.
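Josef's CEst formula, together with its truncation rule (drop any term whose event type was not measured), could be implemented along these lines. This is a sketch, not {Q,K}Cachegrind's actual code; only the event names and weights come from the mail above:

```python
# Sketch of the {Q,K}Cachegrind cycle estimation from the mail:
#   CEst = Ir + 10 Bm + 10 L1m + 20 Ge + 100 L2m + 100 LLm
# If an event type was not measured, its term is simply dropped
# ("the formula gets truncated accordingly").

CEST_WEIGHTS = {"Ir": 1, "Bm": 10, "L1m": 10, "Ge": 20, "L2m": 100, "LLm": 100}

def cycle_estimate(measured):
    """measured: dict of event name -> count; unmeasured events are skipped."""
    return sum(w * measured[ev]
               for ev, w in CEST_WEIGHTS.items() if ev in measured)

# A run without branch or lock simulation (no Bm, no Ge) on a modern
# Valgrind (LLm rather than L2m); counts are invented for illustration.
print(cycle_estimate({"Ir": 1_000_000, "L1m": 10_000, "LLm": 1_000}))  # 1200000
```

A hypothetical "prefetchable LL miss" event, as discussed above, would just be one more entry in `CEST_WEIGHTS` with a weight of around 50, following the 1 byte/clock and 64-byte-line arithmetic in the mail.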
>
> > I mention this because I remember that at some point the LO crew were
> > complaining that the cycle count from Callgrind was completely
> > misleading due to not taking global bus locks into account.
>
> Hm, ok. The "20 Ge" estimates that locking can often be resolved at the
> L2/L3 level. However, I never did any measurements myself. I think you
> mentioned something ...
>
> Josef
>
> > J
> >
> > (*) E.g. the first version of the memcheck leak checker was written
> > during a 4-hour plane-changing stop at Chicago O'Hare.
> >
> > ---------- Forwarded Message ----------
> >
> > Subject: Re: [Fwd: Re: callgrind / profiling bits ? ...]
> > Date: Tuesday, December 04, 2012, 12:35:15 am
> > From: Michael Meeks <michael.me...@suse.com>
> ...
> > Julian - incidentally - where does the latest/greatest set of
> > fudge-factors for predicting pseudo-cycles from L1 / L2 misses etc.
> > live? :-) We'd like to use it in a spreadsheet we want to use to
> > track incremental LibreOffice performance (using callgrind, of
> > course).
> ...
> > At least from my perspective, I'd -love- to see what impact simple
> > day-to-day work has on performance, particularly when repeatable:
> > what do dozens of changes per day do to the various documents we
> > have? - I suspect slow bit-rot, with steps of improved perf., but
> > ... ;-) Perhaps publishing the graph will change that.
>
> --
> michael.me...@suse.com <><, Pseudo Engineer, itinerant idiot
_______________________________________________
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice