Over the next couple of days I'll work on automating the production of a graph of this "magic number".
Also, I'm doing repeat testing now to get a feel for the spread in the numbers that cachegrind produces.

Stephan

2012/12/5 Michael Meeks <michael.me...@suse.com>
> Hi guys,
>
> Forwarding Josef's rather interesting mail (with permission) to the dev
> list - since it's generally useful :-) Stephan is working on plugging a
> nice callgrinding script into tinderbox building so we can build an
> automated performance tracker that will catch even small - 0.5% type -
> performance regressions which can accumulate over time - and do it in a
> reproducible way, independent of the underlying hardware.
>
> Thanks so much for working on this Stephan; it seems the punch-line
> formula for the magic number should be:
>
> > Actually, what I have in recent versions of {Q,K}Cachegrind is this:
> > CEst = Ir + 10 Bm + 10 L1m + 20 Ge + 100 L2m + 100 LLm
>
> ATB!
>
> Michael.
>
> -------- Forwarded Message --------
>
> Hi,
>
> On 04.12.2012 15:07, Julian Seward wrote:
> >
> > Josef, greetings. See below ..
> >
> > Michael: Josef is the expert on this, so forwarding it to him.
> >
> > It used to be "1 * #insns + 10 * #L1 misses + 100 * #L2 misses",
> > but I wouldn't put too much reliance on it. It was just something
> > that (IIRC!) NickN and I kludged up during a coffee break probably
> > about a decade ago at cl.cam.ac.uk. Like so many other bits of
> > Valgrind :-) (*)
>
> AFAIK, Cachegrind always just printed out the raw event numbers.
> Also, nowhere in Callgrind will you find that magic formula ;-)
> I think I was actually the one who came up with that formula as
> a rough "cycle estimation" in KCachegrind ;-)
>
> Actually, doing a somewhat accurate time estimation is quite tricky.
> The rule of thumb should be: if you come up with an optimization
> reducing miss counts (and thus the time estimation formula above),
> always also check it against real time.
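The old rule-of-thumb estimate Julian mentions maps directly onto the event counters cachegrind prints. As a hedged illustration (the event names follow cachegrind's output format, but the counts below are invented for the example):

```python
# Sketch of the old "coffee-break" estimate:
#   cycles ~= 1 * #insns + 10 * #L1_misses + 100 * #L2_misses
# Event names (Ir, I1mr, D1mr, ...) follow cachegrind's summary output;
# the counts in `events` are made up for illustration.

def old_cycle_estimate(ev):
    l1_misses = ev["I1mr"] + ev["D1mr"] + ev["D1mw"]   # all first-level misses
    l2_misses = ev["ILmr"] + ev["DLmr"] + ev["DLmw"]   # all last-level misses
    return ev["Ir"] + 10 * l1_misses + 100 * l2_misses

events = {"Ir": 1_000_000, "I1mr": 500, "D1mr": 8_000, "D1mw": 1_500,
          "ILmr": 50, "DLmr": 900, "DLmw": 50}
print(old_cycle_estimate(events))  # 1200000
```

Note how even 1,000 last-level misses (0.1% of instructions) contribute as much to the estimate as 100,000 L1 misses, which is exactly why the weighting is so sensitive to the penalty constants chosen.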
>
> In general, any estimation using cache metrics gets more accurate the
> worse the cache behavior of the application is, as the time needed for
> memory accesses is more predictable and hides any smaller latencies
> within a core, especially with out-of-order execution - and even ARM
> application cores are out-of-order nowadays (Cortex A9/A15).
>
> The most accurate thing next to real measurement would be cycle-accurate
> simulation, which of course is out of scope. There is some nice work on
> a simpler, yet quite accurate processor timing model using so-called
> "interval simulation", as implemented in the Sniper simulator (see
> http://snipersim.org). However, it's roughly an order of magnitude
> slower than Cachegrind/Callgrind.
>
> The biggest issues with the simple formulas above are:
> (1) Hardware prefetching is not taken into account, which often allows
> streaming of data at maximum bandwidth from memory.
> (2) Latency-hiding effects are not included; e.g. it can easily happen
> that, with lots of memory accesses, it does not matter whether the code
> was actually compiled with -O3 or -O0. Further, it can make a huge
> difference whether two consecutive memory accesses depend on each other
> or not (i.e. one hiding the latency of the other or not).
> (3) Saturation effects are not taken into account. E.g. if all cores on
> a multicore chip do memory accesses at the same time, the limited
> bandwidth going off-chip will slow down each of them.
>
> Thus, the accuracy of the simple formula very much depends on "average"
> application behavior, in the hope that the difficult effects mentioned
> above do not each play too much of a role.
>
> BTW, one can argue whether L1/LL misses on write should appear in the
> formula at all. Writes are usually done by the hardware in the
> background. However, if there is a memcpy operation doing a lot of
> consecutive writes, write buffers will fill up and throttle execution
> in the end.
>
> > That said, I am sure it got checked by Josef :-)
>
> Despite the above issues, I still think that it is useful to combine
> the cache simulation results into one value. It makes it easier to get
> an overview of the cache behavior.
>
> The thing is: to make time estimation more accurate, you need to extend
> the model, which results in slower simulation. I have some ideas to
> make the cache model more accurate for time estimation, but I am not
> sure it is worth it (real measurements are faster anyway).
>
> E.g. there is an optional hardware prefetch simulation in Callgrind.
> We can introduce a new event "LL miss detected by prefetcher" and give
> it e.g. a much lower penalty, depending on the bandwidth your hardware
> can do: e.g. if you can do 1 GB/s on a 1 GHz processor, this means 1
> byte on average per clock, thus with a 64-byte line size one line can
> be fetched every 64 cycles. As every LL miss will also be an L1 miss
> (with a penalty of 10), you would go for around 50 as the penalty for
> the new prefetchable memory access event.
>
> For latency effects, it gets more complicated, as you want to check
> whether there is a dependency between memory accesses, and do something
> similar to the above-mentioned interval simulation.
>
> For the saturation effects, one needs to take simultaneous execution of
> multiple threads into account.
>
> > Josef: is there a more recent version of the pseudo-cycles formula?
> > Also, maybe we can incorporate the number of global bus locks into
> > the formula? Or does the cost of a bus lock depend also on which
> > cache the selected line is in?
>
> Actually, what I have in recent versions of {Q,K}Cachegrind is this:
>
> CEst = Ir + 10 Bm + 10 L1m + 20 Ge + 100 L2m + 100 LLm
>
> L2m is there for backward compatibility with old VG versions. If a
> given event type is not measured, the formula gets truncated
> accordingly.
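Josef's CEst formula, together with its truncation rule (drop any term whose event type was not measured), could be implemented along these lines. This is a sketch, not {Q,K}Cachegrind's actual code; only the event names and weights come from the mail above:

```python
# Sketch of the {Q,K}Cachegrind cycle estimation from the mail:
#   CEst = Ir + 10 Bm + 10 L1m + 20 Ge + 100 L2m + 100 LLm
# If an event type was not measured, its term is simply dropped
# ("the formula gets truncated accordingly").

CEST_WEIGHTS = {"Ir": 1, "Bm": 10, "L1m": 10, "Ge": 20, "L2m": 100, "LLm": 100}

def cycle_estimate(measured):
    """measured: dict of event name -> count; unmeasured events are skipped."""
    return sum(w * measured[ev]
               for ev, w in CEST_WEIGHTS.items() if ev in measured)

# A run without branch or lock simulation (no Bm, no Ge) on a modern
# Valgrind (LLm rather than L2m); counts are invented for illustration.
print(cycle_estimate({"Ir": 1_000_000, "L1m": 10_000, "LLm": 1_000}))  # 1200000
```

A hypothetical "prefetchable LL miss" event, as discussed above, would just be one more entry in `CEST_WEIGHTS` with a weight of around 50, following the 1 byte/clock and 64-byte-line arithmetic in the mail.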
>
> > I mention this because I remember that at some point the LO crew were
> > complaining that the cycle count from Callgrind was completely
> > misleading due to not taking global bus locks into account.
>
> Hm, ok. The "20 Ge" estimates that locking can often be resolved at the
> L2/L3 level. However, I never did any measurements myself. I think you
> mentioned something ...
>
> Josef
>
> > J
> >
> > (*) E.g. the first version of the memcheck leak checker was written
> > during a 4-hour plane-changing stop at Chicago O'Hare.
> >
> > ---------- Forwarded Message ----------
> >
> > Subject: Re: [Fwd: Re: callgrind / profiling bits ? ...]
> > Date: Tuesday, December 04, 2012, 12:35:15 am
> > From: Michael Meeks <michael.me...@suse.com>
> ...
> > Julian - incidentally - where does the latest/greatest set of
> > fudge-factors for predicting pseudo-cycles from L1 / L2 misses etc.
> > live? :-) We'd like to use it in a spreadsheet we want to use to
> > track incremental LibreOffice performance (using callgrind, of
> > course).
> ...
> > At least from my perspective, I'd -love- to see what impact simple
> > day-to-day work has on performance, particularly when repeatable:
> > what do dozens of changes per day do to the various documents we
> > have? - I suspect slow bit-rot, with steps of improved perf., but
> > ... ;-) Perhaps publishing the graph will change that.
>
> --
> michael.me...@suse.com <><, Pseudo Engineer, itinerant idiot
_______________________________________________
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice