Hi All,

I'm submitting some review requests that make performance improvements to
the existing prefetchers.  Basically, Amin and I realized over time that
the existing gem5 prefetchers are pretty bad if you actually try to use
them as an L1 cache prefetcher.

There are 4 patches in total.

Patch #1 - Started by Amin.  The existing gem5 prefetchers mask off the
lower address bits of an access (they only see block addresses).  This is
bad if your stride is not a multiple of the cacheline size (or is just
slightly smaller than a cacheline).  This patch lets the prefetcher work
with unmasked addresses.  It additionally tags prefetches with a PC, so
lower-level prefetchers can prefetch in response to prefetch requests.  This
results in pipelined prefetches from lower- to higher-level caches, as in
the Power4.  ~5% performance benefit over the existing prefetchers on SPEC.
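To illustrate the masking problem, here is a minimal sketch (not gem5 code; `Addr`, `blockAlign`, `observedDeltas` and `strideIsStable` are hypothetical names, and a 64-byte line is assumed).  It compares the deltas a stride detector would see for a 48-byte stride on raw byte addresses versus block-aligned addresses.  The PC-tagging part of the patch is not modeled.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Addr = std::uint64_t;

constexpr Addr kBlockSize = 64;  // assumed cache-line size in bytes

// Hypothetical helper: align an address down to its cache block.
constexpr Addr blockAlign(Addr a) { return a & ~(kBlockSize - 1); }

// Deltas a stride prefetcher would observe between consecutive accesses,
// computed either on raw byte addresses or on block-aligned addresses.
std::vector<std::int64_t> observedDeltas(const std::vector<Addr>& accesses,
                                         bool maskToBlock) {
    std::vector<std::int64_t> deltas;
    for (std::size_t i = 1; i < accesses.size(); ++i) {
        Addr prev = maskToBlock ? blockAlign(accesses[i - 1]) : accesses[i - 1];
        Addr cur = maskToBlock ? blockAlign(accesses[i]) : accesses[i];
        deltas.push_back(static_cast<std::int64_t>(cur - prev));
    }
    return deltas;
}

// True if every observed delta is identical, i.e. a simple stride entry
// would stay trained for the whole sequence.
bool strideIsStable(const std::vector<std::int64_t>& deltas) {
    for (std::size_t i = 1; i < deltas.size(); ++i)
        if (deltas[i] != deltas[0])
            return false;
    return true;
}
```

With accesses at 0, 48, 96, 144, 192, 240, the raw deltas are all 48, while the block-aligned deltas come out as 0, 64, 64, 64, 0, so a detector that waits for a repeated delta keeps losing confidence.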

Patch #2 - Do tagged prefetching for IFETCH accesses to caches with
stride-based prefetchers.  The existing stride-based prefetcher is pretty
useless for instruction accesses when used at an L2 cache, so it's better
off just doing an N-block-ahead prefetch.
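Tagged prefetching here means the classic scheme: prefetch the next N blocks on a demand miss, and again on the first demand hit to a block that was brought in by a prefetch.  A minimal sketch, assuming a 64-byte line; `TaggedNextNPrefetcher`, `onIfetch` and `degree` are hypothetical names, and it does not check whether targets are already cached or in flight (a real implementation would filter those).

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

using Addr = std::uint64_t;

constexpr Addr kBlockSize = 64;  // assumed cache-line size in bytes

// Hypothetical tagged N-block-ahead prefetcher for instruction fetches.
class TaggedNextNPrefetcher {
  public:
    explicit TaggedNextNPrefetcher(unsigned degree) : degree_(degree) {}

    // Called for each IFETCH; `hit` says whether the block was already
    // cached.  Returns the block addresses to prefetch.
    std::vector<Addr> onIfetch(Addr addr, bool hit) {
        const Addr blk = addr & ~(kBlockSize - 1);
        // The tag is cleared on the first reference to a prefetched block.
        const bool wasTagged = tagged_.erase(blk) > 0;
        std::vector<Addr> targets;
        if (!hit || wasTagged) {
            for (unsigned i = 1; i <= degree_; ++i) {
                const Addr target = blk + i * kBlockSize;
                tagged_.insert(target);  // mark as prefetched, not yet used
                targets.push_back(target);
            }
        }
        return targets;
    }

  private:
    unsigned degree_;
    std::unordered_set<Addr> tagged_;  // prefetched blocks not yet referenced
};
```

With a degree of 2, a miss at 0x1000 prefetches 0x1040 and 0x1080; the first hit on 0x1040 then pushes the window to 0x1080 and 0x10c0, and further hits in 0x1040 issue nothing.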

Patch #3 - Add some extra tolerance to the stride-based prefetcher.  Say
you have code like...

for (int i=0; i<10; i++) A[i] = B[i] + C[i];

If the base addresses of A, B, or C change across two executions of this
code, the prefetcher has to retrain, even though the stride is still
constant.  This patch avoids that retraining.  It gains an additional 2%
performance on SPECFP (geomean).
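One plausible way to get this tolerance is hysteresis on the per-PC confidence counter: on a delta mismatch, a confident entry keeps its learned stride and only rebases its last address (paying one confidence point), instead of immediately adopting the new delta.  This is a sketch under that assumption, not necessarily what the patch does; `StrideTable`, its parameters, the PC value and the 4-byte element stride are illustrative.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

using Addr = std::uint64_t;

// Hypothetical per-PC stride table with optional tolerance to base changes.
class StrideTable {
  public:
    StrideTable(int threshold, int maxConfidence, bool tolerant)
        : threshold_(threshold), maxConfidence_(maxConfidence),
          tolerant_(tolerant) {}

    // Train on an access; return a prefetch address once confident.
    std::optional<Addr> access(Addr pc, Addr addr) {
        auto [it, inserted] = table_.try_emplace(pc, Entry{addr, 0, 0});
        if (inserted)
            return std::nullopt;
        Entry& e = it->second;
        auto delta = static_cast<std::int64_t>(addr - e.lastAddr);
        if (delta == e.stride) {
            if (e.confidence < maxConfidence_)
                ++e.confidence;
        } else if (tolerant_ && e.confidence > 0) {
            // Assume a base-address jump: keep the learned stride and just
            // rebase on the new address, losing one confidence point.
            --e.confidence;
        } else {
            // Strict (or unconfident) case: adopt the new delta and retrain.
            e.stride = delta;
            e.confidence = 0;
        }
        e.lastAddr = addr;
        if (e.confidence >= threshold_)
            return addr + e.stride;
        return std::nullopt;
    }

  private:
    struct Entry {
        Addr lastAddr;
        std::int64_t stride;
        int confidence;
    };
    int threshold_;
    int maxConfidence_;
    bool tolerant_;
    std::unordered_map<Addr, Entry> table_;
};
```

Running the 10-iteration loop twice with a different base for the second run, the tolerant table in this sketch prefetches on all 10 accesses of the second run, while the strict one issues only 7 because it has to relearn the stride first.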


Patch #4 - The existing gem5 prefetchers all train on speculative
accesses.  This can be bad.  This patch makes the L1 prefetcher train only
on committed accesses, not speculative ones.  Speculative accesses still
generate prefetches, though.  This avoids problems when the branch
predictor performs poorly (leading to multiple issues of the same load),
even though the access pattern itself is regular.
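Since patch #4 isn't posted yet, this is only a sketch of the idea as described above: the stride table is consulted (and prefetches are generated) at issue time, but updated only at commit.  `CommitTrainedStride`, `onIssue` and `onCommit` are hypothetical names, and the table is a plain per-PC stride table.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

using Addr = std::uint64_t;

// Hypothetical stride prefetcher split into an issue-time lookup and a
// commit-time training step.
class CommitTrainedStride {
  public:
    explicit CommitTrainedStride(int threshold) : threshold_(threshold) {}

    // Speculative access at issue: may generate a prefetch, never trains.
    std::optional<Addr> onIssue(Addr pc, Addr addr) const {
        auto it = table_.find(pc);
        if (it == table_.end() || it->second.confidence < threshold_)
            return std::nullopt;
        return addr + it->second.stride;
    }

    // Committed access: the only place the table is updated.
    void onCommit(Addr pc, Addr addr) {
        auto [it, inserted] = table_.try_emplace(pc, Entry{addr, 0, 0});
        if (inserted)
            return;
        Entry& e = it->second;
        auto delta = static_cast<std::int64_t>(addr - e.lastAddr);
        if (delta == e.stride) {
            if (e.confidence < kMaxConfidence)
                ++e.confidence;
        } else {
            e.stride = delta;
            e.confidence = 0;
        }
        e.lastAddr = addr;
    }

  private:
    static constexpr int kMaxConfidence = 3;
    struct Entry {
        Addr lastAddr;
        std::int64_t stride;
        int confidence;
    };
    int threshold_;
    std::unordered_map<Addr, Entry> table_;
};
```

Training on the committed stream 0x1000, 0x1004, 0x1008, 0x100c makes the table predict 0x1014 for an issue at 0x1010, no matter how often that load is replayed.  Feeding it the issue-order stream instead, with the load replayed once (0x1000, 0x1004, 0x1004, 0x1008, 0x100c), resets the stride on the duplicate and leaves the entry below threshold.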



I'm still working on creating the separate patch for #4, but here are full
benchmark results for the first 3 patches.

The configuration used for the system is here (it's similar to Haswell):
http://pastebin.com/90rYgJC3

Here are the benchmark results for patch 1, 2, and 3.  All results are
normalized against the same configuration file with gem5's existing
prefetcher code.

SPECINT: http://i.imgur.com/RPIvz4Z.png
SPECFP: http://i.imgur.com/bIij938.png


I'll post patch 4 and its results when I've separated it out from the rest
of my code.
_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
