Hi All,

I'm submitting some review requests that make performance improvements to the existing prefetchers. Basically, Amin and I realized over time that the existing gem5 prefetchers are pretty bad if you actually try to use them as an L1 cache prefetcher.
There are 4 patches in total.

Patch #1 - Started by Amin. The existing gem5 prefetchers mask off the lower address bits of an access (they only see block addresses). This is bad if your stride is not well-aligned with a cacheline (or is just slightly smaller than a cacheline). This patch lets the prefetcher work with un-masked addresses. It additionally tags prefetches with a PC, so lower level prefetchers can prefetch in response to prefetch requests (this results in a pipelining of prefetches from lower to higher level caches, like in the Power4). ~5% perf benefit over existing prefetchers on SPEC.

Patch #2 - Do tagged prefetching for IFETCH accesses to caches with stride-based prefetchers. The existing stride-based prefetcher is pretty useless for instruction accesses if done at an L2 cache, so it's better off just doing an N-block-ahead prefetch.

Patch #3 - Add some extra tolerance to the stride-based prefetcher. Say you have code like...

    for (int i=0; i<10; i++)
        A[i] = B[i] + C[i];

If the base addresses of A, B, or C change between 2 executions of this code, the prefetcher has to retrain, even though the stride is still constant. This patch avoids this. Gains an additional 2% perf on SPECFP (geomean).

Patch #4 - The existing gem5 prefetchers all train from speculative accesses. This can be bad. This patch makes the L1 prefetcher train only on committed accesses, not speculative accesses. Speculative accesses still generate prefetches though. This avoids issues where the branch predictor is performing poorly (leading to multiple issues of the same load), even though the access pattern is regular.

I'm still working on creating the separate patch for #4, but here are full benchmark results for the first 3 patches. The configuration used for the system is here (it's similar to Haswell): http://pastebin.com/90rYgJC3

Here are the benchmark results for patches 1, 2, and 3.
All results are normalized against the same configuration file with gem5's existing prefetcher code.

SPECINT: http://i.imgur.com/RPIvz4Z.png
SPECFP: http://i.imgur.com/bIij938.png

I'll post patch 4 and its results when I've separated it out from the rest of my code.
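
P.S. To make the motivation for patch #1 concrete, here is a small standalone sketch (not code from the patch, and not gem5's prefetcher API). It assumes 64-byte cachelines and a 48-byte stride, and the helper names blockAddr, makeAccesses and hasConstantStride are made up for illustration. It shows that a stride slightly smaller than a cacheline looks irregular once the low address bits are masked off.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption for illustration: 64-byte cachelines.
constexpr std::uint64_t kBlockSize = 64;

// Hypothetical helper: what a block-address-only prefetcher would see.
constexpr std::uint64_t blockAddr(std::uint64_t addr)
{
    return addr & ~(kBlockSize - 1);
}

// Hypothetical helper: n accesses starting at base, strideBytes apart.
// When masked is true, each address is reduced to its block address.
std::vector<std::uint64_t> makeAccesses(std::uint64_t base,
                                        std::uint64_t strideBytes,
                                        std::size_t n, bool masked)
{
    std::vector<std::uint64_t> addrs;
    addrs.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint64_t addr = base + i * strideBytes;
        addrs.push_back(masked ? blockAddr(addr) : addr);
    }
    return addrs;
}

// True if every consecutive pair of addresses differs by the same delta.
// Needs at least three addresses (two deltas) to call a stride constant.
bool hasConstantStride(const std::vector<std::uint64_t>& addrs)
{
    if (addrs.size() < 3)
        return false;
    const std::int64_t stride =
        static_cast<std::int64_t>(addrs[1] - addrs[0]);
    for (std::size_t i = 2; i < addrs.size(); ++i) {
        const std::int64_t delta =
            static_cast<std::int64_t>(addrs[i] - addrs[i - 1]);
        if (delta != stride)
            return false;
    }
    return true;
}
```

With a base of 0x1000 and a 48-byte stride, the un-masked deltas are all 48, while the masked deltas come out as 0, 64, 64, 64, 0, so a simple stride detector never sees a stable pattern. A 64-byte stride stays constant in both views, which is why the problem only shows up for strides that don't line up with the cacheline size.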
