On Thu, 27 Jan 2011, Steve Reinhardt wrote:

On Thu, Jan 27, 2011 at 4:36 AM, Nilay Vaish <ni...@cs.wisc.edu> wrote:

I tried caching the index for the MRU block, so that the hash table need
not be looked up. It is hard to tell whether there is a speedup or not.
When I run m5.prof, the profile results show that the time taken by
CacheMemory::lookup() drops from ~5.5% to ~4%. But when I run m5.fast,
the time of execution increases by 2%.

Earlier, when we had the same address being looked up multiple times in
succession, this change made sense. Right now, I am uncertain whether
this change improves performance.



Hi Nilay,

Are you talking about caching the MRU block for the entire cache?  I agree
that that isn't likely to help much now that we've eliminated the redundant
lookup() calls.

I had a single MRU block per cache in my experiments.

I was thinking of a conventional set-associative cache
lookup (a 2D array, or a 1D array of sets where each set is represented by a
linked list) where the MRU block of each set is moved to the head of the
list.  For an L1 cache in particular, most of the accesses to each set will
be to the MRU block within that set, so even though you have a linear search
within the set, you will often only have to do one comparison.  See
accessBlock() in mem/cache/tags/lru.cc for an example of this.  And even if
you don't hit the MRU block, most L1s have relatively small associativity so
a linear search is not that bad.  I have a hard time believing that this
wouldn't be substantially faster than a hash table lookup.
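
For concreteness, here is a minimal standalone sketch of that MRU-first
lookup (the Block and CacheSet types here are hypothetical placeholders,
not the actual classes from mem/cache/tags/lru.cc): each set is a small
list searched linearly, and a hit is moved to the front so the next
access to that set usually matches on the first comparison.

    #include <cstdint>
    #include <vector>

    struct Block {
        uint64_t tag;
        bool valid;
    };

    // One set of a set-associative cache, kept in MRU order: the block
    // at index 0 is the most recently used.
    struct CacheSet {
        std::vector<Block> blks;

        // Linear search over the set; on a hit, move the block to the
        // front so the common case (re-accessing the MRU block) costs
        // a single tag comparison.
        Block *accessBlock(uint64_t tag) {
            for (size_t i = 0; i < blks.size(); ++i) {
                if (blks[i].valid && blks[i].tag == tag) {
                    if (i != 0) {
                        Block hit = blks[i];
                        blks.erase(blks.begin() + i);
                        blks.insert(blks.begin(), hit);
                    }
                    return &blks[0];
                }
            }
            return nullptr;  // miss
        }
    };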

For caches that are more highly associative, or L2/L3 caches where the
probability of hitting the MRU block is not as high, or for snoop lookups
where you are typically searching the whole set to verify that a block is
*not* there, then a hash table *may* make more sense.  (Though even in that
case I wonder whether a whole-cache hash table vs. per-set hash tables is a
better idea.)  But these should all be much less frequent than L1 hit
lookups, so I would think the set-associative lookup would still be faster
overall.  The optimal strategy in the long run may be to have both types of
lookup implemented and to choose the right one at configuration time based
on the cache's associativity and position... we sort of tried to do this in
the original M5 code with the LRU and FALRU tags, but didn't really push the
concept all the way to figuring out the optimal crossover point.
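
A rough sketch of what that configuration-time choice might look like
(the names and the crossover value here are hypothetical guesses, not
anything we have measured):

    #include <cstddef>

    // Which lookup strategy a tag store should use.
    enum class LookupKind { LinearMRU, HashTable };

    // Pick a strategy from the cache's associativity and its position
    // in the hierarchy. The crossover value of 8 is a placeholder; the
    // real crossover point would have to be determined by profiling.
    LookupKind chooseLookup(std::size_t assoc, bool isL1) {
        if (isL1 && assoc <= 8)
            return LookupKind::LinearMRU;  // MRU hits dominate, sets are small
        return LookupKind::HashTable;      // high assoc / frequent full-set probes
    }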


I also have an implementation that performs a linear search of cache sets
instead of a hash table lookup. Again, I saw a small improvement. But as
you mentioned, as the associativity goes up, linear search will perform
worse. Profile results showed that the linear search takes about 100 ns
and the hash table lookup about 125 ns, with associativity 2 for the L1
caches. Maybe someone should independently profile m5.
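
For reference, the hash-table variant I was comparing against is
essentially the following (a simplified sketch using std::unordered_map;
the real CacheMemory uses its own table and entry types):

    #include <cstdint>
    #include <unordered_map>

    // Whole-cache table: one map from block address to the index of
    // the matching cache entry.
    struct CacheTable {
        std::unordered_map<uint64_t, int> index;

        // Returns the entry index on a hit, or -1 on a miss.
        int lookup(uint64_t blockAddr) const {
            auto it = index.find(blockAddr);
            return (it == index.end()) ? -1 : it->second;
        }
    };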

My mention above about snooping being a bad case for the set-associative
linear-search lookup makes me wonder about the double icache/dcache lookup
you mentioned... that could be part of the problem, and it's not necessary.
There should be a real coherence protocol between the two caches, so that
you only have to snoop the other cache in the same situations where you'd
have to snoop another core's data cache... i.e., if either cache has an
exclusive copy, or is only reading a shared copy, there's no point in doing
that second lookup.  Is there any notion in the protocol of having different
protocol states for the icache and dcache?


Both the icache and the dcache have the same set of states. But now that you have mentioned it, I think it is possible to do away with most of the double lookups. We maintain the invariant that caches under a single controller hold disjoint sets of cache blocks. In this case, the icache and dcache come under the same cache controller. Currently, for an instruction fetch, we check the dcache first. But this is obviously what we should not be doing: in the common case (unless, of course, we simulate self-modifying code often enough), the dcache lookup will report a miss and the icache lookup will report a hit. So we should look up the icache first. In case of a hit, we know by our invariant that the dcache does not have the line. If we miss, then we should look up the dcache, as in the sketch below. We should adopt a similar strategy for loads and stores as well.
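
Something like this (CacheMemory and AbstractCacheEntry stand in for the
real Ruby classes here; lookup() is assumed to return the entry on a hit
and NULL on a miss):

    // Hypothetical sketch of the proposed lookup order for fetches.
    AbstractCacheEntry *lookupForFetch(CacheMemory &icache,
                                       CacheMemory &dcache, Addr addr)
    {
        // Common case: fetches hit in the icache. By the disjointness
        // invariant, an icache hit means the dcache cannot have the line.
        if (AbstractCacheEntry *entry = icache.lookup(addr))
            return entry;
        // Rare case (e.g. self-modifying code): fall back to the dcache.
        return dcache.lookup(addr);
    }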

We can split them under different controllers, but that would mean a lot more work.

I will get back to you on this soon.

--
Nilay
_______________________________________________
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev
