I'm not the guy to ask for that... but actually I doubt the protocol itself matters that much; you just need to look at the code path that gets exercised on an L1 cache hit and see where the calls are. That part should be almost, if not entirely, independent of the coherence protocol.
Steve

On Tue, Nov 16, 2010 at 7:13 AM, Nilay Vaish <[email protected]> wrote:
> I profiled M5 using the MOESI_CMP_directory and MOESI_CMP_token protocols.
> isTagPresent() dominates in both of these protocols as well, but the
> percentage of simulation time used is lower (varies from 10% to 16%). I
> will take a look at the assembly code for the direct cache set indexing
> approach.
>
> In order to reduce the number of calls made to tag lookup, I would need
> to read about the protocol itself. Can you point me to some documentation
> on the MOESI_hammer protocol?
>
> --
> Nilay
>
> On Fri, 12 Nov 2010, Steve Reinhardt wrote:
>
>> On Fri, Nov 12, 2010 at 1:10 PM, Nilay Vaish <[email protected]> wrote:
>>
>>> I tried a couple of ideas for improving the performance of the
>>> findTagInSet() function. These include having one hash table per cache
>>> set and replacing the hash table with a two-dimensional array indexed
>>> using <cache set, cache way>. Neither of these ideas showed a
>>> significant enough change in the percentage of time taken by the
>>> isTagPresent() function to come to a definite conclusion.
>>>
>>> I looked at the assembly code generated for the isTagPresent()
>>> function. Note that the compiler inlines the function findTagInSet().
>>> The assembly code is about 100 lines long. Again, the find function
>>> for std::hash_map gets inlined. There is one division operation in the
>>> code and several load operations. If we were to assume that the hash
>>> function is able to keep the average occupancy of each bucket close to
>>> 1, then I think the time taken by the function would be determined by
>>> the time taken by the loads. It might be that a lot of loads end up
>>> missing in the cache.
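For reference, the direct <set, way> indexing idea discussed above could look roughly like the sketch below. This is a hypothetical illustration, not M5's actual TagArray/CacheMemory code; all class and member names are made up. The point is that a lookup touches only the ways of one set, laid out contiguously, so the loads are few and cache-friendly compared to a hash_map probe:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical direct-indexed tag array: replaces the per-set hash table
// with a flat 2-D array indexed by <set, way>.
class TagArray {
  public:
    TagArray(int numSets, int numWays)
        : numSets(numSets), numWays(numWays),
          tags(numSets * numWays, 0), valid(numSets * numWays, false) {}

    // Linear scan over the ways of one set.
    // Returns the way index on a hit, or -1 on a miss.
    int findTagInSet(int set, uint64_t tag) const {
        for (int way = 0; way < numWays; ++way) {
            int idx = set * numWays + way;
            if (valid[idx] && tags[idx] == tag)
                return way;
        }
        return -1;
    }

    bool isTagPresent(int set, uint64_t tag) const {
        return findTagInSet(set, tag) >= 0;
    }

    void insert(int set, int way, uint64_t tag) {
        int idx = set * numWays + way;
        tags[idx] = tag;
        valid[idx] = true;
    }

  private:
    int numSets, numWays;
    std::vector<uint64_t> tags;
    std::vector<bool> valid;
};
```

With a small associativity (say 4 ways) this is a handful of compares and loads per lookup, which is why one might expect it to beat the ~100-instruction hash_map path; as noted above, cache misses on the tag loads can still dominate either way.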
>>
>> I'm a little surprised that the direct cache set indexing approach was
>> not faster, since I'd think that would be far less than 100
>> instructions, but you're right that issues like whether loads hit or
>> miss in the cache will have a large impact.
>>
>>> As far as reordering the tags is concerned, since the hash_map is not
>>> directly under our control, we would have to delete the tag entry in
>>> the hash table and insert it again to make sure that it is the first
>>> entry that is searched.
>>
>> The reordering only makes sense if we replace the hash table with a
>> more conventional tag array.
>>
>>> Right now I am profiling with MOESI_hammer as the coherence protocol.
>>> I am thinking of profiling using a different protocol to make sure
>>> that this is not an artifact of the protocol in use.
>>
>> That sounds like a good idea.
>>
>> All in all, we would ideally like to both speed up individual calls and
>> reduce the number of calls. IIRC, gprof indicated that findTagInSet()
>> was called 4-5x more frequently than there were cache accesses, which
>> makes no sense to me; it seems like a typical cache hit should only
>> require a single tag lookup.
>>
>> Another thing to keep in mind is that typical programs have very high
>> cache hit rates, so another approach is to look at what happens in the
>> process of servicing an L1 cache hit and optimize that path as much as
>> possible.
>>
>> Steve
>>
> _______________________________________________
> m5-dev mailing list
> [email protected]
> http://m5sim.org/mailman/listinfo/m5-dev
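The MRU reordering idea above only pays off once the hash table is replaced by a plain tag array. A minimal sketch, assuming a per-set vector kept in most-recently-used order (names hypothetical, not the M5 implementation):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical cache set that keeps its tags ordered from most- to
// least-recently used, so a hit on a hot line usually succeeds on the
// very first comparison.
struct CacheSet {
    std::vector<uint64_t> tags;  // index 0 == most recently used

    // Linear search; on a hit, rotate the matching tag to the front.
    bool isTagPresent(uint64_t tag) {
        auto it = std::find(tags.begin(), tags.end(), tag);
        if (it == tags.end())
            return false;
        // Move the hit entry to position 0, shifting the others right.
        std::rotate(tags.begin(), it, it + 1);
        return true;
    }
};
```

Since temporal locality makes repeated hits to the same line common, the expected number of comparisons per lookup approaches one, which is the effect the delete-and-reinsert trick was trying to fake on top of hash_map.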
_______________________________________________
m5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/m5-dev
