My main reaction is that we shouldn't rush into this. As long as we have a solution that works for now, there are probably many more important things to work on. Once we have all the other pieces in place to make a usable parallel simulator, then we can worry about performance optimizations such as better handling of the decode cache.
My secondary reaction is that the only potential downside to a globally shared cache is the cost of acquiring a lock on every read access. In the long run, writes should be pretty rare, so the cost of updates should be largely irrelevant. If we can come up with a lock-free way of doing updates, then there is no downside to a globally shared cache. Thus, when we do get to the point of wanting to optimize the decode cache, I think the first order of business is to try to find a way to do lock-free updates. If we're successful (and I expect we will be), then there's no reason to consider any other organization.

Steve

On Sat, Feb 9, 2013 at 7:01 AM, Nilay <[email protected]> wrote:
> We need to decide on how we want to handle the decode cache. I can think
> of the following three ways --
>
> 1. Per decoder cache: needs most space, hence more cache misses and low
> performance.
>
> 2. Per thread cache: less space than above, so fewer cache misses
> (hopefully). But TLS variables have access costs. Seems like it would take
> at least two more instructions per access (on x86-64), more depending on
> how badly the compiler performs in analyzing the use of the variable. An
> added advantage might be that single thread simulations would not be hurt
> at all.
>
> 3. Global cache: least space, so should have the fewest cache misses. But
> requires protection of a lock. The costs will be several usual
> instructions + one atomic instruction (should result in some coherency
> overhead) even if the lock is not contended for (unlikely). Would require
> extra code if we are to avoid hurting single thread simulation
> performance. Some RCU-type implementation might be possible as well.
>
> In my opinion the size of the cache should decide which way to go. If it
> is less than 100 KB or so, per simulated cpu variable seems fine to me,
> TLS if about 500 KB, and global variable above that.
>
> --
> Nilay
>
> _______________________________________________
> gem5-dev mailing list
> [email protected]
> http://m5sim.org/mailman/listinfo/gem5-dev
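
As an illustration of the lock-free update idea raised in this thread, here is a minimal sketch of a globally shared, insert-only cache whose readers never take a lock. It is not gem5's actual decode cache: `DecodedInst`, `SharedDecodeCache`, the bucket count, and the hash function are all hypothetical names and values chosen for the example. It assumes entries are immutable once published and are never removed while the simulation runs, which sidesteps memory reclamation entirely.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a decoded instruction; not gem5's real type.
struct DecodedInst {
    std::uint32_t machInst;  // raw instruction bits, used as the key
    int opClass;             // placeholder payload
};

// Insert-only hash table shared by all threads (assumed design).
// Readers use only atomic loads; writers publish with one CAS.
class SharedDecodeCache {
    struct Node {
        DecodedInst inst;
        Node *next;  // written before publication, read-only afterwards
    };

    static constexpr std::size_t NumBuckets = 1024;  // arbitrary size
    std::atomic<Node *> buckets[NumBuckets];

    static std::size_t index(std::uint32_t machInst) {
        // Simple multiplicative hash; any reasonable hash would do.
        return (machInst * 2654435761u) % NumBuckets;
    }

  public:
    SharedDecodeCache() {
        for (auto &b : buckets)
            b.store(nullptr, std::memory_order_relaxed);
    }

    ~SharedDecodeCache() {
        // Only safe once no other thread is using the cache.
        for (auto &b : buckets) {
            Node *n = b.load(std::memory_order_relaxed);
            while (n) {
                Node *next = n->next;
                delete n;
                n = next;
            }
        }
    }

    SharedDecodeCache(const SharedDecodeCache &) = delete;
    SharedDecodeCache &operator=(const SharedDecodeCache &) = delete;

    // Read path: no lock and no atomic read-modify-write.
    const DecodedInst *lookup(std::uint32_t machInst) const {
        Node *n = buckets[index(machInst)].load(std::memory_order_acquire);
        for (; n; n = n->next) {
            if (n->inst.machInst == machInst)
                return &n->inst;
        }
        return nullptr;
    }

    // Write path (expected to be rare): push a new node onto the bucket
    // chain with compare-and-swap, retrying if another writer got in first.
    const DecodedInst *insert(const DecodedInst &inst) {
        std::atomic<Node *> &head = buckets[index(inst.machInst)];
        Node *node = new Node{inst, head.load(std::memory_order_acquire)};
        // On failure, compare_exchange_weak reloads the current head
        // into node->next, so the loop body can stay empty.
        while (!head.compare_exchange_weak(node->next, node,
                                           std::memory_order_acq_rel,
                                           std::memory_order_acquire)) {
        }
        return &node->inst;
    }
};
```

A caller would try `lookup()` first and fall back to decoding plus `insert()` on a miss. Two threads that miss on the same instruction at the same time may both insert it; in this sketch that leaves two equivalent entries on the chain, and a later `lookup()` returns whichever was published last. That is harmless as long as callers do not rely on pointer identity. Supporting eviction or replacement would need a reclamation scheme such as the RCU-type approach mentioned in the quoted message.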
