It's important to remember that we already do have two levels of caching: the PC-based page cache and the backing hash table. I think making the PC-based page cache lock-free should be pretty straightforward. Since the hash-table-based decode cache is only accessed on a page cache miss, maybe just using a lock-based hash table like concurrent_unordered_map would be OK.
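[For concreteness, the two existing levels might look roughly like the sketch below. All class and member names here are hypothetical stand-ins, not gem5's actual types, and a plain std::mutex stands in for whatever lock-based table (e.g. TBB's concurrent_unordered_map) ends up guarding the second level.]

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical decoded-instruction type; stands in for the real decode result.
struct DecodedInst { uint64_t machInst; };

class DecodeCache {
    // Level 1: small PC-indexed page cache -- checked first; lock-free
    // if each thread/decoder has its own copy.
    static const size_t PageEntries = 256;
    struct PageEntry { uint64_t pc = ~0ULL; DecodedInst *inst = nullptr; };
    PageEntry pageCache[PageEntries];

    // Level 2: backing hash table, only touched on a page-cache miss.
    // Entries are never deleted, so a simple lock suffices for now.
    std::unordered_map<uint64_t, DecodedInst *> table;
    std::mutex tableLock;

  public:
    DecodedInst *lookup(uint64_t pc) {
        PageEntry &e = pageCache[pc % PageEntries];
        if (e.pc == pc)
            return e.inst;                      // fast path: page-cache hit

        std::lock_guard<std::mutex> g(tableLock);
        auto it = table.find(pc);
        DecodedInst *inst;
        if (it != table.end()) {
            inst = it->second;                  // hash-table hit
        } else {
            inst = new DecodedInst{pc};         // decode on a full miss
            table.emplace(pc, inst);
        }
        e.pc = pc;                              // refill the page cache
        e.inst = inst;
        return inst;
    }
};
```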
I hadn't thought about the challenge of resizing a hash table; I was mostly thinking about how easy it should be to append new entries to the end of a hash chain in a lock-free manner. Note that we never delete anything from the hash table, so that's not an issue. The speculative unlocked lookup followed by a locked try-again-and-insert-if-needed makes sense; I think a readers/writer lock is probably overkill in that case. Another option (if we really do want a lock-free hash table) would be just to preallocate a reasonable number of buckets and not allow it to grow.

I'll repeat my initial statement, which is that this whole discussion is probably premature, as we don't really have a general parallel simulator yet. When we do get to the point of optimizing the cache, if there is still contention about the right approach, we should probably start by instrumenting the code and getting a feel for the typical miss rates and sizes. I wouldn't be surprised if we find that, once we get past the initial startup transient, as long as the PC-based page cache hits are fast, the rest of it really doesn't matter.

Steve

On Sat, Feb 9, 2013 at 1:24 PM, nathan binkert <[email protected]> wrote:
> An alternative is to realize that we're computer architects! We could
> easily have a two-level cache (and basically combine all three of
> Nilay's ideas.) You have a per-thread cache, but each decoder would
> have its own pointer to that per-thread cache, so you don't have to
> use TLS. If you miss in your local cache, then you look in the global
> cache; if you hit, you add to the local cache. If you miss, you
> create it and update both. (Using a lock on the global cache.) The
> good thing is that there is no coherence problem, since we don't
> actually delete or update entries in this cache.
>
> Unfortunately, without two levels, I don't know how to access an
> unordered map without a lock.
> A concurrent write can cause your iterator to be invalidated because
> of a rehash (if the map grew in size).
>
> With the regular map, insert does not invalidate iterators, so you
> could do something like: execute a find without a lock; if you miss,
> then acquire a (read?) lock and do the find again; if you hit, return;
> if you miss, (upgrade to a write lock and) insert the new record.
>
> If we did a two-level thing, I'd suggest the hash map for the
> per-thread thing, and the regular map for the global thing.
>
> All that said, there is a simpler option. The
> concurrent_unordered_map from TBB is quite good. All locks are
> internal and are on a per-bucket basis. Inserts and reads are allowed
> without having to do anything special.
>
> Nate
>
> On Sat, Feb 9, 2013 at 11:06 AM, Steve Reinhardt <[email protected]> wrote:
> > My main reaction is that we shouldn't rush into this. As long as we
> > have a solution that works for now, there are probably many more
> > important things to work on. Once we have all the other pieces in
> > place to make a usable parallel simulator, then we can worry about
> > performance optimizations such as better handling of the decode cache.
> >
> > My secondary reaction is that the only potential downside to a globally
> > shared cache is the cost of acquiring a lock on every read access. In
> > the long run, writes should be pretty rare, so the cost of updates
> > should be largely irrelevant. If we can come up with a lock-free way
> > of doing updates, then there is no downside to a globally shared
> > cache. Thus, when we do get to the point of wanting to optimize the
> > decode cache, I think the first order of business is to try and find
> > a way to do lock-free updates. If we're successful (and I expect we
> > will be), then there's no reason to consider any other organization.
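[The locked try-again-and-insert-if-needed pattern Nate describes could be sketched as below. One caveat: in standard C++ a fully unlocked find racing with an insert is formally a data race even on std::map, so this sketch conservatively takes a shared (read) lock on the fast path, with std::shared_mutex (C++17) standing in for whatever readers/writer lock would actually be used. Names and types are hypothetical.]

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <shared_mutex>

struct DecodedInst { uint64_t machInst; };   // hypothetical stand-in

// std::map rather than unordered_map: insert never invalidates existing
// iterators or references, so returned pointers stay valid across growth.
std::map<uint64_t, DecodedInst *> cache;
std::shared_mutex cacheLock;

DecodedInst *lookupOrInsert(uint64_t machInst) {
    // Fast path under a shared (read) lock: many readers can proceed
    // concurrently, and in steady state almost every lookup hits here.
    {
        std::shared_lock<std::shared_mutex> r(cacheLock);
        auto it = cache.find(machInst);
        if (it != cache.end())
            return it->second;
    }

    // Miss: take the exclusive (write) lock, re-check (another thread may
    // have inserted in the window between the locks), then insert.
    std::unique_lock<std::shared_mutex> w(cacheLock);
    auto it = cache.find(machInst);
    if (it != cache.end())
        return it->second;
    DecodedInst *inst = new DecodedInst{machInst};
    cache.emplace(machInst, inst);
    return inst;
}
```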
> >
> > Steve
> >
> > On Sat, Feb 9, 2013 at 7:01 AM, Nilay <[email protected]> wrote:
> >
> >> We need to decide on how we want to handle the decode cache. I can think
> >> of the following three ways --
> >>
> >> 1. Per-decoder cache: needs the most space, hence more cache misses and
> >> low performance.
> >>
> >> 2. Per-thread cache: less space than above, so fewer cache misses
> >> (hopefully). But TLS variables have access costs. It seems like it would
> >> take at least two more instructions per access (on x86-64), more
> >> depending on how badly the compiler performs in analyzing the use of the
> >> variable. An added advantage might be that single-thread simulations
> >> would not be hurt at all.
> >>
> >> 3. Global cache: least space, so it should have the fewest cache misses.
> >> But it requires the protection of a lock. The cost will be several usual
> >> instructions + one atomic instruction (which should result in some
> >> coherency overhead) even if the lock is not contended for (unlikely).
> >> It would require extra code if we are to avoid hurting single-thread
> >> simulation performance. Some RCU-type implementation might be possible
> >> as well.
> >>
> >> In my opinion, the size of the cache should decide which way to go. If
> >> it is less than 100 KB or so, a per-simulated-CPU variable seems fine to
> >> me, TLS if about 500 KB, and a global variable above that.
> >>
> >> --
> >> Nilay
> >>
> >> _______________________________________________
> >> gem5-dev mailing list
> >> [email protected]
> >> http://m5sim.org/mailman/listinfo/gem5-dev
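[Nate's two-level suggestion earlier in the thread, which combines Nilay's three options (a per-thread cache reached through a per-decoder pointer rather than TLS, backed by a lock-protected global cache), could be sketched like this. All names and types are hypothetical, and a plain std::mutex stands in for whatever lock the global cache would actually use.]

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

struct DecodedInst { uint64_t machInst; };   // hypothetical stand-in

// Global cache shared by all decoders, guarded by a lock. Nothing is ever
// deleted or updated in place, so there is no coherence problem to manage.
std::unordered_map<uint64_t, DecodedInst *> globalCache;
std::mutex globalLock;

// Each decoder holds a pointer to its thread's local cache, so per-thread
// storage is reached without TLS access costs.
class Decoder {
    std::unordered_map<uint64_t, DecodedInst *> *localCache;
  public:
    explicit Decoder(std::unordered_map<uint64_t, DecodedInst *> *local)
        : localCache(local) {}

    DecodedInst *decode(uint64_t machInst) {
        // 1. Look in the per-thread cache (no lock needed).
        auto it = localCache->find(machInst);
        if (it != localCache->end())
            return it->second;

        // 2. Local miss: consult the global cache under the lock.
        std::lock_guard<std::mutex> g(globalLock);
        auto git = globalCache.find(machInst);
        DecodedInst *inst;
        if (git != globalCache.end()) {
            inst = git->second;                 // global hit
        } else {
            inst = new DecodedInst{machInst};   // full miss: decode once
            globalCache.emplace(machInst, inst);
        }
        (*localCache)[machInst] = inst;         // 3. fill the local cache
        return inst;
    }
};
```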
