We need to decide how we want to handle the decode cache. I can think of the following three ways:
1. Per-decoder cache: needs the most space, hence more cache misses and lower performance.

2. Per-thread cache: less space than the above, so fewer cache misses (hopefully). But TLS variables have access costs. It looks like that would be at least two more instructions per access (on x86-64), and more depending on how badly the compiler does at analyzing the use of the variable. An added advantage might be that single-thread simulations would not be hurt at all.

3. Global cache: least space, so it should have the fewest cache misses. But it requires the protection of a lock. The cost will be several ordinary instructions plus one atomic instruction (which should result in some coherency overhead) even if the lock is not contended for (unlikely). It would require extra code if we are to avoid hurting single-thread simulation performance. Some RCU-type implementation might be possible as well.

In my opinion the size of the cache should decide which way to go. If it is less than 100 KB or so, a per-simulated-CPU variable seems fine to me, TLS if it is about 500 KB, and a global variable above that.

--
Nilay
_______________________________________________
gem5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/gem5-dev
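
For reference, the three placements discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions: StaticInst, DecodeCache, decodeSlow, lookup, Decoder, decodeTls and decodeGlobal are hypothetical stand-ins, not gem5's actual types or functions, and a hash map stands in for whatever structure the real decode cache uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-ins for illustration; not gem5's actual types.
struct StaticInst { std::uint32_t machInst; };
using DecodeCache = std::unordered_map<std::uint32_t, StaticInst>;

// Placeholder for the real (expensive) decode path.
static StaticInst decodeSlow(std::uint32_t mi) { return StaticInst{mi}; }

// Shared lookup: return the cached entry, decoding and inserting on a miss.
static const StaticInst &lookup(DecodeCache &cache, std::uint32_t mi)
{
    auto it = cache.find(mi);
    if (it == cache.end())
        it = cache.emplace(mi, decodeSlow(mi)).first;
    return it->second;
}

// 1. Per-decoder cache: every decoder object owns a private copy,
//    so no synchronization is needed but the total footprint is largest.
class Decoder
{
    DecodeCache cache;

  public:
    const StaticInst &decode(std::uint32_t mi) { return lookup(cache, mi); }
    std::size_t cached() const { return cache.size(); }
};

// 2. Per-thread cache: one copy per host thread via thread_local.
//    No lock, but each access goes through the TLS addressing path.
static thread_local DecodeCache tlsCache;

const StaticInst &decodeTls(std::uint32_t mi)
{
    return lookup(tlsCache, mi);
}

// 3. Global cache: a single shared copy guarded by a mutex. Taking the
//    lock costs an atomic operation even when it is uncontended.
static DecodeCache globalCache;
static std::mutex globalCacheLock;

StaticInst decodeGlobal(std::uint32_t mi)
{
    std::lock_guard<std::mutex> guard(globalCacheLock);
    return lookup(globalCache, mi); // copied out while the lock is held
}
```

With these definitions, two Decoder objects that decode the same instruction each keep their own entry, threads calling decodeTls keep one entry per thread, and all callers of decodeGlobal share a single entry. The RCU-style variant mentioned under option 3 is not shown here.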
