We need to decide how we want to handle the decode cache. I can think of
the following three ways:

1. Per-decoder cache: needs the most space, hence more cache misses and
lower performance.

2. Per-thread cache: less space than the above, so fewer cache misses
(hopefully). But TLS variables have access costs. It seems like it would
cost at least two more instructions per access (on x86-64), and more
depending on how badly the compiler does at analyzing the use of the
variable. An added advantage might be that single-threaded simulations
would not be hurt at all.

3. Global cache: least space, so it should have the fewest cache misses.
But it requires the protection of a lock. The cost will be several
ordinary instructions plus one atomic instruction (which should result in
some coherency overhead) even if the lock is not contended for (unlikely).
It would require extra code if we are to avoid hurting single-threaded
simulation performance. Some RCU-type implementation might be possible as
well.
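
To make the three placements concrete, here is a rough C++ sketch. All the
names in it (MachInst, StaticInst, DecodeCache, slowDecode, lookup and the
three decode entry points) are made up for illustration and are not the
existing gem5 interfaces. The hash map is only a stand-in for whatever
structure the real cache uses.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-ins; these are NOT gem5's actual types.
using MachInst = std::uint32_t;
struct StaticInst { MachInst bits; };
using DecodeCache = std::unordered_map<MachInst, StaticInst>;

// Placeholder for the real (expensive) decode step.
inline StaticInst slowDecode(MachInst inst) { return StaticInst{inst}; }

// Look up inst in cache, decoding and inserting it on a miss.
inline const StaticInst &lookup(DecodeCache &cache, MachInst inst)
{
    auto it = cache.find(inst);
    if (it == cache.end())
        it = cache.emplace(inst, slowDecode(inst)).first;
    assert(it != cache.end());
    return it->second;
}

// Option 1: every decoder object owns its own cache.
class PerDecoder
{
    DecodeCache cache;

  public:
    StaticInst decode(MachInst inst) { return lookup(cache, inst); }
};

// Option 2: one cache per host thread. No lock is needed, but every
// access goes through the TLS mechanism.
inline StaticInst decodePerThread(MachInst inst)
{
    thread_local DecodeCache cache;
    return lookup(cache, inst);
}

// Option 3: one cache shared by all threads. Taking the mutex costs at
// least one atomic operation even when it is uncontended.
inline StaticInst decodeGlobal(MachInst inst)
{
    static DecodeCache cache;
    static std::mutex lock;
    std::lock_guard<std::mutex> guard(lock);
    return lookup(cache, inst);
}
```

All three return the same result for the same instruction; the only
difference is where the cache lives and what each access pays to reach it.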

In my opinion the size of the cache should decide which way to go. If it
is less than 100 KB or so, a per-decoder (per simulated CPU) cache seems
fine to me, TLS if it is about 500 KB, and a global cache above that.
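
Written out as code, the rule of thumb would look something like the
sketch below. The enum and function names are hypothetical, the thresholds
are just the rough figures above rather than measurements, and treating
everything from 100 KB up to 500 KB as the TLS range is my assumption about
where the boundaries fall.

```cpp
#include <cstddef>

// Hypothetical names, for illustration only.
enum class CachePlacement { PerDecoder, PerThread, Global };

constexpr std::size_t KiB = 1024;

// Pick a placement from the cache size in bytes. The cut-off points are
// rough guesses, not measured values.
constexpr CachePlacement choosePlacement(std::size_t cacheBytes)
{
    return cacheBytes < 100 * KiB  ? CachePlacement::PerDecoder
         : cacheBytes <= 500 * KiB ? CachePlacement::PerThread
                                   : CachePlacement::Global;
}
```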

--
Nilay
