My main reaction is that we shouldn't rush into this.  As long as we have a
solution that works for now, there are probably many more important things
to work on.  Once we have all the other pieces in place to make a usable
parallel simulator, then we can worry about performance optimizations such
as better handling of the decode cache.

My secondary reaction is that the only potential downside to a globally
shared cache is the cost of acquiring a lock on every read access.  In the
long run, writes should be pretty rare, so the cost of updates should be
largely irrelevant.  If we can come up with a lock-free way of doing
updates, then there is no downside to a globally shared cache.  Thus, when
we do get to the point of wanting to optimize the decode cache, I think the
first order of business is to try and find a way to do lock-free updates.
If we're successful (and I expect we will be), then there's no reason to
consider any other organization.
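
To make the lock-free idea a bit more concrete, here is a rough sketch of
one way it might look. This is not gem5 code: DecodedInst,
SharedDecodeCache, NumSlots, and decodeCached are all made-up names, the
direct-mapped layout and table size are assumptions, and entries are only
freed when the cache itself is destroyed, which sidesteps the reclamation
problem rather than solving it. Readers do a single acquire load and a key
compare; writers build an immutable entry and publish it with a release
store, so neither side takes a lock.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a decoded instruction. Immutable once published.
struct DecodedInst {
    std::uint64_t machInst;      // raw encoding, used as the lookup key
    int opClass;                 // placeholder for the real decoded result
    DecodedInst *nextAllocated;  // intrusive link used only for cleanup
};

// Direct-mapped cache shared by all threads; neither path takes a lock.
class SharedDecodeCache {
  public:
    static constexpr std::size_t NumSlots = 1024;  // assumed size
    static_assert((NumSlots & (NumSlots - 1)) == 0, "must be a power of two");

    SharedDecodeCache() : allocated(nullptr) {
        for (auto &slot : slots)
            slot.store(nullptr, std::memory_order_relaxed);
    }

    // Assumes no other thread is still using the cache at destruction time.
    ~SharedDecodeCache() {
        DecodedInst *e = allocated.load(std::memory_order_acquire);
        while (e) {
            DecodedInst *next = e->nextAllocated;
            delete e;
            e = next;
        }
    }

    SharedDecodeCache(const SharedDecodeCache &) = delete;
    SharedDecodeCache &operator=(const SharedDecodeCache &) = delete;

    // Read path: one acquire load plus a key compare. Returns nullptr on miss.
    const DecodedInst *lookup(std::uint64_t machInst) const {
        const DecodedInst *e =
            slots[index(machInst)].load(std::memory_order_acquire);
        return (e && e->machInst == machInst) ? e : nullptr;
    }

    // Write path: build the entry fully, then publish it with a release store.
    // A replaced entry stays alive (readers may still hold it) until the
    // destructor walks the allocation list.
    const DecodedInst *insert(std::uint64_t machInst, int opClass) {
        DecodedInst *e = new DecodedInst{machInst, opClass, nullptr};

        // Push onto a lock-free intrusive list so the entry can be freed later.
        e->nextAllocated = allocated.load(std::memory_order_relaxed);
        while (!allocated.compare_exchange_weak(e->nextAllocated, e,
                                                std::memory_order_release,
                                                std::memory_order_relaxed)) {
            // On failure, e->nextAllocated now holds the current head; retry.
        }

        slots[index(machInst)].store(e, std::memory_order_release);
        return e;
    }

  private:
    static std::size_t index(std::uint64_t machInst) {
        return static_cast<std::size_t>(machInst) & (NumSlots - 1);
    }

    std::array<std::atomic<const DecodedInst *>, NumSlots> slots;
    std::atomic<DecodedInst *> allocated;
};

// Hypothetical caller: hit on the shared cache, otherwise decode and publish.
// The "decode" here is a dummy computation standing in for the real decoder.
const DecodedInst *decodeCached(SharedDecodeCache &cache,
                                std::uint64_t machInst) {
    if (const DecodedInst *hit = cache.lookup(machInst))
        return hit;
    return cache.insert(machInst, static_cast<int>(machInst & 0xf));
}
```

If two threads miss on the same instruction at once, both will decode and
publish; one entry wins the slot and the other is simply unused until
cleanup. That seems acceptable if writes really are rare, but it is a
design choice of this sketch, not something we have measured.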

Steve




On Sat, Feb 9, 2013 at 7:01 AM, Nilay <[email protected]> wrote:

> We need to decide on how we want to handle the decode cache. I can think
> of the following three ways --
>
> 1. Per decoder cache: needs most space, hence more cache misses and low
> performance.
>
> 2. Per thread cache: less space than above, so fewer cache misses
> (hopefully). But TLS variables have access costs. It seems like it would
> cost at least two more instructions per access (on x86-64), more depending
> on how badly the compiler performs in analyzing the use of the variable.
> An added advantage might be that single thread simulations would not be
> hurt at all.
>
> 3. Global cache: least space, so should have least cache misses. But
> requires protection of a lock. The costs will be several usual
> instructions + one atomic instruction (should result in some coherency
> overhead) even if the lock is not contended for (unlikely). Would require
> extra code if we are to avoid hurting single thread simulation
> performance. Some RCU-type implementation might be possible as well.
>
> In my opinion the size of the cache should decide which way to go. If it
> is less than 100 KB or so, a per simulated cpu variable seems fine to me,
> TLS if about 500 KB, and global variable above that.
>
> --
> Nilay
>
> _______________________________________________
> gem5-dev mailing list
> [email protected]
> http://m5sim.org/mailman/listinfo/gem5-dev
>
