I have a simple change that adds code to x86's STUPD microop to save
the old value of the base register into a backup int, as described
before. I expected the effect on x86 simulation performance to be very
minor, but I wanted to make sure. Unfortunately, running twolf on the
atomic CPU typically shows about a 5% slowdown, and it can be worse
than that depending on the specific run. The variability makes me think
it might be a caching issue, and because the slowdown happens with or
without the extra storage location in the integer register file, I
suspect I'm pushing something just beyond the capacity of the I-cache.

Anyway, in a so far fruitless attempt to understand where the
performance is going, I ran the old and new versions through gprof.
Unfortunately the difference vanishes with an instrumented binary, so
that didn't help. The output is below, though, and a couple of things
stand out. First, the predecoder has a big impact on performance for
x86. That isn't surprising, since its output isn't cached the way
decoded instructions are, and it usually processes one byte at a time,
or perhaps a whole immediate at once. This could possibly be improved
by a cache of some sort, although we don't have the advantage of a
one-to-one mapping from raw bytes to ExtMachInsts, because at that
point the contextualizing info hasn't been added (by the predecoder)
yet. If we move the predecoder, and by extension the regular decoder,
to a modular design where multiple decoders are used in different
circumstances, the contextualizing information could be largely, or
hopefully entirely, implicit in which predecoder is in place.

The second and more immediately useful thing I see is that the ==
operator for ExtMachInsts is pretty high on the list. I was
entertaining the idea of adding a simple hash field to the ExtMachInst
that would be a hash of the rest of the structure. If the hashes of two
ExtMachInsts don't match, you can stop right there and not check the
rest. The trouble is that == is normally used, I think, to compare
ExtMachInsts associated with cached decoded StaticInsts against
ExtMachInsts fresh from the predecoder. By forcing the predecoder to
calculate a hash for everything it generates, we could just be moving
the cost around or even making it worse. Then again, ExtMachInst hashes
are used to index into the hash map used for the decoder, I think, so
making the hash function simply return the precomputed hash could save
time there too. On the other hand, the == calls could be coming from
inside the hash map. In that case comparing the hashes first would just
waste time: if the hashes didn't match we wouldn't be in that bucket,
and the hash map wouldn't be bothering to compare values. Maybe my
ExtMachInst hash function stinks and things are clumped together.
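
To make the tradeoff concrete, here is a minimal sketch of the
precomputed-hash idea. Everything in it is an assumption made for
illustration: ToyExtMachInst and its fields are a stand-in, not the
real X86ISA::ExtMachInst, the mixing function is an arbitrary FNV-style
choice, and std::unordered_map stands in for whatever hash map the
decoder actually uses. The equality counter and largestBucket() are
only there to help answer the two open questions above: whether the ==
calls come from inside the hash map, and whether entries are clumping
into a few buckets.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for X86ISA::ExtMachInst; the real structure
// has different fields.
struct ToyExtMachInst {
    uint8_t prefixes = 0;
    uint8_t opcode = 0;
    uint8_t modRM = 0;
    uint64_t immediate = 0;
    uint64_t displacement = 0;
    uint64_t cachedHash = 0; // filled in once by finalize()

    // One FNV-style mixing step (arbitrary choice for this sketch).
    static uint64_t mix(uint64_t h, uint64_t v) {
        h ^= v;
        h *= 0x100000001b3ull;
        return h;
    }

    // Compute the hash once, e.g. when the predecoder finishes
    // assembling an instruction.
    void finalize() {
        uint64_t h = 0xcbf29ce484222325ull;
        h = mix(h, prefixes);
        h = mix(h, opcode);
        h = mix(h, modRM);
        h = mix(h, immediate);
        h = mix(h, displacement);
        cachedHash = h;
    }
};

// Counts calls to operator== so lookups can be instrumented.
static std::size_t equalityCalls = 0;

inline bool operator==(const ToyExtMachInst &a, const ToyExtMachInst &b) {
    ++equalityCalls;
    // Early out: different hashes imply different contents.
    if (a.cachedHash != b.cachedHash)
        return false;
    return a.prefixes == b.prefixes && a.opcode == b.opcode &&
           a.modRM == b.modRM && a.immediate == b.immediate &&
           a.displacement == b.displacement;
}

// Hash functor that just returns the precomputed value.
struct ToyHash {
    std::size_t operator()(const ToyExtMachInst &e) const {
        return static_cast<std::size_t>(e.cachedHash);
    }
};

using DecodeCache = std::unordered_map<ToyExtMachInst, int, ToyHash>;

// Size of the fullest bucket; a large value relative to the average
// load would suggest the hash function is clumping entries together.
inline std::size_t largestBucket(const DecodeCache &cache) {
    std::size_t worst = 0;
    for (std::size_t b = 0; b < cache.bucket_count(); ++b)
        worst = std::max(worst, cache.bucket_size(b));
    return worst;
}

// Convenience constructor for the examples.
inline ToyExtMachInst makeInst(uint8_t opcode, uint64_t imm) {
    ToyExtMachInst e;
    e.opcode = opcode;
    e.immediate = imm;
    e.finalize();
    return e;
}
```

One way to use this: reset equalityCalls, run a batch of lookups, and
compare the count against the number of lookups. If it comes out at
roughly one == per hit, the comparisons are the map confirming a match
within a bucket and the early out buys nothing. A count well above that
would point at collisions, which largestBucket() can help confirm.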

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 13.40      8.68     8.68 260854207     0.00     0.00  AtomicSimpleCPU::tick()
  9.47     14.81     6.13 173494411     0.00     0.00  X86ISA::Predecoder::process()
  7.13     19.42     4.62 260854207     0.00     0.00  BaseSimpleCPU::preExecute()
  6.12     23.38     3.96 250692149     0.00     0.00  Bus::recvAtomic(Packet*)
  4.25     26.13     2.75 250692149     0.00     0.00  X86ISA::TLB::lookup(unsigned long, bool)
  4.06     28.76     2.63 250692149     0.00     0.00  X86ISA::TLB::translate(Request*, ThreadContext*, BaseTLB::Translation*, BaseTLB::Mode, bool&, bool)
  3.64     31.12     2.36 250692149     0.00     0.00  PhysicalMemory::doAtomicAccess(Packet*)
  3.38     33.31     2.19 219431024     0.00     0.00  BaseSimpleCPU::advancePC(RefCountingPtr<FaultBase>)
  2.92     35.20     1.89 219431024     0.00     0.00  BaseSimpleCPU::postExecute()
  2.83     37.03     1.83 569732017     0.00     0.00  StaticInstPtr::operator=(StaticInstPtr const&)
  2.22     38.47     1.44 132035637     0.00     0.00  X86ISA::operator==(X86ISA::ExtMachInst const&, X86ISA::ExtMachInst const&)
  2.15     39.87     1.40  56649590     0.00     0.00  AtomicSimpleCPU::readBytes(unsigned long, unsigned char*, unsigned int, unsigned int)
  1.82     41.05     1.18 219430620     0.00     0.00  X86ISA::MacroopBase::fetchMicroop(unsigned short) const
  1.79     42.21     1.16 260854208     0.00     0.00  EventQueue::serviceOne()
