I have a simple change that adds code to x86's STUPD microop to save the old value of the base register into a backup int, as described before. I wanted to know how that affected the performance of x86 simulation; I expected the impact to be very minor, but I wanted to make sure. Unfortunately, running twolf on the atomic CPU there's typically about a 5% slowdown, and it can be worse than that depending on the specific run. The variability makes me think it might be a caching issue, and because it happens with or without the extra storage location in the integer register file, I suspect I'm pushing something just beyond the capacity of the I cache.
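In case it helps to see the shape of the change, here's a rough sketch of the idea. This is not the actual microop code, and the names (storeAndUpdate, BackupState, doStore) are made up; it just shows the extra copy of the old base value that the real change adds:

    // Sketch only -- not gem5's actual STUPD implementation; names are placeholders.
    #include <cstdint>

    struct BackupState { uint64_t oldBase; };   // the "backup int"

    // A store-with-update ("STUPD"-style) step: remember the old base
    // register value, then update the base and perform the store.
    void
    storeAndUpdate(uint64_t &baseReg, uint64_t effAddr, BackupState &bk,
                   void (*doStore)(uint64_t))
    {
        bk.oldBase = baseReg;   // the new work the change adds
        baseReg = effAddr;      // the register update STUPD already did
        doStore(effAddr);       // if the store faults, bk.oldBase can restore baseReg
    }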
Anyway, in a so far fruitless attempt to understand where the performance is going, I ran the old and new versions through gprof. Unfortunately the difference vanishes with an instrumented binary, so that didn't help directly. The output is below, though, and a couple of things stand out.

First, the predecoder has a big impact on performance for x86. That isn't too surprising, since its output isn't cached the way decoded instructions are, and it usually processes one byte at a time, or at best a whole immediate at once. This could possibly be improved with a cache of some sort, although we don't have the advantage of a one-to-one mapping because the contextualizing info hasn't been added (by the predecoder) yet. If we moved the predecoder, and by extension the regular decoder, to a modular design with multiple decoders used in different circumstances, the contextualizing information could be largely, or hopefully entirely, implicit in which predecoder is in place.

The second and more immediately useful thing I see is that the == operator for ExtMachInsts is pretty high on the list. I was entertaining the idea of adding a simple hash field to the ExtMachInst, computed over the rest of the structure; if the hashes don't match, the comparison can stop right there without checking the rest (a rough sketch of what I mean is at the end of this mail). The trouble is that this comparison is normally, I think, between ExtMachInsts associated with cached decoded StaticInsts and ExtMachInsts fresh from the predecoder, so by forcing the predecoder to compute a hash for everything it generates we could just be moving the cost around or even making it worse. Then again, ExtMachInst hashes are, I think, used to index into the hash map the decoder uses, so making the hash function transparently return the precomputed value could save time there too. On the other hand, the == calls could be coming from inside the hash map itself, in which case comparing hashes first just wastes time: if the hashes didn't match we wouldn't be in that bucket, and the hash map wouldn't be bothering to compare values at all. Or maybe my ExtMachInst hash function stinks and things are clumping together.

    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls   s/call   s/call  name
    13.40      8.68      8.68 260854207     0.00     0.00  AtomicSimpleCPU::tick()
     9.47     14.81      6.13 173494411     0.00     0.00  X86ISA::Predecoder::process()
     7.13     19.42      4.62 260854207     0.00     0.00  BaseSimpleCPU::preExecute()
     6.12     23.38      3.96 250692149     0.00     0.00  Bus::recvAtomic(Packet*)
     4.25     26.13      2.75 250692149     0.00     0.00  X86ISA::TLB::lookup(unsigned long, bool)
     4.06     28.76      2.63 250692149     0.00     0.00  X86ISA::TLB::translate(Request*, ThreadContext*, BaseTLB::Translation*, BaseTLB::Mode, bool&, bool)
     3.64     31.12      2.36 250692149     0.00     0.00  PhysicalMemory::doAtomicAccess(Packet*)
     3.38     33.31      2.19 219431024     0.00     0.00  BaseSimpleCPU::advancePC(RefCountingPtr<FaultBase>)
     2.92     35.20      1.89 219431024     0.00     0.00  BaseSimpleCPU::postExecute()
     2.83     37.03      1.83 569732017     0.00     0.00  StaticInstPtr::operator=(StaticInstPtr const&)
     2.22     38.47      1.44 132035637     0.00     0.00  X86ISA::operator==(X86ISA::ExtMachInst const&, X86ISA::ExtMachInst const&)
     2.15     39.87      1.40  56649590     0.00     0.00  AtomicSimpleCPU::readBytes(unsigned long, unsigned char*, unsigned int, unsigned int)
     1.82     41.05      1.18 219430620     0.00     0.00  X86ISA::MacroopBase::fetchMicroop(unsigned short) const
     1.79     42.21      1.16 260854208     0.00     0.00  EventQueue::serviceOne()
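To make the hash idea above a bit more concrete, here's a rough sketch of what I have in mind. None of this is the real ExtMachInst definition; the struct, field names, and hash function are placeholders (FNV-1a is just one cheap choice), and the comments note the caveat about == being called from inside the hash map:

    // Sketch of the precomputed-hash idea; not the real X86ISA::ExtMachInst layout.
    #include <cstdint>
    #include <cstring>
    #include <unordered_map>

    struct ExtMachInstish
    {
        uint8_t  bytes[16];     // stand-in for the real predecoded fields
        uint64_t cachedHash;    // filled in once by the predecoder

        void
        computeHash()
        {
            // Cheap FNV-1a over the raw bytes; a better mix may be needed
            // if entries clump into the same buckets.
            uint64_t h = 14695981039346656037ull;
            for (uint8_t b : bytes) {
                h ^= b;
                h *= 1099511628211ull;
            }
            cachedHash = h;
        }
    };

    inline bool
    operator==(const ExtMachInstish &a, const ExtMachInstish &b)
    {
        // Early out: mismatched hashes can't be equal. If the only caller
        // is the decode cache's hash map, though, the hashes already
        // matched to land in the same bucket, so this check mostly just
        // adds work there.
        if (a.cachedHash != b.cachedHash)
            return false;
        return std::memcmp(a.bytes, b.bytes, sizeof(a.bytes)) == 0;
    }

    struct ExtMachInstishHash
    {
        size_t
        operator()(const ExtMachInstish &emi) const
        {
            // "Transparently return the precomputed hash" instead of
            // rehashing every field on every decode-cache lookup.
            return emi.cachedHash;
        }
    };

    using DecodeCache =
        std::unordered_map<ExtMachInstish, int /*StaticInstPtr stand-in*/,
                           ExtMachInstishHash>;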
