Re: [m5-dev] [patch] add support for X86 sse3 haddps instruction

Gabriel Michael Black Wed, 16 Dec 2009 16:33:50 -0800

Quoting Vince Weaver <[email protected]>:

> On Wed, 16 Dec 2009, Steve Reinhardt wrote:
>
>> On Sun, Dec 13, 2009 at 8:57 PM, Vince Weaver <[email protected]> wrote:
>> > I did finish running and verifying spec2k on x86_64 (it took longer than
>> > it should have due to an unfortunate power-outage on our cluster).  The
>> > benchmarks all finished, and the retired instruction count matches actual
>> > hardware perf counters very closely.
>> >
>> > http://www.csl.cornell.edu/~vince/projects/m5/m5_x86_64_se_status.html
>>
>> Wow, this is awesome!  I missed this the first time through (didn't
>> scroll down to the end of the message).  Thanks for all the effort,
>> Vince.
>>
>> Are you tracking uops as well as instructions?  I'm curious how close
>> we are on that.
>
> uops for m5 are currently about 1.5x too many, when compared to AMD Phenom
> and Intel Core2 (slightly better, but not much, when compared against a
> Pentium D).
>
> It's slightly worse than 1.5 on integer spec2k and slightly better on fp.
>
> uops are tricky to get right, I imagine the values will be off unless you
> carefully use perf-counters and other tricks (or else have inside
> knowledge) to match real hardware.  And even then, you'd only match a
> particular x86 implementation, there's wide variation between the various
> generations.  I think PTLSim goes through a lot of trouble to make their
> uop counts match an AMD system, but I don't know how close they manage to
> get.
>
> besides retired instructions, m5 also does a good job (compared to real
> hardware) with L1 dcache accesses.  I was hoping to validate some of the
> other stats, but it's hard to do that with OoO and detailed simulation not
> supported on x86.
>
> Vince


I've been thinking about this since reading your email, and it occurs
to me that the microops could be loads, ops, stores, or opstores and
still roughly fit a RISC-style architecture. Stores have to wait
around in the store queue anyway, so they could wait there for the ALU
to generate their data without a significant penalty. The most common
sort of macroop is a load/op/store where one operand is in memory. In
those cases, if you merge the op and the store, you'd go from 3
microops to 2, which would explain (in this simplified version of the
world) the 1.5x difference. If you look at the SSE instructions, a lot
of them are organized this way, merging a single memory operation with
a computation, although perhaps as loadops instead of opstores (I
forget the details).
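
A rough sketch of that arithmetic, in Python. The macroop kinds and
the names split_uops, fuse_opstores, and uop_ratio are made up for
illustration; this is not m5's decoder, and it assumes every
load/op/store macroop splits into exactly three microops.

```python
# Hypothetical microop breakdown for a few macroop shapes (not m5 code).
SPLIT_TABLE = {
    "reg_op": ["op"],
    "load_op": ["load", "op"],
    "op_store": ["op", "store"],
    "load_op_store": ["load", "op", "store"],
}


def split_uops(macroop_kind):
    """Return the microops when load, op, and store are kept separate."""
    return list(SPLIT_TABLE[macroop_kind])


def fuse_opstores(uops):
    """Merge each adjacent op + store pair into a single 'opstore' microop."""
    fused = []
    i = 0
    while i < len(uops):
        if uops[i] == "op" and i + 1 < len(uops) and uops[i + 1] == "store":
            fused.append("opstore")
            i += 2  # consumed both the op and the store
        else:
            fused.append(uops[i])
            i += 1
    return fused


def uop_ratio(macroop_kinds):
    """Ratio of unfused to fused microop counts over a macroop stream."""
    split = sum(len(split_uops(k)) for k in macroop_kinds)
    fused = sum(len(fuse_opstores(split_uops(k))) for k in macroop_kinds)
    return split / fused


# A stream made only of load/op/store macroops: 3 microops become 2.
print(uop_ratio(["load_op_store"]))
```

Any mix with register-only or load-op macroops would pull the ratio
below 1.5, since those have nothing to merge in this model.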

Gabe
_______________________________________________
m5-dev mailing list
[email protected]
http://m5sim.org/mailman/listinfo/m5-dev
