On 29/04/08 17:41 +0200, NoiseEHC wrote:
On this page
http://wiki.laptop.org/go/Geode_LX
I have named some instructions "Synchronized ops" (in the MMX section). Are those real, or did I mismeasure something?

That section is very difficult to understand.  I'm not sure which
operations you have invented this name for.
As you have probably already noticed, I am not a native English speaker (and I never learned advanced English in school, I just picked it up). What I wanted to say in that section is that every MMX op whose source or destination operand is an integer register (and which is not a MOV) consumes a very different number of clock cycles than 2 (2 is listed for almost every MMX op in the databook, at least in my version). Is that real?
If those are real, could somebody from AMD go through the databook and fix the instruction clock cycle numbers? In that case they certainly do not match reality, and clearly I have better things to do than measure clock cycles.

Clearly you must have some basis for assuming that the numbers are
wrong, so you must have done some measurement.  I consulted the
secret documentation that you claim I am withholding from you, and the
timings there are the same as in the datasheet.  I believe that
you are correct in that these are the clock counts for the instruction to
go through the FPU and don't include the stall time for the pipeline
to clear up.
There is a "Test results" section on that page. The first two tests were conducted via email: I emailed test programs to this list, and people ran them and emailed back the results. The first test in particular has some stupid bugs because I wrote it essentially blind. The third is the result of a session logged into a physical machine. It may be that only this "stall time" is missing from the databook, but the fact is that, as a programmer, I am not interested in how many clock cycles the FPU takes to execute some internal operation (which is what the databook seems to list); I would like to know the real time consumed.
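As a rough sketch of how the raw output of such test programs could be reduced to a "real time consumed" figure per instruction: time a loop containing many copies of the op, time the same loop without it, and divide the difference by the op count. The function name and all readings below are made up for illustration; they are not measurements from an XO.

```python
def cycles_per_op(loop_ticks, empty_loop_ticks, ops_per_run):
    """Estimate the effective cost of one instruction, in clock cycles.

    loop_ticks       -- cycle-counter readings for runs of the test loop
    empty_loop_ticks -- readings for the same loop with the tested op removed
    ops_per_run      -- how many copies of the op each run executes
    """
    if ops_per_run <= 0:
        raise ValueError("ops_per_run must be positive")
    # Take the minimum over repeated runs: it is the reading least
    # disturbed by interrupts and cache misses.
    best = min(loop_ticks)
    overhead = min(empty_loop_ticks)
    return (best - overhead) / ops_per_run


# Hypothetical readings, for illustration only.
runs = [3125, 3050, 3049, 3410]
empty = [1049, 1060, 1049]
print(cycles_per_op(runs, empty, 1000))
```

A figure obtained this way includes any pipeline stall the op causes, which is exactly the part the databook's per-unit clock counts would leave out.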

I am not a silicon designer, so I'm not the final word on if they are
correct or not, but at least that should prove that there isn't a
massive marketing conspiracy to hide the details of the processor
from our customers.  If they are lying to you, they are lying to me,
and they're not lying to me.

The conspiracy remark was not serious; I put a smiley at the end. However, from my perspective it makes no difference whether there is a conspiracy or not. What I actually think is that either I am mistaken and made some errors in measuring, or the technical writer made mistakes years ago and nobody cared to fix them.
Also, the legend is clearly wrong in several places, so it probably needs checking too (for example, on page 668, note 4 talks about 3DNOW ops in the table about FP ops).

That is a mistake - I have let the technical writer know about it.
Thanks!
Another error:
On page 631 it talks about this:
Conditional jump taken | Conditional jump not taken. (e.g., "4|1" = four clocks if jump taken, one clock if jump not taken).
It is never used in the opcode table.
absolutely no info about L2 cache miss penalties or mispredicted jumps or about the pipeline stages of the FP unit.

I don't have any information about L2 cache miss penalties, but they are easy to calculate. Please see:

http://homepages.cwi.nl/~manegold/Calibrator/
Could you run it on your machine and share the results? I do not currently have access to an XO.
I will talk to somebody about documenting the FP unit pipeline.
It does handle 1 instruction per clock from the integer unit.
In practice we know that two floating point instructions back to
back will stall the IU.  I can also tell you that it is optimized
for single precision, so double precision is handled by microcode
and needs to go through the path again.
Thanks!
I would also like to know how many ALUs the FPU has. FMUL costs 1 and PFMUL costs 2: is that because it has only one multiply unit and executes PFMUL serially? If so, does that mean the 3DNOW support is there only for compatibility and will not be faster than plain FP?
All I want is enough data that, when I look at assembly code, I can approximately calculate how many clock cycles it will consume. Nothing more and nothing less.
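To make concrete what kind of calculation I mean, here is a minimal sketch. Only the FMUL = 1 and PFMUL = 2 costs come from this thread; the function, the set of ops treated as FP, and the back-to-back stall penalty are hypothetical placeholders that real measurements would have to fill in.

```python
# Cycle costs per mnemonic. FMUL = 1 and PFMUL = 2 are the figures quoted
# in this thread; a real table would come from the databook or measurement.
COSTS = {"FMUL": 1, "PFMUL": 2}


def estimate_cycles(mnemonics, costs, fp_ops=frozenset(), back_to_back_penalty=0):
    """Sum per-instruction costs, plus an assumed stall for adjacent FP ops.

    back_to_back_penalty is a hypothetical number of extra cycles charged
    whenever one FP instruction directly follows another; its real value
    on the Geode LX is not known here.
    """
    total = 0
    prev_was_fp = False
    for m in mnemonics:
        total += costs[m]  # raises KeyError for an op missing from the table
        is_fp = m in fp_ops
        if is_fp and prev_was_fp:
            total += back_to_back_penalty
        prev_was_fp = is_fp
    return total


code = ["FMUL", "PFMUL", "FMUL"]
print(estimate_cycles(code, COSTS))
print(estimate_cycles(code, COSTS, fp_ops={"FMUL", "PFMUL"}, back_to_back_penalty=3))
```

The first call uses only the table; the second adds the made-up stall penalty for each pair of adjacent FP ops, which is the part I cannot fill in from the databook.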

You have nearly all the information you need, and you can collect the
additional information the same way we do, with careful analysis and
measurement.  In fact, Bernie and Vladimir Makarov have done a lot
of work already in this area, resulting in the Geode-specific
code for gcc 4.2.0 and glibc.  Perhaps you can work with them to figure
out the finer details of the FPU scheduling.  I'm sure they would
appreciate it.

Jordan



_______________________________________________
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel
