Hello all,

"Walter Bright" <[email protected]> wrote in message
news:[email protected]...
> Don wrote:
>> That would really be fun.
>> BTW, the current Intel processors are basically the same as Pentium Pro,
>> with a few improvements. The strange thing is, because of all of the
>> reordering that happens, swapping the order of two (non-dependent)
>> instructions makes no difference at all. So you always need to look at
>> every instruction in a loop, before you can do any scheduling.
>
> I was looking at Agner's document, and it looks like ordering the
> instructions in the 4-1-1 or 4-1-1-1 for optimal decoding could work. This
> would fit right in with the way the scheduler works.
>
> I had thought that with the CPU automatically reordering instructions,
> that scheduling them was obsolete.

Reordering happens in the scheduler. A simple model is "Fetch", "Schedule", "Retire". Fetch and retire are done in program order.

For code that is hitting well in the cache, the biggest bottleneck is the "4" decoder (the complex instruction decoder). Reducing the number of complex instructions will be a big win here, as will settling them into the 4-1-1(-1) pattern.

Of course, on anything after Core 2, the "1" decoders can handle pushes, pops, and load-ops (r+=m), although not load-op-stores (m+=r).

Also, "macro op fusion" lets you get a branch along with the last instruction in decode, potentially giving you 5 macroinstructions per cycle from decode. Make sure the instruction paired with the branch is the flags-producing one (cmp-br).

(I used to work for Intel :)

Ned
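
To make the 4-1-1 point concrete, here is a rough sketch of a decode-cycle counter. It is a toy model and not how any real front end is specified: the function name `decode_cycles`, the `simple_decoders` parameter, and the per-instruction uop counts are all made up for illustration. It assumes that each cycle the first decoder takes one instruction of up to 4 uops, each remaining decoder takes one single-uop instruction, decoding stays in program order, and macro-op fusion and microcoded instructions are ignored.

```python
def decode_cycles(uops, simple_decoders=2):
    """Count decode cycles for a list of per-instruction uop counts.

    Toy model (an assumption, not a hardware spec): each cycle, decoder 0
    takes one instruction of 1 to 4 uops, then each simple decoder takes
    one 1-uop instruction, all in program order. A multi-uop instruction
    that reaches a simple decoder ends the cycle and waits for decoder 0.
    """
    cycles = 0
    i = 0
    n = len(uops)
    while i < n:
        if not 1 <= uops[i] <= 4:
            raise ValueError("model only covers instructions of 1 to 4 uops")
        cycles += 1
        i += 1  # decoder 0 takes this instruction, simple or complex
        slots = simple_decoders
        # Fill the simple decoders with consecutive single-uop instructions.
        while slots > 0 and i < n and uops[i] == 1:
            i += 1
            slots -= 1
    return cycles


# Same six instructions, two orderings (uop counts are hypothetical).
well_ordered = [2, 1, 1, 2, 1, 1]    # complex instruction leads each group
badly_ordered = [1, 2, 1, 2, 1, 1]   # complex instructions land on "1" slots
print(decode_cycles(well_ordered), decode_cycles(badly_ordered))
```

Under these assumptions the well-ordered sequence decodes in 2 cycles and the badly ordered one in 3, which is the kind of difference a compiler scheduler could target. Pass `simple_decoders=3` to model the 4-1-1-1 case.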
