"Hausheer, Geoffrey" wrote:

> Thanks for the info.
> I would like to follow up, however, that the respective cache sizes change
> drastically on IA32 as well.  If I remember correctly (and it has been a few
> years so I probably don't)
> The 386 has no L1 cache
> The 486 and Pentium had small L1 caches (Pentium was 8kData, 8kInstruction?)
> The P6 line has anything from no L2 to a 1(2?)Meg L2.  The L1 was 32k
> The Athlon/Duron has an L2 from 64k-512k and an L1 of 128k
> The Willamette has 256k L1 1M L2 and an L3 (only for the server market?)
>
> Also, any of the processors from the P6 class or newer use
> microcode-translation engines (i.e. a RISCish core), so the size of an IA32
> instruction is very hard to determine once it enters the machine (i.e. how
> much room it takes up in the L1)
>
> So how do you optimize assembly-code to fit into the caches in this case?
> It makes a big difference performance wise if you hit the L1 vs the L2 for
> performance critical code, and if you get it wrong, you might end up missing
> most of the time (one line displaces the one you need next, which displaces
> the one you just read, etc.).  I would be very interested in knowing what
> methods you use to design algorithms that fit into a specific cache, when
> you don't know the size of the cache, or how much space a given instruction
> will occupy.  Am I missing something here?  Or was the point 'small code =
> higher probability of fitting in the cache' (in which case you also want to
> avoid any instruction that might end up being translated into micro-code)?
>
> The concept of optimizing for caches has been thrown around in the 'Optimal
> hashing' thread, and I've been wondering how to go about doing it.

For the embedded systems I've designed, the CPU is selected to fit the
application, then the application is optimized for the CPU.  In these cases,
you know exactly what cache is available, and how the code is using it.  On
some of the processors I've used, I've even been able to control how much of
the onboard SRAM (on the CPU die, that is) was dedicated to cache, and how much
to register windows or scratchpad RAM.

For many embedded apps, especially those produced in extremely large volumes,
saving a dime on the CPU can affect literally millions of dollars in profits.
The tuning needed to achieve optimal cost and performance is a matter of
serious money, not just technical prowess.

Of course, very little of this applies to the PC platform simply because "PC
platform" is a very poorly defined term.  There are literally hundreds of
different "x86" processors available, with cache schemes ranging from "none" to
"huge", and instruction execution sequencers with levels of "intelligence"
ranging from "none" to "genius".  On such a "platform", it is impossible to
have a single solution that is optimal for all platform configurations.

There are at least three classes of solutions in this case:

1. Optimize for "good enough" performance across all platforms (or the "least
non-optimal").

2. Optimize for the slowest target platform, effectively disabling the
application for targets more "primitive" than the selected target.

3. Produce an application with multiple optimizations that are configured at
run-time to be the best for the actual processor in use.


Most compilers target the first solution, even if they offer options that
supposedly optimize for other targets.  Enabling and disabling the various
instruction-ordering schemes makes remarkably little difference overall!  The
main processor-specific optimizations come from instructions and instruction
modes that are not present on other processors (such as MMX).  I am not aware
of a single x86 compiler that even tries to perform cache optimization.  Thus,
you are typically stuck with the first option.

The second option is rarely used in the PC world, but is often used in
industry, where complete control of the target platform exists.  When it does
happen in the PC world, it is an extremely gradual process:  Remember, Win95
was the first consumer MS OS incapable of running on the '286 or earlier.  The
largest Linux vendor, Red Hat, still targets the '386 by default!  The decision
to abandon "primitive" platforms is never taken lightly in the PC world, hence
the very rare occurrence of the second option, which effectively prevents
optimizing for the latest bleeding-edge platform.

The third option has happened several times on the PC platform, and on the
Apple platform too, each time a major processor architecture upgrade was made
available.  From the dawn of the PC era, applications would either select their
code at runtime, or provide different binaries for different processor
variations.  In the days of CP/M, it was not unusual to find code with dual
optimization for the 8080 and Z80 (an 8080 superset).  Later, the same happened
between the 8086/8088 and the NEC V20 (another superset).  It happened again
between the 8086 and the 80286 (PC AT), and again with the 80386:  Multiple
optimizations were often delivered to suit the processor.  But since the 80386,
and the IA32 architecture it introduced, not much has changed within the
applications.

MS evolved its OS, slowly replacing 16-bit code with 32-bit code (starting with
Windows 3.1, and first completed in Windows NT 3.5x, though WinMe still
contains some 16-bit code).  The most recent "revolution" in any major PC
platform was Apple's switch from the 68K to the PowerPC architecture; during
the transition, Apple supported the creation of "fat binaries" that contained
complete execution images for each platform.  We may see this again on the PC
platform when Itanium and Sledgehammer finally ship, but I doubt it will be to
the degree seen previously.  The reason why is simply because the IA32 platform
will do fine for home use for the foreseeable future, and there will be no rush
to 64-bit PC processing in the home.  Finally, PC speed is catching up with
demand.  Web browsing, word processing and playing MP3s do just fine on a
P2-233 with 256 MB DRAM!

In essence, "low-level" optimization for the PC platform is a dead topic.
There is no such thing as "the" PC platform:  It defines a huge spectrum.  This
relegates true target optimization to those platforms that are closed and/or
embedded, such as game consoles and automobiles, where there is far less
variability, and thus far more benefit to be gained by wringing every last
measure of performance from a fixed platform.

Of course, the high-level optimizations, at the algorithm and data structure
level, will always be important, since they can improve speed independent of
the processor architecture.  This is where most college programs currently
focus, and this is what the PC industry needs most.  But it is also the reason
why experienced embedded and real-time programmers can name their salary in the
present economy:  Colleges no longer produce graduates skilled in optimization
at the hardware level.

I was extremely fortunate to attend UC San Diego when I did:  The Computer
Engineering program had just been started, and it was a killer:  All of a CS
degree, plus the digital half of an EE degree.  It took me five years to
finish, and the only possible way to finish in four years was to go to summer
session every summer!  We were fed computer architecture at all levels, from
theoretical to practical, from registers to systems.  I took classes from Ken
Bowles, the designer of the UCSD P-System and UCSD Pascal (the virtual machine
and language implementation Java borrowed heavily from).  I studied processor
architecture ranging from Turing machines through 4-bit microprocessors (we had
to design one using no more than 2000 transistors), all the way through Cray
multiprocessors, data-flow processors and early massively parallel systems.  We
studied multi-level caches when they existed only in machines costing multiple
millions of dollars.  We studied compiler implementation and optimization, and
the same for operating systems.  And we also learned the testing and analysis
methods needed to detect and measure performance, and the statistics needed to
make sense of the data.

I graduated just as OO was arriving.  But it seemed later CS and CE graduates
were given OO in the place of the hardware emphasis I had received.  Even then,
the need for software people with hardware knowledge was obvious:  CE grads,
and EE grads with CS minors, were getting job offers 25% higher than the
"plain" CS and EE graduates.  As CS has become more OO, and EE has gone toward
SOCs (Systems On a Chip), few schools produce graduates suited to the middle
ground, where the software meets the hardware.  The theory, tools and skills
needed to perform low-level optimization are no longer being taught.

To effectively optimize an application for cache utilization, you must know
what is going on within the cache(s).  Since it is not possible to dynamically
monitor the actual thread of execution in a processor with multiple cache
levels and instruction reordering logic, performance must be evaluated using
modeling, monitoring and statistics.  The model requires a detailed knowledge
of the processor architecture.  Monitoring requires not only the common
software tools (such as code profilers), but also hardware tools such as
processor emulators and logic analyzers.  And the statistics are needed to
obtain correlations between observed behavior and possible model states:  If
you obtain high correlations, the state of your model likely reflects the
actual internal state of the processor.

The key is the test sequences you design, implement and use.  It is
extraordinarily difficult to generate tests aimed at exercising and measuring
the performance of one part of a processor in isolation (even CPU manufacturers
have a hard time of it:  Remember the Pentium FDIV bug?):  The typical case is to
generate a large suite of tests, each of which tests all of the processor, with
hopefully a slight emphasis on one part or another.  The measurements gathered
from running the test suite are then used in conjunction with an "inverted"
model to attempt to reconstruct the processor state.  The process is
conceptually very similar to that done to reconstruct an image from raw CAT
scan data.

There are finally some tools arriving on the scene that can make the processor
internals "visible" again, much as they were before cache and instruction
reordering were implemented within the microprocessor:  Large-scale VHDL and/or
Verilog simulation.  It is now possible to obtain detailed and accurate VHDL
models for some microprocessors, and run that code on a hardware or software
simulator that allows you to feed application software to the model and watch
its precise internal state as the code is executed, caches and reordering
included.  Unfortunately, such models are not made available by Intel for any
of its PC processors.  But Motorola does provide them for several PowerPC
processors, and many manufacturers of embedded processors do likewise.  The
current trend of implementing processor cores in FPGAs and ASICs (such as by
ARC) guarantees that an accurate model must exist and be made available:  The
same VHDL code used to synthesize the core can be run on a simulator.

I have personally run the simulation for an 80386 clone:  I watched it "boot"
to a DOS prompt and respond to input from the keyboard.  All without a physical
80386 being in the room.

At some level, this is similar to what Bochs does.  Using one machine to
implement another.  Bochs does not try to make things "correct" at the silicon
or gate level:  It is able to stick to the higher-level ISA (instruction set
architecture) and processor description (with errata-not-bugs included, of
course).  And Plex86 doesn't even have to do this:  By being able to "assume"
native x86 hardware, the task is conceptually simplified.  If only the
implementation were simplified as well!  Since the x86 does not implement full
and complete self-virtualization, extensive software is needed to provide that
capability.  And the virtualization must be perfect from the perspective of the
guest OS and application.

For some reason, neither Intel nor AMD has announced plans for its
next-generation 64-bit architecture to include true self-virtualization (something
IBM mainframes have done for over 30 years).  If they do, applications such as
Plex86 will become indistinguishable from multi-tasking operating systems.  If
they don't, applications such as Plex86 will be needed for a long time to come.

All this should make it a bit clearer why the general case of optimizing
directly to the PC processor hardware is pretty much a lost cause:  There are
too many hardware variations to consider, and no adequate simulation models
exist for any but the simplest of them.

However, as I mentioned in an earlier post, if Plex86 were targeted to P2+ and
Athlon/Duron architectures, there is enough in common between these
architectures to permit significant (if not extensive) low-level optimization
to be performed.  Simply disregard '386 and '486 hardware, and I suspect much
can be done.  If we exclude the P2 and K5/6 architectures (and the original
cacheless Celeron), then even more low-level optimization can be done,
primarily in cache use improvements.

By the time Plex86 makes it to 1.0, I suspect the P6 architecture will be dead
and gone.  If we were to aim Plex86 exclusively at a P6 target, with the hope
that speed improvements will overwhelm any lack of optimization for future
architectures, then we could deal with no cache smaller than 128 Kbytes, and
optimize accordingly.  It may bring earlier architectures to their knees, but
you have to draw the line somewhere (the '386?), and draw that line at the
"right" place for Plex86 1.0.  It may place a hardship on some Plex86
developers who currently lack P3/Celeron or Athlon/Duron systems, but that
discomfort will likely be relatively short-term.

Make that decision, effectively restricting the target platform to a single
processor architecture, and then we can talk about more precise cache
optimization!

About the Pentium4:  I believe the initial release is a faint shadow of the
"true" P7 architecture, and we will see the current Pentium4 fade quickly just
as the early crippled Celerons did.  Plex86 tuned for the P6 architecture will
almost certainly run WORSE on the current Pentium4, even if the Pentium4 has a
significant clock speed advantage.  The current Pentium4 relies very heavily on
its trace cache and SSE2 instructions for improved performance, and was thus
given fewer instruction decoders and less on-board cache than any Pentium3.
And I am clueless about how to optimize code for a trace cache, and I have not
yet looked at SSE2.

Then again, I'm also clueless about how to optimize code for any Transmeta
processor!  (Or any other processor architecture that contains more software
than hardware.)

Speaking of Transmeta, did everyone hear the news that they will be helping AMD
with the IA32 emulation on Sledgehammer?  (AMD already had a previous contract
with Transmeta to improve power utilization for the laptop Athlon/Duron
processors.)  This new agreement presents a unique opportunity for AMD to offer
full and complete IA32 virtualization within the Sledgehammer native
environment (which utilizes fully merged instruction sets), much as the '386
did with its "virtual 8086" execution mode.  The Itanium appears to offer
separate IA32 and IA64 instruction sets, but without IA32 virtual machines
(right?).  If full IA32 virtualization happens, it will be a serious coup for
AMD.


-BobC


