"Hausheer, Geoffrey" wrote:
> Since everyone is throwing in their two cents, here's mine:
> Writing assembly code is almost never the right solution anymore.
> This is because processors do their own optimization. Thus, code that has
> been tuned really well for a Pentium (say using instruction ordering to
> maximize use of the superscalar architecture) may well be incredibly
> inefficient on a Pentium II (which uses out-of-order execution) or on a
> Pentium IV or Athlon (which both have lots of execution engines and
> sophisticated schedulers). The compiler can be optimized to take advantage
> of these things (pgcc), but it is usually bad form to try to take advantage
> of these optimizations in user-written code. Writing in assembly to reduce
> the number of instruction calls, fine, trying to use 'better' instructions
> or place them in a certain order, bad for portability. This is not MIPS,
> DLX, or embedded programming. It's IA32 where there are no guarantees.
I no longer believe the primary goal of hand-tuned assembler is to take
advantage of general instruction execution characteristics. Today, on the very
few occasions I do hand-tune code, it is for the following reasons:
1. Get it to fit in and run from a faster cache.
This has proven critical for many uses of MMX, SSE, and 3DNOW (and similar SIMD
instruction subsets on other processors), where the x86 CISC architecture
morphs to become SIMD, more like a DSP. When doing math, code and operands
(data) have to be optimized concurrently. While many compilers do reasonably
well with code optimizations, they often lack the ability to perform cache-use
optimizations for code, and none (that I'm aware of) performs them for data. In
general, performance of SIMD code is extraordinarily sensitive to the locality
of the data, which is why most SIMD instruction sets operate primarily from
registers. The same considerations can apply to just about any code that needs
to run often and run fast.
If you want to see a good example of data-locality optimization, take a look at
some of the Cray compilers: Many are Open Source or Public Domain, and they
prove how re-arranging your data in RAM and cache can greatly improve the
performance of any large calculation on a vector (SIMD) machine.
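A minimal sketch of the kind of data re-arrangement I mean (my own example, not from any particular compiler): converting an "array of structures" into a "structure of arrays" turns strided loads into unit-stride, contiguous streams, which is exactly what cache lines and register-based SIMD units want.

```c
#include <stddef.h>

#define N 1024

/* AoS: the x, y, z of one point are adjacent, so consecutive x's are
   sizeof(struct point_aos) bytes apart -- a strided access pattern. */
struct point_aos { float x, y, z; };

/* SoA: all the x's are contiguous, so a SIMD unit (MMX/SSE/3DNow!)
   can fetch several of them with one aligned vector load, and each
   cache line fill brings in nothing but useful operands. */
struct points_soa { float x[N], y[N], z[N]; };

float sum_x_aos(const struct point_aos *p) {
    float s = 0.0f;
    for (size_t i = 0; i < N; i++)
        s += p[i].x;          /* stride of 12 bytes between loads */
    return s;
}

float sum_x_soa(const struct points_soa *p) {
    float s = 0.0f;
    for (size_t i = 0; i < N; i++)
        s += p->x[i];         /* unit stride: cache- and SIMD-friendly */
    return s;
}
```

Both functions compute the same sum; the point is purely how the operands sit in RAM and cache when the loop runs.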
2. Employ instructions or instruction sequences the compiler fails to generate.
I often take a pass through the performance-critical sections of the assembly
code generated by compilers to identify sequences that are obviously
non-optimal, either in size or the instructions selected for the operation,
especially where data handling is concerned. In essence, I act as a human
peephole optimizer, since I only look at the code from a local perspective.
When doing this, I typically focus on optimizing the code for the SLOWEST
processor the code needs to perform well on (a very typical consideration for
embedded systems). More often than I'd like to admit, it is still a 33 MHz
'386!
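As an illustration of the sort of local rewrite involved (my example, not one of the author's actual "tricks"): on a slow target like a '386, where an integer multiply costs many cycles, a human scanning the generated assembly might replace a plain multiply with the classic shift-and-add strength reduction, which older compilers did not always emit.

```c
unsigned mul10_naive(unsigned x) {
    /* An older compiler may emit a plain IMUL here, which on a '386
       costs on the order of tens of cycles. */
    return x * 10;
}

unsigned mul10_tuned(unsigned x) {
    /* 10x = 8x + 2x: two cheap shifts and an add.  This is the kind
       of substitution a human peephole pass patches into the .s file
       when the compiler misses it. */
    return (x << 3) + (x << 1);
}
```

Modern compilers perform this transformation routinely, which is part of why hand-tuning of this sort pays off less every year.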
Over the years I have developed a few standard "tricks" I use frequently.
About 10 years ago I made a serious effort to incorporate some of these tricks
into GCC. While I succeeded in the technical sense, I failed miserably
overall: My changes were just too special purpose, and slowed down the
optimizers (PH and RTL) beyond belief. (For some reason, nobody wanted to
commit my changes, not even me. Not even after I created a "-Ocomplex" flag to
isolate my changes.)
Now and then (but less every year), I still hand-scan the assembly code. I
haven't written any significant software directly in assembler (from scratch)
in well over a decade. I find it is always better to start with the output of
a C/C++ compiler, and use a hardware emulator and an execution profiler
(typical embedded tools, along with a logic analyzer) to find obvious
bottlenecks. First I try to tweak the C/C++ code, and only if that fails will
I hack the assembly code.
That's another aspect of optimization that is seldom addressed: Learning the
specific strengths and weaknesses of a compiler, then coding to make best use
of the strengths, and avoid the weaknesses. Only by becoming familiar with
assembly is it possible to make useful determinations in this area: Relying on
compiler benchmarks is never useful in the real world, since benchmarks consist
only of specific code sequences you are likely never to use in an identical
way. It is far better to use multiple compilers on your actual code and run it
on your actual target platform, which will then allow you to select the best
compiler overall. I have even selected different compilers for different
modules within a single project, since that allows me to continue to work at a
higher level, and worry less about hand optimization of the assembly output.
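One concrete example of coding to a compiler's weaknesses (again my own sketch, not from the post): older compilers could not prove that a value loaded through a pointer stayed unchanged across stores through another pointer, so they reloaded it on every loop iteration. Copying it into a local tells even a weak compiler it is loop-invariant.

```c
/* The compiler must assume len might alias dst, so *len is
   re-read on every pass through the loop. */
void scale_slow(int *dst, const int *src, const int *len) {
    for (int i = 0; i < *len; i++)
        dst[i] = src[i] * 2;
}

/* Hoisting the load by hand removes the aliasing question entirely;
   any compiler, however weak, keeps n in a register. */
void scale_fast(int *dst, const int *src, const int *len) {
    int n = *len;
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * 2;
}
```

Learning which of these hints a given compiler needs, and which it can figure out on its own, is exactly the familiarity with its strengths and weaknesses described above.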
When I started using C++, this was an especially critical optimization phase.
On a Windows project (the only one I have *ever* done), the GUI team "had" to
use MS C++ (for MFC access), while my team used Borland C++ for its better
code generation and far superior adherence to the C++ spec (at that time, at
least). We also used an early DOS port of gcc/g++ to check parts of our code
(better warnings in some situations, and it sometimes generated "strange"
assembly code that was often a learning experience to understand).
In general, when performance is critical, I recommend taking a careful look at
the high-level design and code, and then using multiple compilers, before
I recommend hand-tweaking of the assembler code. That way, even if you do have
to resort to hand tweaking, you have more than one set of generated assembly
code to refer to, which greatly simplifies the overall process.
Of course, that doesn't really apply to C under Linux, since it is pretty
much wedded to gcc. While other Linux C compilers are available, and more of
them offer "free" binaries or are going Open Source, gcc is still the 600 lb
gorilla in Linux land.
To compare, I recently counted over 150 (!) compilers available for the 8051
family of processors (and their clones and derivatives). It seems that
compilers for "simple" CPUs (where hand assembly is far easier) are like
graphics toolkits under Linux: Everyone wants to write their own.
-BobC