Jani Averbach <[EMAIL PROTECTED]> posted [EMAIL PROTECTED],
excerpted below, on  Tue, 06 Jun 2006 19:08:16 -0600:

> Inspired by your comment, I installed 4.1.1 and did very un-scientific
> test: dcraw compiled [1] with gcc 3.4.5 and 4.1.1. Then convert one raw
> picture with it:
> 
> time dcraw-3 -w test.CR2
> real           0m10.338s
> user           0m9.969s
> sys            0m0.332s
> 
> time dcraw-4 -w test.CR2
> real           0m9.141s
> user           0m8.849s
> sys            0m0.292s
> 
> This is pretty good, and that was only the dcraw, all libraries are still
> done by gcc 3.4.x.
> 
> BR, Jani
> 
> P.S. gcc -march=k8 -o dcraw -O3 dcraw.c -lm -ljpeg -llcms

Very interesting.  I hadn't done any similar direct comparisons, but had
just been amazed at how much more responsive things seem to be with 4.1.x
as compared to 3.4.x.  Given the generally agreed rule of thumb that users
won't definitively notice a performance difference of less than about 15%,
I've estimated a difference of at least 20%, with everything compiled with
4.1.x as compared to 3.4.x.

One test I've always been interested in but have never done is the effect
of -Os vs -O2 vs -O3.  I think it's generally agreed from testing that -O2
makes a BIG difference as opposed to unoptimized or -O (-O1), but the
differences between -O2, -O3, and -Os are less clearly defined and, it
would appear, the best choice depends on what one is compiling.  I know
-O3 is actually supposed to be worse than -O2 in many cases, because the
effect of loop unrolling and similar optimizations is generally a marked
increase in code size, and the cost of the resulting cache misses, with
the CPU idling while it waits for memory, is often worse than the cycles
saved by the additional optimization.
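
When I do get around to it, the test I have in mind is basically Jani's
dcraw timing run repeated once per optimization level.  Something like the
following sketch, where the source file, test image, and base flags are
simply lifted from his P.S. rather than anything I've actually run:

for opt in -Os -O2 -O3; do
    gcc -march=k8 $opt -o dcraw$opt dcraw.c -lm -ljpeg -llcms
    echo "=== $opt ==="
    time ./dcraw$opt -w test.CR2
done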

For that reason, I've always tended to go to what could be argued to be
the other extreme, favoring -Os over -O2, figuring that in a multitasking
environment, even with today's increased cache sizes, and with main memory
increasingly falling behind CPU speeds (well, until CPU speeds started
leveling off recently as they moved to multi-core instead), -Os should in
general be faster than -O2 for the same reason that -O2 is so often faster
than -O3.

OTOH, there are a couple of specific optimizations that increase overall
code size yet improve the cache hit ratio anyway, offsetting the general
cache-hit advantage of -Os.  Perhaps the most significant of
these, where it can be used, is -freorder-blocks-and-partition.  The
effect of this flag is to cause gcc to try to regroup routines into "hot"
and "cold", with each group in its own "partition".  Hot routines are
those that are called most frequently, cold, the least frequently, so the
effect is that despite a bit of overall increase in code size, the most
frequently used routines will be in cache a much higher percentage of the
time as compared to generally un-reordered routines/blocks.  In theory,
that could dramatically affect performance as the CPU will far more
frequently find the stuff it needs in cache and not have to wait for it to
be retrieved from much slower main memory.  The biggest problem with this
flag is that there's a LOT of code that can't use it, including the
exception handling code so common to C++.  Now gcc does spot that and
turns the flag off, so no harm done, but it spits out warnings in the
process saying that it turned it off, and those warnings break a lot of
ebuilds at the configure step, since configure scripts often abort on
warnings they shouldn't abort on.  As a result, I've split my CFLAGS from
my CXXFLAGS, including -freorder-blocks-and-partition only in CFLAGS and
omitting it from CXXFLAGS.  I also have the weaker form, -freorder-blocks
(without the -and-partition), in both CFLAGS and CXXFLAGS, so it gets used
where the stronger partitioning form is turned off.  I've not done actual
performance tests on this either way, but I do know the occasional aborted
ebuild I'd get with the partition version in CXXFLAGS is no longer a
problem with it only in CFLAGS, and it hasn't seemed to cause me any
/problems/ since then, quite apart from its not-verified-by-me effect on
performance.

Likewise with the flags -frename-registers and -fweb.  Since registers are
the fastest of all memory, operating at full CPU speed, and these flags
increase the efficiency of register allocation, it is IMO worth invoking
them even at the expense of slightly increased code size.  This is likely
to be particularly true on amd64/x86_64, with its increased number of
registers in comparison to x86.  Our arch has those extra registers; we
might as well make the most of them!
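
To make that concrete, here's roughly the shape of the split in my
make.conf.  Take it as a sketch from memory rather than a verbatim copy,
and substitute whatever base flags you actually use:

# C gets the full partitioning treatment
CFLAGS="-march=k8 -Os -pipe -freorder-blocks -freorder-blocks-and-partition \
        -frename-registers -fweb"
# C++ drops -freorder-blocks-and-partition, since gcc disables it for the
# exception code anyway and the resulting warnings break some configure
# runs; the weaker -freorder-blocks stays.
CXXFLAGS="-march=k8 -Os -pipe -freorder-blocks -frename-registers -fweb"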

Conversely, I don't like the loop unrolling flags at all.  It's my
(unverified, no argument there) belief that these will be BIG drags on
performance, because they blow up single loop structures to multiple times
their original size, all in an (IMO misguided) effort to inline the loops
and avoid a few jump instructions.  Jumps are far less costly on x86_64,
and even on full 32-bit i586+ x86, than they were on the original
8088-80486 generations.  That's particularly true in the case of tight
loops where the entire loop will be in L1 cache.  With proper
pre-fetching, it's /possible/ inline unrolling of the loops could keep the
registers full from L1 and the code running at full CPU speed, as opposed
to the slight waits possibly necessary at the loopback jump for a fetch
from L1 instead of being able to continue full-speed register operations.
I believe it's much more likely, though, that inlining the unrolled loops
will either force code out to L2, or that the prefetching couldn't keep
the registers sufficiently full even with the inlining, so there'd be the
wait in any case.  -O2 does a bit of simple loop unrolling, which -Os
should discourage, but -O3 REALLY turns on the unrolling
(de)optimizations, if I'm correctly reading the gcc manpages, anyway.
It's that size-intensive loop unrolling that I most want to discourage,
which is why I'd seldom consider -O3 at all, and it's the big reason why
I favor -Os over -O2, even given -O2's comparatively limited unrolling.
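
The code-size half of that claim would at least be easy to check.  A
hypothetical sketch, again borrowing Jani's build command (I haven't run
this myself, so the numbers could go either way):

gcc -march=k8 -Os -o dcraw-Os dcraw.c -lm -ljpeg -llcms
gcc -march=k8 -O2 -o dcraw-O2 dcraw.c -lm -ljpeg -llcms
gcc -march=k8 -O3 -funroll-loops -o dcraw-O3 dcraw.c -lm -ljpeg -llcms
size dcraw-Os dcraw-O2 dcraw-O3    # compare the text segment per flag set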

However, arguably, -O2's limited loop unrolling is more optimal than
discouraging it with -Os.  I believe it would actually come down to the
code in question, and which is "better" as an overall system CFLAGS choice
very likely depends on exactly what applications one actually chooses to
merge.  It's also very likely dependent on how much multitasking an
individual installation routinely gets, on whether that's single-core or
multi-core/multi-CPU based multitasking, and on the specifics of the
sub-arch caching implementation.  (Intel's memory management, particularly
as the number of cores and CPUs increases, isn't at this point as
efficient as AMD's, tho with Conroe Intel is likely to brute-force its way
back into the leadership position for the single and dual-core models
normally found on desktops/laptops and low-end workstations, anyway,
despite AMD's currently more elegant memory management and better scaling
as the number of cores and CPUs reaches 4 and above.)

...

I HAVE come across a single VERY convincing demonstration of the
problems with gcc 3.x on amd64/x86_64, however.  This one blew me away --
it was TOTALLY unexpected and another guy and I spent quite some
troubleshooting time finding it, as a result.

Those of you using pan as your news client of choice may already be aware
of the fact that there's a newer 0.90+ beta series available.  Portage has
a couple of masked ebuilds for the series, but hasn't been keeping up, as
a new release has been coming out every weekend since April first (this
past weekend was an exception; Charles, the main developer, took a few
days' vacation).  Therefore, one can either build from source, or do what
I've been doing and rename the ebuild (in my overlay) for each successive
weekly release.  (My overlay ebuild is slightly modified as well, but no
biggie for this discussion.)
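
If anyone else wants to do the same rename-and-rebuild routine, it amounts
to something like this each week.  The overlay path and version numbers
below are just placeholders for wherever your overlay lives and whatever
the current weekly release happens to be:

cd /usr/local/portage/net-nntp/pan   # or wherever your overlay lives
cp pan-0.90.ebuild pan-0.91.ebuild   # reuse last week's ebuild for the new release
ebuild pan-0.91.ebuild digest        # regenerate the digest so portage accepts it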

Well, back before gcc-4.1.x was unmasked to ~amd64, one guy on the PAN
groups was having a /terrible/ time compiling the new PAN series, with the
then latest ~amd64 gcc-3.4.x.  With a gigabyte of memory, plus swap, he
kept running into insufficient memory errors.

I wondered how that could be, as I'm running a generally ~amd64 system
myself and had experienced no issues.  While I'm running 8 gig of memory
now, that's a fairly recent upgrade, and I had neither experienced
problems compiling pan before the upgrade, nor noticed it using a lot of
memory after it.  I run ulimit set to a gig of virtual memory (ulimit -v)
by default, and certainly would have expected to run into issues compiling
pan with that if it required that sort of memory.  I routinely /do/ run
into such problems merging kmail, and always have to boost my ulimit
settings to compile it, so I knew it would happen if pan really required
that sort of memory to compile.
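
For reference, the ulimit setup I'm talking about is nothing fancy.
Roughly this, with the values in kilobytes and the kmail figure just an
illustrative guess rather than a measured requirement:

# in ~/.bashrc or the like: default soft cap of 1 GiB virtual memory
ulimit -S -v 1048576

# when a single merge genuinely needs more (kmail, in my case), raise the
# soft limit for that shell before emerging:
ulimit -S -v 2097152
emerge kmail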

As it happened, while he was running ~amd64, he wasn't routinely using
--deep with his emerge --update world runs, so he had a number of packages
that were less than the very latest, and that's what he and I focused on
first as the difference between his and my systems, figuring a newer
version of /something/ I had must explain why I had no problem compiling
it while he did.

After he upgraded a few packages with no change in the problem, someone
else mentioned that it might be gcc.  Turned out he was right, it WAS gcc.
With gcc-3.4.x, compiling the new pan on amd64 at one point requires an
incredible 1.3 gigabytes of usable virtual memory for a single compile
job!  (That's one job as counted by the MAKEOPTS="-jX" setting.)  He
apparently had enough memory and swap to do it -- if he shut down X and
virtually everything else he was running -- but with everything he
normally had running left running while he compiled pan, he was hitting
errors due to lack of one or the other.

When out of curiosity I checked how much memory it took with gcc-4.1.0,
the version I was running at the time (tho it was masked for Gentoo
users), I quickly saw why I hadn't noticed a problem -- less than 300 MB
usage at any point.  I /think/ it was actually less than 200, but I didn't
verify that.  In any case, even at 300 MB, gcc 3.4.x was using OVER FOUR
TIMES that, at the JUST UNDER 1.3 GB required.  No WONDER I hadn't noticed
anything unusual compiling it with gcc-4.1.x, while he had all sorts of
problems with gcc-3.4.x!

I haven't verified this on x86, but I suspect the reason it didn't come up
with anyone else is that it's not a problem on x86.  gcc 3.4.x is
apparently fairly efficient at dealing with 32-bit memory addresses and is
already reasonably optimized for x86.  The same cannot be said for its
treatment of amd64.  While this pan case is certainly an extreme corner
case, it does serve to emphasize the fact that gcc-3.x was simply not
designed for amd64/x86_64, and its x86_64 capabilities are and will remain
"bolted on" and, as such, far more cumbersome and less efficient than they
/could/ be.  The 4.x rewrite provided the opportunity to change that and
it was taken.  As I've said, however, 4.0 was /just/ the rewrite and didn't
really do much else but try to keep regressions to a minimum.  With the
4.1 series, gcc support for amd64/x86_64 is FINALLY coming into its own,
and the performance improvements dramatically demonstrate that.  The jump
from 3.4.x to 4.1.x is truly the most significant thing to happen to gcc
support for amd64 since support was originally added, and it's probably
the biggest jump we'll ever see.  While improvements will continue to be
made, from this point on they will be incremental: significant, yes, but
not the blow-me-away improvements of 4.1, much as improvements have been
only incremental on x86 for some time.

...

Anyway... thanks for that little test.  The results are certainly
enlightening.  I'd /love/ to see some tests of the above -Os vs -O2 vs
-O3, and of the register and reorder flags vs the standard -Ox alone, if
you're up to it; I haven't bothered to run them myself, and just this
little test alone was quite informative and definitely more concrete than
the "feel" I've been basing my comments on to date.  Hopefully my above
comments prove useful to someone as well, and if I'm lucky, motivation for
some tests (by you or someone else) to prove or disprove them.  =8^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
