Jani Averbach <[EMAIL PROTECTED]> posted [EMAIL PROTECTED], excerpted below, on Tue, 06 Jun 2006 19:08:16 -0600:
> Inspired by your comment, I installed 4.1.1 and did very un-scientific
> test: dcraw compiled [1] with gcc 3.4.5 and 4.1.1. Then convert one raw
> picture with it:
>
> time dcraw-3 -w test.CR2
> real 0m10.338s
> user 0m9.969s
> sys 0m0.332s
>
> time dcraw-4 -w test.CR2
> real 0m9.141s
> user 0m8.849s
> sys 0m0.292s
>
> This is pretty good, and that was only the dcraw, all libraries are still
> done by gcc 3.4.x.
>
> BR, Jani
>
> P.S. gcc -march=k8 -o dcraw -O3 dcraw.c -lm -ljpeg -llcms

Very interesting. I hadn't done any similar direct comparisons, but had
just been amazed at how much more responsive things seem to be with 4.1.x
as compared to 3.4.x. Given the generally agreed rule of thumb that users
won't definitively notice a difference of less than about 15% performance,
I've estimated at minimum a 20% difference, with everything compiled 4.1.x
as compared to 3.4.x.

One test I've always been interested in but never done is the effect of
-Os vs -O2 vs -O3. I think it's generally agreed from testing that -O2
makes a BIG difference as opposed to unoptimized or -O (-O1), but the
differences between -O2, -O3, and -Os are less clearly defined and, it
would appear, which is best depends on what one is compiling.

I know -O3 is actually supposed to be worse than -O2 in many cases,
because loop unrolling and similar optimizations generally increase code
size markedly, and the cost of the resulting cache misses, with the CPU
idling while it waits on memory, is often worse than the cycles saved by
the additional optimization. For that reason, I've always tended toward
what could be argued to be the other extreme, favoring -Os over -O2,
figuring that in a multitasking environment, even with today's increased
cache sizes, and with main memory increasingly falling behind CPU speeds
(well, until CPU speeds started leveling off recently as they moved to
multi-core instead), -Os should in general be faster than -O2 for the
same reason -O2 is so often faster than -O3.

OTOH, there are a couple of specific optimizations that increase overall
code size while increasing cache hit ratios as well, negating the usual
size-vs-cache reasoning behind -Os. Perhaps the most significant of
these, where it can be used, is -freorder-blocks-and-partition. This flag
causes gcc to try to regroup routines into "hot" and "cold", with each
group in its own "partition". Hot routines are those called most
frequently, cold the least, so despite a bit of overall increase in code
size, the most frequently used routines will be in cache a much higher
percentage of the time than generally un-reordered routines/blocks would
be. In theory, that could dramatically affect performance, as the CPU
will far more frequently find the stuff it needs in cache and not have to
wait for it to be retrieved from much slower main memory.

The biggest problem with this flag is that there's a LOT of code that
can't use it, including the exception code so common to C++. Now, gcc
does spot that and turn the flag off, so no harm done, but it spits out
warnings in the process saying it turned it off, and this breaks a lot of
ebuilds in the configure step, as they often abort on those warnings when
they shouldn't. As a result, I've split my CFLAGS from my CXXFLAGS and
only include -freorder-blocks-and-partition in my CFLAGS, omitting it
from CXXFLAGS.
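In make.conf terms, the minimal form of that split looks something like
the following. This is only a sketch, not my literal file -- the -march
and -Os choices are just the ones argued above, and the other flags I
mention below are omitted for clarity:

  # C gets the partitioning flag; C++ doesn't, to avoid the
  # exception-handling warnings that abort some configure scripts.
  CFLAGS="-march=k8 -Os -pipe -freorder-blocks-and-partition"
  CXXFLAGS="-march=k8 -Os -pipe"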
I also have the weaker form -freorder-blocks (without the -and-partition)
in both CFLAGS and CXXFLAGS, so it gets used where the stronger
partitioning form is turned off. I've not done actual performance tests
on this either way, but I do know the occasional problem with an aborted
ebuild I'd have with the partition version in CXXFLAGS is no longer a
problem with it only in CFLAGS, and it hasn't seemed to cause me any
/problems/ since then, quite apart from its not-verified-by-me effect on
performance.

Likewise with the flags -frename-registers and -fweb. Since registers are
the fastest of all memory, operating at full CPU speed, and these
increase the efficiency of register allocation, it is IMO worth invoking
them even at the expense of slightly increased code size. This is likely
to be particularly true on amd64/x86_64, with its increased number of
registers in comparison to x86. Our arch has those extra registers; we
might as well make as much use of them as we possibly can!

Conversely, I don't like the unroll-loops flags at all. It's my
(unverified, no argument there) belief that these will be BIG drags on
performance, because they blow up single loop structures to multiple
times their original size, all in an (IMO misguided) effort to inline the
loops, preventing a few jump instructions. Jumps are far less costly on
x86_64, and even on full 32-bit i586+ x86, than they used to be back in
the 8088-80486 generations. That's particularly true in the case of tight
loops where the entire loop will be in L1 cache. With proper
pre-fetching, it's /possible/ inline unrolling of the loops could keep
the registers full from L1 and the code running at full CPU speed, as
opposed to the slight waits possibly necessary at the loopback jump for a
fetch from L1 instead of being able to continue full-speed register
operations, but I believe it's much more likely that inlining the
unrolled loops will either force code out to L2, or that the prefetching
couldn't keep the registers sufficiently full even with inlining, so
there'd be the wait in any case.

-O2 does a bit of simple loop unrolling, which -Os should discourage, but
-O3 REALLY turns on the unrolling (de)optimizations, if I'm correctly
reading the gcc manpages, anyway. It's that size-intensive loop unrolling
that I most want to discourage, which is why I'd seldom consider -O3 at
all, and the big reason why I favor -Os over -O2, even given -O2's more
limited loop unrolling.

However, arguably, -O2's limited loop unrolling is more optimal than
discouraging it with -Os. I believe it would actually come down to the
code in question, and which is "better" as an overall system CFLAGS
choice very likely depends on exactly what applications one actually
chooses to merge. It's also very likely dependent on how much
multitasking an individual installation routinely gets, and whether
that's single-core or multi-core/multi-CPU based multitasking, plus the
specifics of the sub-arch caching implementation. (Intel's memory
management, particularly as the number of cores and CPUs increases, isn't
at this point as efficient as AMD's, tho with Conroe, Intel is likely to
brute-force its way back into the leadership position for the single and
dual-core models normally found on desktops/laptops and low-end
workstations, despite AMD's more elegant memory management, which scales
better as the number of cores and CPUs climbs to 4 and above.)
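If anyone does want to put numbers to the -Os vs -O2 vs -O3 question, a
crude harness in the spirit of Jani's dcraw test might look like the
following. The source file, sample image, and -march value are lifted
straight from his P.S.; this is only a sketch, single run per level, no
averaging or cache warm-up:

  #!/bin/bash
  # Build dcraw once per optimization level...
  for opt in -Os -O2 -O3; do
      gcc -march=k8 $opt -o dcraw$opt dcraw.c -lm -ljpeg -llcms
  done
  # ...then time each binary converting the same raw image.
  for opt in -Os -O2 -O3; do
      echo "=== dcraw built with $opt ==="
      time ./dcraw$opt -w test.CR2
  done

Repeating each timing a few times and keeping the best run would help
factor out cold-cache noise.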
... I **HAVE** come across a single **VERY** convincing demonstration of
the problems with gcc 3.x on amd64/x86_64, however.

This one blew me away -- it was TOTALLY unexpected, and another guy and I
spent quite some troubleshooting time finding it as a result.

Those of you using pan as your news client of choice may already be aware
that there's a newer 0.90+ beta series available. Portage has a couple of
masked ebuilds for the series, but hasn't been keeping up, as a new one
has been coming out every weekend since April first (this past weekend
excepted; Charles, the main developer, took a few days vacation).
Therefore, one can either build from source, or do what I've been doing
and rename the ebuild (in my overlay) for each successive weekly release.
(My overlay ebuild is slightly modified as well, but no biggie for this
discussion.)

Well, back before gcc-4.1.x was unmasked to ~amd64, one guy on the pan
groups was having a /terrible/ time compiling the new pan series with the
then-latest ~amd64 gcc-3.4.x. With a gigabyte of memory, plus swap, he
kept running into insufficient-memory errors.

I wondered how that could be, as I'm running a generally ~amd64 system
myself and had experienced no issues. While I'm running 8 gig of memory
now, that's a fairly recent upgrade, and I had neither experienced
problems compiling pan before, nor noticed it using a lot of memory after
the upgrade. I run ulimit set to a gig of virtual memory (ulimit -v) by
default, and certainly would have expected to run into issues compiling
pan with that if it required that sort of memory. I routinely /do/ run
into such problems merging kmail, and always have to boost my ulimit
settings to compile it, so I knew it would happen if pan really required
that sort of memory to compile.

As it happened, while he was running ~amd64, he wasn't routinely using
--deep with his emerge --update world runs, so he had a number of
packages that were less than the very latest, and that's what he and I
focused on first as the difference between his system and mine, figuring
a newer version of /something/ I had must explain why I had no problem
compiling it while he did. After he upgraded a few packages with no
change in the problem, someone else mentioned that it might be gcc.
Turned out he was right: it WAS gcc.

With gcc-3.4.x, compiling the new pan on amd64 at one point requires an
incredible 1.3 gigabytes of usable virtual memory for a single compile
job! (That's a single job in the MAKEOPTS=-jX sense.) He apparently had
enough memory and swap to do it -- if he shut down X and virtually
everything else he was running -- but was experiencing errors due to lack
of one or the other with everything he normally had running still going
while he compiled pan.

When out of curiosity I checked how much memory it took with gcc-4.1.0,
the version I was running at the time (tho it was masked for Gentoo
users), I quickly saw why I hadn't noticed a problem -- less than 300 MB
usage at any point. I /think/ it was actually less than 200, but I didn't
verify that. In any case, even at 300 MB, gcc-3.4.x was using OVER FOUR
TIMES that, at JUST LESS THAN 1.3 GB required. No WONDER I hadn't noticed
anything unusual compiling it with gcc-4.1.x, while he had all sorts of
problems with gcc-3.4.x!

I haven't verified this on x86, but I suspect the reason it didn't come
up with anyone else is that it's not a problem on x86. gcc-3.4.x is
apparently fairly efficient at dealing with 32-bit memory addresses and
is already reasonably optimized for x86. The same cannot be said for its
treatment of amd64.
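For anyone who wants to reproduce that, the same ulimit trick I mentioned
makes a crude test harness: clamp virtual memory below the suspect
threshold and see whether the compile dies. The cap values below are just
my rough numbers from above, and the emerge of pan stands in for whatever
overlay ebuild or from-source build applies:

  # Use subshells so the limits don't stick to the login shell.
  # A ~400 MB cap is plenty for pan under gcc-4.1.x, by my observations...
  ( ulimit -v 409600; emerge --oneshot pan )

  # ...while under gcc-3.4.x the cap has to reach roughly 1.3 GB
  # per compile job before the build squeaks through.
  ( ulimit -v 1400000; emerge --oneshot pan )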
While this pan case is certainly an extreme corner case, it does serve to
emphasize the fact that gcc-3.x was simply not designed for amd64/x86_64;
its x86_64 capacities are and will remain "bolted on", and as such, far
more cumbersome and less efficient than they /could/ be. The 4.x rewrite
provided the opportunity to change that, and it was taken. As I've said,
however, 4.0 was /just/ the rewrite, and didn't really do much else but
try to keep regressions to a minimum. With the 4.1 series, gcc support
for amd64/x86_64 is FINALLY coming into its own, and the performance
improvements dramatically demonstrate that.

The jump from 3.4.x to 4.1.x is truly the most significant thing to
happen to gcc support for amd64 since that support was originally added,
and it's probably the biggest jump we'll ever see. While improvements
will continue to be made, from this point on they will be incremental:
significant, yes, but not the blow-me-away improvements of 4.1, much as
improvements have been only incremental on x86 for some time.

... Anyway, thanks for that little test. The results are certainly
enlightening. I'd /love/ to see some tests of the above -Os vs -O2 vs -O3
and register and reorder flags vs the standard -Ox alone, if you are up
to it, but I haven't bothered to run them myself, and just this little
test alone was quite informative, and definitely more concrete than the
"feel" I've been basing my comments on to date. Hopefully my comments
above prove useful to someone as well, and, if I'm lucky, motivation for
some tests (by you or someone else) to prove or disprove them. =8^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

-- 
[email protected] mailing list
