Re: g++, trunk, recent weird mismatch for arguments with forwarded declaration when attributes are involved
Hello, i know it's bad form to reply to oneself, or to be that insistent, but i've been hit again. In the bug report discussion, i've been told by A. Pinski that, as of now, forward declarations shall have matching attributes. That's fine, i suppose. What's not is that:
 . that new behavior, as far as i know, isn't documented anywhere.
 . there's no warning or error at the declaration/definition point.
 . it's not consistent (non-compliance only fails under some unknown conditions).
 . when you finally get an error, it's about a vaguely related prototype mismatch somewhere else.
Would it be possible to have some clarifications? Shall i file a PR for a warning? Sacrifice a goat?
PS: now i know better, but i can assure you, anyone running into that issue is bound to waste tremendous amounts of time trying to figure out what's wrong with their prototype.
-Wdouble-promotion noise
Hello, I could really use -Wdouble-promotion but, atm, it appears quite impractical:
$ cat double.cc
#include <cstdio>
void foo(...);
int main() {
  float f = 1;
  foo(f);
  printf("%f", f);
}
$ /usr/local/gcc-4.6-20100913/bin/g++ -Wdouble-promotion double.cc
double.cc: In function 'int main()':
double.cc:5:7: warning: implicit conversion from 'float' to 'double' when passing argument to function [-Wdouble-promotion]
double.cc:6:16: warning: implicit conversion from 'float' to 'double' when passing argument to function [-Wdouble-promotion]
... and the interesting bits are lost in the noise. I can't think of a workaround. So i have to ask: is that how it's meant to be, or simply a temporary shortcoming? Have i missed an obvious kludge?
Re: g++, trunk, recent weird mismatch for arguments with forwarded declaration when attributes are involved
On Tue, Sep 14, 2010 at 4:51 PM, Ian Lance Taylor i...@google.com wrote:
> Please do file a PR if there isn't one already. Thanks.
I have no idea if that could happen outside C++ and couldn't find anything relevant, thus http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45668
That's the best i can do. And thanks for your assistance.
Re: -Wdouble-promotion noise
On Tue, Sep 14, 2010 at 4:58 PM, Ian Lance Taylor i...@google.com wrote:
> This question is not appropriate for the mailing list g...@gcc.gnu.org.
> ...
> This is among the kinds of things which -Wdouble-promotion is documented to warn about, so, yes, this is how it's meant to be.
Honestly i've pondered not sending that previous mail, but i guess i just disagreed with the prescribed use-case; frankly i wonder how one can expect end-users to forgo or patch each & every variadic function to use it. I'll get back to grepping for *pd/*sd and the likes, as it is infinitely more practical. Sorry for the fuss.
Re: -Wdouble-promotion noise
On Tue, Sep 14, 2010 at 7:14 PM, Ian Lance Taylor i...@google.com wrote:
> What is it that you want?
I'd like to have a warning for when a value of type float is implicitly promoted to double, for performance reasons (on x86). Note that in that context, caring about variadic functions makes little sense to begin with (by the time the prologue is done, any notion of performance is a fairy tale).
I can't use
 -Wconversion, way too much noise.
 -Wunsuffixed-float-constants, unavailable in C++.
 -fsingle-precision-constant, indiscriminate and wrong.
either, and, because there may be some debugging/pretty-printing code around, -Wdouble-promotion is useless. So, i'm back to grep.
Tho, i got to say, it was really evil to tease me like that with -Wdouble-promotion :)
Re: -Wdouble-promotion noise
On Tue, Sep 14, 2010 at 8:44 PM, Ian Lance Taylor i...@google.com wrote:
> Let me put it a different way: what is it that you want, expressed in terms of C/C++ code? What should the compiler be warning about?
Hmm. I think the provided example captures most of what i care about,
  float area(float radius) { return 3.14159 * radius * radius; }
and even forgotten/disguised uses of, say, M_PI, would fall in that category after all.
>> -Wunsuffixed-float-constants, unavailable in C++.
> It seems that this could be added to C++ easily enough.
If i'm not mistaken (i've never put it to real use) -Wunsuffixed-float-constants would handle that. So, it wouldn't be as airtight as -Wdouble-promotion but still good enough and a fantastic improvement.
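A minimal sketch of why the area() example above ends up in double precision: an unsuffixed literal such as 3.14159 has type double, so the float operands are converted before the multiplication. The names area_promoted and area_single are made up for illustration.

```cpp
#include <type_traits>

// Unsuffixed literal: 3.14159 is a double, so radius is promoted and the
// product is computed in double before being narrowed back to float.
inline float area_promoted(float radius) { return 3.14159 * radius * radius; }

// Suffixed literal: every operand is a float, no promotion takes place.
inline float area_single(float radius) { return 3.14159f * radius * radius; }

// The type of each kind of expression, checked at compile time.
static_assert(std::is_same<decltype(3.14159 * 1.0f), double>::value,
              "double literal * float yields double");
static_assert(std::is_same<decltype(3.14159f * 1.0f), float>::value,
              "float literal * float yields float");
```

Only the first variant contains the implicit float to double conversion discussed in this thread; the second is what a suffix (or a float-typed constant instead of M_PI) buys.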
Re: -Wdouble-promotion noise
On Tue, Sep 14, 2010 at 9:47 PM, Ian Lance Taylor i...@google.com wrote:
> So far my best guess is that your definition is "warn about implicit conversions from float to double except for those conversions caused by default argument promotion applied to arguments passed to unnamed parameters". Is that what you want?
Hypothetically i'd grok what you just wrote and answer clearly, but i'm no language lawyer and only use compilers: I don't know.
I know what i don't want: to be surprised to find unwanted non-single-precision instructions or data in critical parts of the binary at the far end of the tool chain. That's my definition ;)
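For readers who, like the poster, are not language lawyers: the "default argument promotion" mentioned above is the rule that a float passed through an ellipsis always arrives as a double. A small sketch, with sum_variadic being a hypothetical helper and not anything from GCC:

```cpp
#include <cstdarg>

// Sums `count` values passed through the ellipsis. Because of the default
// argument promotions a float argument arrives as a double, so va_arg has
// to ask for double (asking for float would be undefined behaviour).
inline double sum_variadic(int count, ...) {
    va_list args;
    va_start(args, count);
    double total = 0.0;
    for (int i = 0; i < count; ++i)
        total += va_arg(args, double);
    va_end(args);
    return total;
}
```

A call such as sum_variadic(2, 1.5f, 2.25f) promotes both floats on the way in. That conversion is mandated by the language and cannot be avoided at the call site, which is why it is a different animal from the arithmetic promotion in area().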
g++, trunk, recent weird mismatch for arguments with forwarded declaration when attributes are involved
Since about 2010/09/07 i've had a weird error with a mismatched prototype involving an argument once forward declared as 'class foo;' and later defined as 'class __attribute((aligned(16))) foo {...};', a bit like
namespace n1 {
  class fwd;
  namespace n2 {
    class foo { void bar(fwd &); };
  }
  class __attribute((aligned(16))) fwd {};
}
// error prototype for... candidate is... would be here.
void n1::n2::foo::bar(n1::fwd &) {}
Except that's no testcase because it works: to kludge around, i have to forward declare with matching attributes (or get an error).
I fail to reduce it, and the source code is large & fugly; i'd prefer not to have to disclose it, hence no formal bug report. My hope being that this will ring a bell for whoever's responsible :)
PS: x86-64/32, linux, -std=c++0x.
Re: g++, trunk, recent weird mismatch for arguments with forwarded declaration when attributes are involved
On Fri, Sep 10, 2010 at 5:20 PM, Ian Lance Taylor i...@google.com wrote:
> Since you do have a test case, you could try using a tool like delta to reduce it to something that you can share.
My delta-fu is too weak to get anywhere with an error so easily produced (mismatched prototype, plus g++'s senseless diagnostic doesn't help either). I've given up and submitted http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45642.
But thanks for the input.
Re: g++ 4.5.0, end-user disappointment and interrogations
On Thu, Apr 22, 2010 at 6:36 AM, Paolo Carlini paolo.carl...@oracle.com wrote:
> In any case, keep in mind that constexpr are not available yet, maybe the parser can already recognize some uses but the semantics is not done yet.
Ah, so it was nothing but smoke & mirrors. Thanks for the clarification.
Re: g++ 4.5.0, end-user disappointment and interrogations
On Thu, Apr 22, 2010 at 7:23 AM, Xinliang David Li davi...@google.com wrote:
> The dead store problem seems to be a regression in SRA.
Thanks for looking into it. http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43846
Re: g++ 4.5.0, end-user disappointment and interrogations
On Fri, Apr 23, 2010 at 5:48 AM, Dave Korn dave.korn.cyg...@googlemail.com wrote:
> Dear tbp, please don't accuse people of being deceptive or fraudulent, it is not a nice thing to do.
Indeed. That wasn't the intent. Seeing libstdc++ being combed over for constexpr, i've conveniently fooled myself into believing my hopes were realised. That i am ignorant, hermetic to facts and terminally clueless is, i fear, congruent to my current condition of end-user.
PS: If not for the risk of further aggravation, i'd contend that i have nowhere accused people; but that's best forgotten.
g++ 4.5.0, end-user disappointment and interrogations
Hello, having finally built myself a 4.5.0 (linux x86-64), i've quickly tried it on some of my code and it soon became apparent some things weren't for the better. Here's my febrile attempt to sum up what surprised me
$ cat huh.cc
#include <cmath>
#if __GNUC__ * 100 + __GNUC_MINOR__ < 405
#define constexpr
#endif
struct foo_t {
  float x, y, z;
  foo_t() {}
  constexpr foo_t(float a, float b, float c) : x(a),y(b),z(c) {}
  friend foo_t operator*(foo_t lhs, float s) { return foo_t(lhs.x*s, lhs.y*s, lhs.z*s); }
  friend float dot(foo_t lhs, foo_t rhs) { return lhs.x*rhs.x + lhs.y*rhs.y + lhs.z*rhs.z; }
};
struct bar_t {
  float m[3];
  bar_t() {}
  constexpr bar_t(float a, float b, float c) : m{a, b, c} {}
  friend bar_t operator*(bar_t lhs, float s) { return bar_t(lhs.m[0]*s, lhs.m[1]*s, lhs.m[2]*s); }
  friend float dot(bar_t lhs, bar_t rhs) { return lhs.m[0]*rhs.m[0] + lhs.m[1]*rhs.m[1] + lhs.m[2]*rhs.m[2]; }
};
namespace {
  template<typename T> float magsqr(T v) { return dot(v, v); }
  template<typename T> T norm(T v) { return v*(1/std::sqrt(magsqr(v))); }
  constexpr foo_t foo(1, 2, 3);
  constexpr bar_t bar(1, 2, 3);
}
void frob1(const foo_t &a, foo_t &b) { b = norm(a); }
void frob2(const bar_t &a, bar_t &b) { b = norm(a); }
int main() { return 0; }
$ g++ -std=c++0x -O3 -march=native -ffast-math -mno-recip huh.cc
a) Code produced for frob1 and frob2 differs (a dead store isn't removed with the array variant), when it used not to (for example with g++ 4.4.1); that's a really annoying regression (can't index foo_t members etc...).
b) Note the rsqrtss in there: -ffast-math turns -funsafe-math-optimizations on which, now, also turns on -freciprocal-math; the old -m[no-]recip switch that used to direct the emission of reciprocals is useless; no warnings of any sort emitted. The only mention of the new behaviour is in the manual (nothing in http://gcc.gnu.org/gcc-4.5/changes.html).
c) constexpr apparently makes no difference, stuff still gets constructed/stored at runtime. Vectors aren't allowed either: error: parameter '__vector(4) float v' is not of literal type; even if that's what the standard says, it would have been handy.
Q: Is the dead store removal/fuss with arrays a known/transient issue soon to be fixed (again)? Would it be possible to foolproof -ffast-math/-freciprocal-math/-mrecip in some way? What's the deal with constexpr (or what can i reasonably expect)?
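As a point of comparison for question c), here is a sketch of what a compiler with complete C++11 constexpr semantics is expected to accept (an assumption about later compilers, not a statement about 4.5.0). The names point_t, dot3 and unit are invented for this example.

```cpp
struct point_t {
    float x, y, z;
    constexpr point_t(float a, float b, float c) : x(a), y(b), z(c) {}
};

// A single return statement keeps this within the C++11 constexpr rules.
constexpr float dot3(const point_t &l, const point_t &r) {
    return l.x * r.x + l.y * r.y + l.z * r.z;
}

constexpr point_t unit(1, 2, 3);

// Evaluated by the compiler: this only builds if `unit` really is a
// compile-time constant.
static_assert(dot3(unit, unit) == 14.0f, "1*1 + 2*2 + 3*3");
```

If the static_assert is accepted, the object is a genuine constant expression rather than something constructed at startup.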
Re: __attribute__((optimize)) and fast-math related oddities
On Mon, Oct 19, 2009 at 7:34 PM, Ian Lance Taylor i...@google.com wrote:
> Please file a bug report. __attribute__((optimize())) is definitely only half-baked.
Apparently the code i've posted is just a variation around that 1 year old PR 37565, and if that doesn't work, worrying about the rest is entirely futile. Half baked you say? It's comforting to see that much optimism, but couldn't the doc be adjusted a bit to reflect the fact that the baker got hit by a bus or something?
PS: i'm sorry that i've missed that PR in my search, but i presumed the issue was much more specific.
__attribute__((optimize)) and fast-math related oddities
Hang on while i put on my flame-proof suit. There. Merrily trying to make a test-case showing how unmanageable it is to try to override *math* flags per function, i soon had to stop because...
$ cat amusing.cc
#include <cmath>
static __attribute__((optimize("-fno-associative-math"))) double foo1(double x) {
  return (x + pow(2, 52)) - pow(2, 52);
}
static __attribute__((noinline)) double bar1(double x) { return foo1(x); }
#ifdef HUH
static __attribute__((optimize("-fno-associative-math"))) double foo2(double x) {
  return (x + pow(2, 52)) - pow(2, 52);
}
static __attribute__((noinline, optimize("-ffast-math"))) double bar2(double x) { return foo2(x); }
#endif
int main() {
  double x = 1.1;
  if (bar1(x) == x) return 1;
#ifdef HUH
  if (bar2(x) == x) return 2;
#endif
  return 0;
}
$ g++-4.4 -O2 amusing.cc -ffast-math && ./a.out; echo $?
0
$ g++-4.4 -O2 amusing.cc -ffast-math -DHUH && ./a.out; echo $?
1
$ g++-4.4 -O2 amusing.cc && ./a.out; echo $?
0
$ g++-4.4 -O2 amusing.cc -DHUH && ./a.out; echo $?
1
... made even less sense than expected. I got that, like for other 'incompatible' flags, conflicting math flags should prevent inlining, only they don't. And it's all weird. But that one takes the cake.
Could someone tell me what the fuss is about?
$ g++-4.4 -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.4.1-6' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --enable-multiarch --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 --program-suffix=-4.4 --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr --enable-objc-gc --with-arch-32=i486 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.4.1 (Debian 4.4.1-6)
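Some background on the test expression, as a hedged sketch: (x + 2^52) - 2^52 is a classic way to round a small non-negative double to the nearest integer, and it only works if the compiler does not reassociate it into plain x. That is why an exit status of 1 or 2 in amusing.cc means the expression got simplified. The helper below is hypothetical and uses a volatile temporary instead of the optimize attribute to keep the two operations apart; it assumes the default round-to-nearest mode.

```cpp
// 2^52: from this magnitude on, a double has no fractional bits left.
static const double two52 = 4503599627370496.0;

// Rounds a small non-negative value to the nearest integer by pushing its
// fractional bits out of the mantissa. The volatile store keeps the compiler
// from folding (x + 2^52) - 2^52 back into x, even under -ffast-math.
inline double round_via_magic(double x) {
    volatile double shifted = x + two52;
    return shifted - two52;
}
```

With x = 1.1 the result is 1.0, so comparing it against x is false unless the addition and subtraction were cancelled out algebraically.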
How are we supposed to play along the autovectorizer in c++? (alignment issues)
Hello. The autovectorizer is enabled by default in g++ 4.3 and does a fine job most of the time. Except it gets mightily pissed off if you dare to tweak the alignment, and after much experimentation i haven't yet devised how to plug all the holes. This silly example shows where things start to get ugly
# cat autovec.cc
enum { N = 4, align_to = 16/sizeof(char) };
typedef float scalar_type;
struct foo_t {
  scalar_type m[N];
  foo_t operator +(const foo_t &rhs) const {
    foo_t v(*this);
    for (unsigned i=0; i<N; ++i) v.m[i] += rhs.m[i];
    return v;
  }
};
struct bar_t {
  scalar_type __attribute__((aligned(sizeof(char)*align_to))) m[N];
  bar_t operator +(const bar_t &rhs) const {
    bar_t v(*this);
    for (unsigned i=0; i<N; ++i) v.m[i] += rhs.m[i];
    return v;
  }
};
template<typename T> __attribute__((noinline)) void foobar(T &dst, const T *src) {
  T v = {{ 0 }};
  for (unsigned i=0; i<64; ++i) v = v + src[i];
  dst = v;
}
int main(int argc, char *argv[]) {
  foo_t *p((foo_t*) argv);
  bar_t *q((bar_t*) argv);
  foobar(*p, p + 1);
  foobar(*q, q + 1);
  return 0;
}
# g++ -O3 -march=native autovec.cc
# g++ 4.3.1, x86_64
There's not much to say about foobar<foo_t>, and the addition in foobar<bar_t> gets somewhat vectorized but
  400620: 89 54 24 f4          mov    %edx,-0xc(%rsp)
  400624: 89 4c 24 f0          mov    %ecx,-0x10(%rsp)
  400628: 44 89 44 24 ec       mov    %r8d,-0x14(%rsp)
  40062d: 44 89 4c 24 e8       mov    %r9d,-0x18(%rsp)
  400632: 0f 28 c1             movaps %xmm1,%xmm0
  400635: 0f 12 04 06          movlps (%rsi,%rax,1),%xmm0
  400639: 0f 16 44 06 08       movhps 0x8(%rsi,%rax,1),%xmm0
  40063e: 48 83 c0 10          add    $0x10,%rax
  400642: 41 0f 58 02          addps  (%r10),%xmm0
  400646: 48 3d 00 04 00 00    cmp    $0x400,%rax
  40064c: 41 0f 29 02          movaps %xmm0,(%r10)
  400650: 8b 54 24 f4          mov    -0xc(%rsp),%edx
  400654: 8b 4c 24 f0          mov    -0x10(%rsp),%ecx
  400658: 44 8b 44 24 ec       mov    -0x14(%rsp),%r8d
  40065d: 44 8b 4c 24 e8       mov    -0x18(%rsp),%r9d
  400662: 75 bc                jne    400620 <void foobar<bar_t>(bar_t&, bar_t const*)+0x20>
as you can see there's a lot of undue load/store. And that's for a POD (or something really looking like one).
So, you start fixing that with some looping copy ctor/operator (surely losing the POD property in the process) and so on. Doing that i can fix most reload issues, but stores are much more elusive (note that it depends on the underlying type & its natural alignment).
Ideally i'd like PODs to remain PODs, and synthesized ctor/operators to be efficient (ie not falling back to using gpr based memcpy when everything is in an XMM register already), or at least a consistent way in which such ctor/operators can be written (and dead stores removed).
Briefly: how am i supposed to decorate my structures with larger alignment and not royally piss off the autovectorizer (and g++ in general)?
Re: censored naked SSE reciprocals, -mrecip
On Dec 29, 2007 4:35 PM, Uros Bizjak [EMAIL PROTECTED] wrote:
> Attached patch fixes these problems by using correct shortcuts when generating intrinsic functions. Patch was bootstrapped and regression tested with {,-m32} on x86_64-pc-linux-gnu. Patch is committed to SVN. Thanks a lot for your report,
Now that's blazing fast after-sales service. And i get no less than two undocumented but functional builtins (as opposed to, say, __builtin_ia32_movddup, which is documented but dysfunctional) for the same price.
As an extremely satisfied customer, i want to nominate you for the 2007 man of the year short list.
censored naked SSE reciprocals, -mrecip
Merry xmas, i lately had some use for -mrecip but it turned out to come with all sorts of strings attached and apparently no opt-out. Briefly, barring inline asm, i can't get gcc to emit those ops without a NR fixup.
# cat src/pr-recip.c
#include <xmmintrin.h>
typedef float v4sf_t __attribute__ ((__vector_size__ (16)));
__m128 foo(__m128 a) { return _mm_sqrt_ps(a); }
__m128 bar(__m128 a) { return _mm_rsqrt_ps(a); }
__m128 baz(__m128 a) { return _mm_rcp_ps(a); }
v4sf_t nope1(v4sf_t a) { return __builtin_ia32_sqrtps(a); }
v4sf_t nope2(v4sf_t a) { return __builtin_ia32_rsqrtps(a); }
v4sf_t allright(v4sf_t a) { return __builtin_ia32_rcpps(a); }
int main() { return 0; }
# /usr/local/gcc-4.3-20071221/bin/gcc -march=native -ffast-math -mrecip -O2 src/pr-recip.c
... and as can be witnessed in the attached asm dump foo, bar, nope1, nope2 get mangled (at least on x86-64 linux).
While i can somehow understand the logic behind the automatic transformation of _mm_sqrt_ps - it can be argued that's what the user has asked for - there's no obvious way to opt out. But then i really don't understand why gcc feels the urge to tinker when i specifically ask for a rsqrt. To add insult to injury -mrecip, unlike fast-math, doesn't set any macro, so kludging around is a cat & mouse game.
Questions:
a) is that really by design?
b) what's the official way to dodge fixups when -mrecip is active?
c) any chance for -mrecip to set __FAST_MATH_NONE_SHALL_PASS__ or something?
dump.asm Description: Binary data
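For context, the "NR fixup" above is a Newton-Raphson refinement applied on top of the low-precision hardware estimate. A scalar sketch of one such step for the reciprocal square root follows; rsqrt_refine is a hypothetical name and this is not the code GCC emits, only the underlying formula.

```cpp
#include <cmath>

// One Newton-Raphson step for y ~ 1/sqrt(x):
//   y' = y * (1.5 - 0.5 * x * y * y)
// Each step roughly doubles the number of correct bits of the estimate.
inline float rsqrt_refine(float x, float estimate) {
    return estimate * (1.5f - 0.5f * x * estimate * estimate);
}
```

Feeding it x = 4 and a rough guess of 0.49 moves the result much closer to 0.5, which shows the trade being complained about: extra multiplies and a subtract in exchange for precision the caller may not have wanted.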
Re: Function specific optimizations call for discussion
On Nov 29, 2007 9:29 PM, Weddington, Eric [EMAIL PROTECTED] wrote:
> and I would also postulate the general embedded community, would *really* like to have this functionality, especially your Stage 1. There are many AVR, or embedded, applications where they are generally optimized for size, but have a time-critical function that needs to be optimized for speed.
I would personally, and i think it hasn't been evoked yet, *really* like to be able to toggle fast-math (or related flags) per function, basically for the same reason.
Re: recent troubles with float vectors bitwise ops
Mark Mitchell wrote:
> One option is for the user to use intrinsics. It's been claimed that results in worse code. There doesn't seem any obvious reason for that, but, if true, we should try to fix it; we don't want to penalize people who are using the intrinsics. So, let's assume using intrinsics is just as efficient, either because it already is, or because we make it so.
I maintain that empirical claim; if i compare what a simple SOA hybrid 3 coordinates something gives when implemented via intrinsics, builtins and vectors, and used as the basic component for a raytracer kernel, i get as many codegen variations: register allocations differ, stack footprints differ, branches & code organization differ, etc... so it's not that surprising that performance also differs. It appears the vector & builtin (which isn't using __m128 but straight v4sf) implementations are mostly on par while the intrinsic based version is slightly slower. Then you factor in how convenient it is, well... was, to use that vector extension to write such a something...
Another issue is that for MSVC and ICC, __m128 is a class, but not for gcc, so you need more wrapping in C++; but if you know what you're doing you can let some naked v4sf escape because the compiler always does the right thing with them.
Now while there are some subtleties (and annoying 'features'), i should state that gcc4.3, if you're careful, generates mostly excellent SSE code (especially on x86-64, even more so if compared to icc).
> We still have the problem that users now can't write machine-independent code to do this operation. Assuming the operations are useful for
That, and writing, say, a generic int,float,double something takes much much more work.
> What are these operation used for? Can someone give an example of a kernel than benefits from this kind of thing?
There's of course what Paolo Bonzini described, but also all kinds of tricks that knowing such operations are extremely efficient encourages. While it would be nice to have such builtins also operate on vectors, if only because they are so common, it's not quite the same as having full freedom and hardware features exposed.
Re: recent troubles with float vectors bitwise ops
Paolo Bonzini wrote:
> I'm not sure that it is *so* useful for a user to have access to it, except for specialized cases:
As there's other means, it may not be that useful but for sure it's extremely convenient.
> 2) selection operations on vectors, kind of (v1 <= v2 ? v3 : v4). These can be written for example like this:
>   cmpleps xmm1, xmm2 ; xmm1 = xmm1 <= xmm2 ? all-ones : 0
>   andnps  xmm4, xmm1 ; xmm4 = xmm1 <= xmm2 ? 0 : xmm4
>   andps   xmm1, xmm3 ; xmm1 = xmm1 <= xmm2 ? xmm3 : 0
>   orps    xmm1, xmm4 ; xmm1 = xmm1 <= xmm2 ? xmm3 : xmm4
I suppose you'll find such a variant of a conditional move pattern in every piece of SSE code. But you can't condense bitwise vs float usage to a few patterns because when writing SSE, the efficiency of those operations is taken for granted.
> If we have a good extension for vector arithmetic, we should aim at improving it consistently rather than extending it in unpredictable ways. For example, another useful extension would be the ability to access vectors by item using x[n] (at least with constant expressions).
Yes, yes and yes.
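The selection pattern quoted above can be spelled out lane by lane. The following is a scalar emulation of a single lane, a sketch assuming a 32-bit IEEE float; float_bits, bits_float and select_le are names invented here, and the comments map each line to the instruction whose role it plays.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a float as its 32-bit pattern and back, without conversion.
inline std::uint32_t float_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
inline float bits_float(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// v1 <= v2 ? v3 : v4, computed with a mask instead of a branch.
inline float select_le(float v1, float v2, float v3, float v4) {
    std::uint32_t mask  = (v1 <= v2) ? 0xFFFFFFFFu : 0u;  // cmpleps
    std::uint32_t keep4 = ~mask & float_bits(v4);          // andnps
    std::uint32_t keep3 = mask & float_bits(v3);           // andps
    return bits_float(keep3 | keep4);                      // orps
}
```

The point made in the thread is that these and/andn/or steps operate on float data all along, which is what a float-typed bitwise operator expresses directly.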
Re: recent troubles with float vectors bitwise ops
Ross Ridge wrote:
> If I were tbp, I'd just code all his vector operations using intrinsics. The other responses in this thread have made it clear that GCC's vector arithmetic operations are really only designed to be used with the Cell Broadband Engine and other Power PC processors.
Thing is my main use for that extension is for a specialization (made on a rainy day out of boredom) of a basic something re-used all over in my code; the default implementation uses intrinsics. It turns out, when benchmarked, that i get better code with the specialization. So it's more convenient and faster, win/win.
I'm unsure why the code is better in the end, perhaps because of the may_alias attribute of __m128, perhaps because some builtins which are used to implement those intrinsics are mistyped (ie v4si __builtin_ia32_cmpltps (v4sf, v4sf))... i don't know, i'd need to try a builtin based specialization.
In any case that vector extension is now totally useless on x86 and conflicts with the documentation.
Re: recent troubles with float vectors bitwise ops
Andrew Pinski wrote:
> Which hardware (remember GCC is a generic compiler)? VMX/Altivec and SPU actually does not have different instructions for bitwise and/ior/xor for different vector types (it is all the same instruction). I have ran into ICEs with even bitwise on vector float/double on x86 also in the past which is the other reason why I disabled them. Since this is an extension, it would be nice if it was nicely defined extension which means disabling them for vector float/double.
It *was* neatly defined: "The types defined in this manner can be used with a subset of normal C operations. Currently, GCC will allow using the following operators on these types: +, -, *, /, unary minus, ^, |, &, ~."
So can you, pretty please, also patch the documentation and maybe point to the Altivec spec as it's obviously the only one relevant no matter what platform you're on?
Re: recent troubles with float vectors bitwise ops
Paolo Bonzini wrote:
> To some extent I agree with Andrew Pinski here. Saying that you need support in a generic vector extension for vector float | vector float in order to generate ANDPS and not PXOR, is just wrong. That should be done by the back-end.
I guess i fail to grasp the logic mandating that the intended source level, strictly typed, 'vector float | vector float' should be mangled into an int op with frantic casts to magically emerge out from the backend as the original 'vector float | vector float', but i'm not a compiler maintainer: for me it smells like a regression.
Re: recent troubles with float vectors bitwise ops
Paolo Bonzini wrote:
> Because it's *not* strictly typed. Strict typing means that you accept the same things accepted for the element type. So it's not a regression, it's a bug fix.
# cat regressionorbugfix.cc
typedef float v4sf_t __attribute__ ((__vector_size__ (16)));
typedef int v4si_t __attribute__ ((__vector_size__ (16)));

v4sf_t foo(v4sf_t a, v4sf_t b, v4sf_t c) {
  return a + (b | c);
}
v4sf_t bar(v4sf_t a, v4sf_t b, v4sf_t c) {
  return a + (v4sf_t) ((v4si_t) b | (v4si_t) c);
}
int main() { return 0; }

00400a30 <foo(float __vector, float __vector, float __vector)>:
  400a30: orps   %xmm2,%xmm1
  400a33: addps  %xmm1,%xmm0
  400a36: retq
00400a40 <bar(float __vector, float __vector, float __vector)>:
  400a40: por    %xmm2,%xmm1
  400a44: addps  %xmm1,%xmm0
  400a47: retq
I'm surely not qualified to argue about typing, but you'd need a rather strong distortion field to not characterize that as a regression.
Re: recent troubles with float vectors bitwise ops
On 8/23/07, Paolo Bonzini [EMAIL PROTECTED] wrote:
> I've added 5 minutes ago an XFAILed test for exactly this code. OTOH, I have also committed a fix that will avoid producing tons of shuffle and unpacking instructions when function bar is compiled with -msse but without -msse2.
Thanks.
> I'm also going to file a missed optimization bug soon.
Ditto.
> I'm curious, does ICC support vector arithmetic like this? Do both functions compile? What code does it produce for bar?
No, icc9/10 only provide basic support for that extension (and then only on linux i think)
# /opt/intel/cce/9.1.051/bin/icpc regressionorbugfix.cc
regressionorbugfix.cc(5): error: no operator "|" matches these operands
            operand types are: v4sf_t | v4sf_t
    return a + (b | c);
                  ^
regressionorbugfix.cc(8): error: no operator "|" matches these operands
            operand types are: v4si_t | v4si_t
    return a + (v4sf_t) ((v4si_t) b | (v4si_t) c);
                                    ^
but then it's more aggressive about intrinsics than gcc.
Like i said somewhere, i got slightly better results when using that extension than intrinsics with gcc 4.3, but haven't checked if i could get the same result with builtins yet.
Re: recent troubles with float vectors bitwise ops
On 8/23/07, Tim Prince [EMAIL PROTECTED] wrote:
> The primary icc/icl use of SSE/SSE2 masking operations, of course, is in the auto-vectorization of fabs[f] and conditional operations:
>   sum = 0.f;
>   i__2 = *n;
>   for (i__ = 1; i__ <= i__2; ++i__)
>     if (a[i__] > 0.f) sum += a[i__];
> (Windows/intel asm syntax)
>   pxor xmm2, xmm2
>   cmpltps xmm2, xmm3
>   andps xmm3, xmm2
>   addps xmm0, xmm3
> ...
Note that icc9 has a strong bias for the pentium4, which had no stall penalty for mistyped fp vectors (for Intel that came with the pentium M line), so you see a pxor even when generating code for the core2.
# cat autoicc.cc
float foo(const float *a, int n) {
  float sum = 0.f;
  for (int i = 0; i < n; ++i)
    if (a[i] > 0.f) sum += a[i];
  return sum;
}
int main() { return 0; }
# /opt/intel/cce/9.1.051/bin/icpc -O3 -xT autoicc.cc
autoicc.cc(3) : (col. 2) remark: LOOP WAS VECTORIZED.
  4007a9: pxor    %xmm4,%xmm4
  4007ad: cmpltps %xmm3,%xmm4
  4007b1: andps   %xmm3,%xmm4
# /opt/intel/cce/10.0.023/bin/icpc -O3 -xT autoicc.cc
autoicc.cc(3): (col. 2) remark: LOOP WAS VECTORIZED.
  400b50: xorps   %xmm3,%xmm3
  400b53: cmpltps %xmm4,%xmm3
  400b57: andps   %xmm3,%xmm4
Re: recent troubles with float vectors bitwise ops
On 8/22/07, Paolo Bonzini [EMAIL PROTECTED] wrote:
> I think you're running too far with your sarcasm. SSE's instructions do not go so far as to specify integer vs. floating point. To me, ps means 32-bit SIMD, independent of integerness.
Excuse me if i'm amazed at being told that bitwise ops on floating values make no sense, as the justification for breaking something that used to work and match hardware features. I naively thought that was the purpose of that convenient extension.
>> So, that's what i feared... it was intentional. And now i guess the only sanctioned access to those ops is via builtins/intrinsics.
> No, you can do so with casts. Floating-point to integer vector casts preserve the bit pattern. For example, you can do
Again, SIMD ops (among them bitwise stuff) come in 3 mostly symmetric flavors on x86, namely for int, float and doubles; casting isn't innocuous because there's a penalty for type mismatch (1 cycle of re-categorization, if i remember correctly, for both k8 and core2), so it's either that or some moving around.
Let me cite the Intel(r) 64 and IA-32 Architectures Optimization Reference Manual, 5-1: "When writing SIMD code that works for both integer and floating-point data, use the subset of SIMD convert instructions or load/store instructions to ensure that the input operands in XMM registers contain data types that are properly defined to match the instruction. Code sequences containing cross-typed usage produce the same result across different implementations but incur a significant performance penalty. Using SSE/SSE2/SSE3/SSSE3 instructions to operate on type-mismatched SIMD data in the XMM register is strongly discouraged."
You could find a similar note in AMD's doc for the k8.
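To make the cast-based alternative concrete, here is a sketch using the same typedefs as elsewhere in this thread. It relies on the GCC vector extension (so it is not portable C++), or_bits and first_lane are made-up helper names, and it says nothing about which instruction the backend picks, which is the actual bone of contention.

```cpp
#include <cstring>

typedef float v4sf_t __attribute__ ((__vector_size__ (16)));
typedef int   v4si_t __attribute__ ((__vector_size__ (16)));

// Bitwise OR of two float vectors spelled through integer vectors. The casts
// reinterpret the 128 bits; they do not convert float values to ints.
inline v4sf_t or_bits(v4sf_t b, v4sf_t c) {
    return (v4sf_t) ((v4si_t) b | (v4si_t) c);
}

// Helper for inspection only: copies the vector out and returns lane 0.
inline float first_lane(v4sf_t v) {
    float lanes[4];
    std::memcpy(lanes, &v, sizeof lanes);
    return lanes[0];
}
```

OR-ing with a vector of -0.0f sets the sign bit of every lane, a typical use of such masks; the source-level result is the same whichever of por or orps ends up in the binary, only the domain-crossing cost described above differs.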
recent troubles with float vectors bitwise ops
Hello,
# cat vecop.cc
template<typename T> T foo() {
  T a = { 0, 1, 2, 3 },
    b = { 4, 5, 6, 7 },
    c = a | b,
    d = c & b,
    e = d ^ b;
  return e;
}
int main() {
  typedef float v4sf_t __attribute__ ((__vector_size__ (16)));
  typedef int v4si_t __attribute__ ((__vector_size__ (16)));
  foo<v4si_t>();
  foo<v4sf_t>();
  return 0;
}
# /usr/local/gcc-4.3-svn.old5/bin/g++ -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr/local/gcc-4.3-svn --enable-languages=c,c++ --enable-threads=posix --disable-checking --disable-nls --disable-shared --disable-win32-registry --with-system-zlib --disable-multilib --verbose --with-gcc=gcc-4.2 --with-gnu-ld --with-gnu-as --enable-checking=none --disable-bootstrap
Thread model: posix
gcc version 4.3.0 20070808 (experimental)
# /usr/local/gcc-4.3-svn.old5/bin/g++ vecop.cc
# /usr/local/gcc-4.3-svn.old6/bin/g++ -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr/local/gcc-4.3-svn --enable-languages=c,c++ --enable-threads=posix --disable-checking --disable-nls --disable-shared --disable-win32-registry --with-system-zlib --disable-multilib --verbose --with-gcc=gcc-4.2 --with-gnu-ld --with-gnu-as --enable-checking=none --disable-bootstrap
Thread model: posix
gcc version 4.3.0 20070819 (experimental)
# /usr/local/gcc-4.3-svn.old6/bin/g++ vecop.cc
vecop.cc: In function 'T foo() [with T = float __vector__]':
vecop.cc:13: instantiated from here
vecop.cc:4: error: invalid operands of types 'float __vector__' and 'float __vector__' to binary 'operator|'
vecop.cc:5: error: invalid operands of types 'float __vector__' and 'float __vector__' to binary 'operator&'
vecop.cc:6: error: invalid operands of types 'float __vector__' and 'float __vector__' to binary 'operator^'
Apparently it's still there as of right now, on x86-64 at least. I think this is not supposed to happen but i'm not sure, hence the mail.
Re: recent troubles with float vectors bitwise ops
Ian Lance Taylor wrote:
> What does it mean to do a bitwise-or of a floating point value?
Apparently enough for a small vendor like Intel to propose such things as orps, andps, andnps, and xorps.
So, that's what i feared... it was intentional. And now i guess the only sanctioned access to those ops is via builtins/intrinsics. Great. If only i could get the same quality of code when using intrinsics to begin with...
Re: g++ 4.3, troubles with C++ indexing idioms
On 7/24/07, Richard Guenther [EMAIL PROTECTED] wrote:
> For performance small arrays should be the same as individual members (I can see the annoying fact that initialization is a headache - this has annoyed me as well). For larger arrays (> 4 members), aliasing will make a difference possibly, making the array variant slower. Any union variant is expected to be slower for aliasing reasons (we do not do field-sensitive aliasing for unions).
Confirmed :) And thanks for the clue about the threshold.
> In the end I would still recommend to go with array variants.
I guess wishful thinking, or heresy, got me asking for a sanctioned address-this-as-an-array idiom; now i'll go with the flow and use those 2nd class citizens of C++ aka arrays, even if i'm a bit sceptical about the performance equivalence (granted it isn't as obvious as it used to be, i need to investigate some more). But for sure it can't be as terrible as unions...
Re: g++ 4.3, troubles with C++ indexing idioms
On 7/19/07, Richard Guenther [EMAIL PROTECTED] wrote:

> Of course, if any then the array indexing variant is fixed. It would be nice to see a complete testcase with a pessimization, maybe you can file a bugreport about this?

There's many issues for all alternatives and i'm not qualified to pinpoint them further. I've taken http://ompf.org/ray/sphereflake/ which is used as a benchmark already here http://www.suse.de/~gcctest/c++bench/raytracer/, because it's small, self contained and has such a basic 3 component class that's used all over. It doesn't use any kind of array access operator, but it's good enough to show the price one has to pay before even thinking of providing some. It has been adjusted to use floats and access members through accessors (to allow for a straighter comparison of all cases).

variation 0 is the reference, a mere struct { float x,y,z; ...};, performs as well as the original, but wouldn't allow for any 'valid' indexing.
variation 1 is struct { float f[3]; ... }
variations 2,3,4,5 try to use some union

# /usr/local/gcc-4.3-20070720/bin/g++ -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr/local/gcc-4.3-20070720 --enable-languages=c,c++ --enable-threads=posix --disable-checking --disable-nls --disable-shared --disable-win32-registry --with-system-zlib --disable-multilib --verbose --with-gcc=gcc-4.2 --with-gnu-ld --with-gnu-as --enable-checking=none --disable-bootstrap
Thread model: posix
gcc version 4.3.0 20070720 (experimental)

# make bench
[snip]
         real         user         sys
sf.v0    0m3.963s     0m3.812s     0m0.152s
sf.v1    0m3.972s     0m3.864s     0m0.104s
sf.v2    0m10.384s    0m10.261s    0m0.120s
sf.v3    0m10.390s    0m10.289s    0m0.104s
sf.v4    0m10.388s    0m10.265s    0m0.124s
sf.v5    0m10.399s    0m10.281s    0m0.116s

There's some inlining difference between union variations and the first two, but they clearly stand in their own league anyway.
So we can only seriously consider the first two.

Variation #0 would ask for invalid c++ (pointer arithmetic abuse, not an option anymore) or forbidding array access operator and going to set/get + memcpy, but pretty optimal.

Variation #1 (straight array) is quite annoying in C++ (no initializer list, need to reformulate all access etc...) and already show some slight pessimization, but it's not easy to track. Apparently g++ got a bit better lately in this regard, or it's only blatant on larger data or more complex cases.

I hope this shows how problematic it is for the end user.

// sphere flake bvh raytracer (c) 2005, thierry berger-perrin [EMAIL PROTECTED]
// this code is released under the GNU Public License.
// see http://ompf.org/ray/sphereflake/
// compile with ie g++ -O2 -ffast-math sphereflake.cc
// usage: ./sphereflake [lvl=6] >pix.ppm
#include <cmath>
#include <iostream>
#include <cstdlib>
#include <limits>

#define GIMME_SHADOWS

enum { childs = 9, ss= 2, ss_sqr = ss*ss }; /* not really tweakable anymore */
static const float infinity = std::numeric_limits<float>::infinity(), epsilon = 1e-4f;

#if VARIATION == 5
union v_t {
	// straight union; array left unharmed; just as horrible as the others.
	struct { float _x, _y, _z; };
	float f[3];
	v_t(const float a, const float b, const float c) : _x(a), _y(b), _z(c) {}
	float x() const { return _x; } float &x() { return _x; }
	float y() const { return _y; } float &y() { return _y; }
	float z() const { return _z; } float &z() { return _z; }
#else
struct v_t {
#endif
#if VARIATION == 0
	// best of the breed, but doesn't give way for an 'array access' operator.
	float _x, _y, _z;
	v_t(const float a, const float b, const float c) : _x(a), _y(b), _z(c) {}
	float x() const { return _x; } float &x() { return _x; }
	float y() const { return _y; } float &y() { return _y; }
	float z() const { return _z; } float &z() { return _z; }
#elif VARIATION == 1
	// not as good, obvious 'array access' but forbids initializer lists
	float f[3];
	v_t(const float a, const float b, const float c) { f[0] = a; f[1] = b; f[2] = c; }
	float x() const { return f[0]; } float &x() { return f[0]; }
	float y() const { return f[1]; } float &y() { return f[1]; }
	float z() const { return f[2]; } float &z() { return f[2]; }
#elif VARIATION == 2
	// Richard Guenther's suggestion, worst of the worst.
	union {
		struct { float x, y, z; } a;
		float b[3];
	} u;
	v_t(const float i, const float j, const float k) { u.a.x = i; u.a.y = j; u.a.z = k; }
	float x() const { return u.a.x; } float &x() { return u.a.x; }
	float y() const { return u.a.y; } float &y() { return u.a.y; }
	float z() const { return u.a.z; } float &z() { return u.a.z; }
#elif VARIATION == 3
	// slightly better than variation #2, but still terrible.
	union {
		struct { float _x, _y, _z; };
		float f[3];
	};
	v_t(const float a, const float b, const float c) : _x(a), _y(b), _z(c)
g++ 4.3, troubles with C++ indexing idioms
I have that usual heavy duty 3 fp components class that needs to be reasonably efficient and takes this form for g++

struct vec_t {
	float x,y,z;
	const float &operator()(const uint_t i) const { return *(&x + i); }
	float &operator()(const uint_t i) { return *(&x + i); } // <-- guilty
	[snip ctors, operators related cruft]
};

I use this notation because g++ does silly things with straight arrays (and C++ gets in the way), doesn't like

union vec_t {
	struct { float x,y,z; };
	float f[3];
	const float &operator()(const uint_t i) const { return f[i]; }
	float &operator()(const uint_t i) { return f[i]; }
};

much either, and seems to enjoy the first form (+ ctors with initializer lists) much. So far, so good.

Alas, somewhere between gcc-4.3-20070608 (ok) and gcc-4.3-20070707 (not ok ever since), the non const indexing started to trigger bogus codegen with some skipped stores on x86-64, but of course only in convoluted situations. So, i can't produce a simple testcase.

I can kludge around either by:
. marking it __attribute__((noinline))
. turning it into a set operation doing a std::memcpy(&x + i, &f, sizeof(float))
. annoying the optimizer with the entertaining

union vec_t {
	struct { float x,y,z; };
	float f[3];
	const float &operator()(const uint_t i) const { return *(&x + i); }
	float &operator()(const uint_t i) { return *(&x + i); }
};

At this point i'd need some guidance from compiler developers because the compiler itself provides none (no warning whatsoever in any of those variations) and what i thought was acceptable apparently isn't anymore. What kind of idiom am i supposed to write such thing in to get back efficient and correct code?
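One way to keep the named x, y, z members and still index without stepping outside a single float object is to select the member explicitly. The sketch below is not one of the kludges listed above and is not known to be as fast as the *(&x + i) form; it is only a well-defined baseline to compare against. The name vec_sw_t is hypothetical.

```cpp
// Named members plus indexing, with no pointer arithmetic across members.
struct vec_sw_t {
  float x, y, z;

  vec_sw_t(float a, float b, float c) : x(a), y(b), z(c) {}

  // Any index other than 0 or 1 maps to z in this sketch.
  float &operator()(unsigned i) {
    switch (i) {
      case 0:  return x;
      case 1:  return y;
      default: return z;
    }
  }

  const float &operator()(unsigned i) const {
    switch (i) {
      case 0:  return x;
      case 1:  return y;
      default: return z;
    }
  }
};
```

When the index is a compile-time constant after inlining, the switch can be expected to fold away; for runtime indices it costs a branch or a small table, which is the trade-off against the array variant.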
Re: g++ 4.3, troubles with C++ indexing idioms
On 7/19/07, Richard Guenther [EMAIL PROTECTED] wrote:

> Well, I always used the array variant, but you should be able to do [snip] if you need to (why does the array form not work for you?)

Because if you bench those variations (struct { float x,y,z; }, struct { float f[3]; } and some additional union layer) in some non trivial program, on x86/x86-64 at least, the last 2 consistently come out as slower. In the array case addressing seems to be the main issue (redundant scaling etc...); for the union variant, it's less clear but it seems it prohibits some copy/return value optimizations.

Plus gcc apparently likes (well, used to) very much the *(&x + i) idiom; all in all i had something to work with. Now i'm seeing *some* stores indexed in this way vanish, array addressing is still as bad as it was, unions still get me some pessimization and using the memcpy idiom asks me to give up on the idea of an array access operator altogether.

So i'm asking: which one is going to be fixed in the foreseeable future?
Re: g++ 4.3, troubles with C++ indexing idioms
On 7/19/07, Richard Guenther [EMAIL PROTECTED] wrote:

> Of course, if any then the array indexing variant is fixed. It would be nice to see a complete testcase with a pessimization, maybe you can file a bugreport about this?

By essence they're hard to trigger in small testcases (that's not where they matter anyway), and by my own previous experience large hairy bug reports get forgotten on the side of the road. But i'll see if i can make up something convincing, provided i got the cause for the relative slowdown right.
Re: g++ 4.3, troubles with C++ indexing idioms
On 7/19/07, Dave Korn [EMAIL PROTECTED] wrote:

> Bogus codegen is the inevitable result of bogus code. Garbage in, garbage out. BTW, the const indexing is completely undefined too.

That's the kind of answer i'd get from gcc-help and at that point i'd be none the wiser because i already know that. I also know that up to gcc-4.3-20070608 it was provably giving correct results faster than any other variant. Being no language lawyer, that's the only metric i consider.

It's no portability issue either because every compiler asks for a specific workaround; which is quite sad considering how mundane that code is.
Re: Activate -mrecip with -ffast-math?
On 6/18/07, Richard Guenther [EMAIL PROTECTED] wrote:

> No, that's not the contract with -ffast-math. Note that -ffast-math enables -funsafe-math-optimizations which is allowed to change results (add/remove rounding operations, contract expressions, do transforms like a/b to a * 1/b, do transformations that get you bigger errors than 0.5ulp, etc.)

I can't expect a division by a constant to survive -ffast-math unscathed, but then that's a change in precision and manageable. Being returned a NaN i'm not supposed to see for a common case depending on some transformation is something else, entirely.

>> But if i can't expect a mere division by 0, or sqrt of 0 (quite common with FTZ/DAZ on) to give me respectively an infinity and 0 and instead get a NaN (which i can't filter, you remember?) because of the NR round, that's pure madness.

> Hm, which particular case are you concerned about (maybe it was mentioned, but I don't remember the details)? Note that -ffast-math enables -ffinite-math-only as well, so the compiler assumes nothing will result in NaNs or Infs.

Yes and that's why it's such a pain to handle them correctly while in -ffast-math. But if i generate some, then i get what i've asked for (and i'm in for a local fix). Fair enough. I'm not going to give up ie fast robust SSE ray/aabb slab tests (or ray/plane or...) because of some arbitrary rule; the hardware handles it just fine (yes there's a penalty, but then it's way faster than branching).

For example, when doing 1/x and sqrt(x) via reciprocal + NR, you first get an inf from said reciprocal which then turns to a NaN in the NR stage, unless you correct it by, say, doing a comparison to 0 and an 'and'. That's what ICC used to do behind your back. That's what you'll find page 151 of the amdfam10 optimization manual. Because that's a common case. As far as i can see, there's no such provision in the current patch.

At the very least provide a means to look after those NaNs without losing sanity, like a way to enforce argument order of min/max[ss|ps|pd] without resorting to inline asm.

> Well - certainly another reason for the Math BOF ;) We all expect very different things from -ffast-math or -funsafe-math-optimizations.

You mean fast unsafe? I think there's quite a margin between letting someone shoot himself in the foot and putting a gun to his head.
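The failure mode described above (an infinity from the reciprocal estimate turning into a NaN in the Newton-Raphson step) can be reproduced with plain scalar code. This is only a sketch under stated assumptions: rsqrt_estimate() is a hypothetical stand-in for a hardware estimate such as rsqrtss and simply uses the exact value, the function names are made up, and the guard is written as a select where SSE code would use a compare against 0 plus an 'and'. It has to be built without -ffast-math for the NaN to be observable.

```cpp
#include <cmath>

// Stand-in for a hardware reciprocal square root estimate (assumption:
// exact value used here; real hardware returns an approximation).
// For x == 0 this yields +inf, just like the hardware estimate.
inline float rsqrt_estimate(float x) { return 1.0f / std::sqrt(x); }

// One Newton-Raphson refinement: y1 = y0 * (1.5 - 0.5 * x * y0 * y0).
// With x == 0 and y0 == inf, x * y0 is 0 * inf, i.e. NaN.
inline float rsqrt_nr(float x) {
  const float y = rsqrt_estimate(x);
  return y * (1.5f - 0.5f * x * y * y);
}

// sqrt(x) as x * rsqrt(x): NaN for x == 0 because 0 * NaN is NaN.
inline float sqrt_nr_naive(float x) { return x * rsqrt_nr(x); }

// Same computation with the zero case patched afterwards; in SSE this
// would be a compare-not-equal against 0 used as an 'and' mask.
inline float sqrt_nr_guarded(float x) {
  const float r = x * rsqrt_nr(x);
  return (x == 0.0f) ? 0.0f : r;
}
```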
Re: Activate -mrecip with -ffast-math?
On 6/18/07, Giovanni Bajo [EMAIL PROTECTED] wrote:

> I understand your problems, but let me state that your objections are totally subjective. *You* need a specific behaviour from -ffast-math (eg: keep NaN/Inf), but that's not what *I* need. So, we have different goals.

No. My NaNs are my problem. Those generated by gcc aren't. At the very least provide a canonical (efficient) way to filter them (ie SSE min/max).
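Why argument order matters for filtering can be shown with comparison-based min/max, which follow the same selection rule as the SSE minss/maxss instructions: the second operand is returned whenever the comparison is false, and any comparison involving a NaN is false. A minimal sketch, with hypothetical function names, meant to be built without -ffast-math:

```cpp
#include <cmath>
#include <limits>

// Second operand wins whenever the comparison fails, NaN included.
inline float min_second_on_nan(float a, float b) { return (a < b) ? a : b; }
inline float max_second_on_nan(float a, float b) { return (a > b) ? a : b; }

// Clamp a possibly-NaN value into [lo, hi]. Because v is always the
// first operand, a NaN input is replaced by lo instead of propagating.
inline float clamp_filter_nan(float v, float lo, float hi) {
  return min_second_on_nan(max_second_on_nan(v, lo), hi);
}
```

Swapping the operands, as a compiler is free to do once it assumes no NaN can occur, turns the filter into a NaN propagator, which is the reason a way to pin the order is being asked for.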
Re: Activate -mrecip with -ffast-math?
On 6/18/07, Uros Bizjak [EMAIL PROTECTED] wrote:

> IMO, due to limited range of operands for -mrecip pass (inf, -inf); where 0.0 is excluded, it should be kept out of -ffast-math. There is no point to fix reciprocals only for 0.0, we need to fix both conversions for infinity and 0.0, even in -ffast-math.

Indeed there are holes in every direction when you pull in such a transformation, and the cost of plugging every one of them would be prohibitive; the next batch of c2d supposedly will leave you with ~6 cycles to make it worthwhile for a sqrt. Of course it only gets worse when you start composing.

My point merely was that, considering one operation, you'd introduce a NaN for a not so special value (0) which, in a *fast* math scenario, could be produced at any previous stage due to denormal clamping, with no sane way to take care of it. Again, if you look at prior art (icc, AMD's manual...), that's the only special case they covered. Admittedly that's a trade off, but not that unreasonable.

Now, an option to remove such transformations from the -ffast-math bag-o-tricks would be fine and would still buy gcc some Spec bragging rights :)
Re: Activate -mrecip with -ffast-math?
On 6/18/07, Richard Guenther [EMAIL PROTECTED] wrote:

> Of course there are cases with every optimization enabled by -ffast-math that can break existing programs. Just that we know of one case beforehand shouldn't prevent us from enabling -mrecip at -ffast-math (provided -mno-recip still works, regardless if provided before or after -ffast-math). [We'll at least get some more testing coverage this way]

Argh! Please do not make -ffast-math even more of a pain to work with than it is already. You have to enable it, on the whole compilation unit, to get anywhere near decent performance; there's no escape: either you do not turn it on and everything slows to a crawl, or you pay for not being able to inline from another unit.

Until now, the contract was: you have to deal with (and contain) NaN and infinities. Fair enough, even if tricky that remained manageable. But if i can't expect a mere division by 0, or sqrt of 0 (quite common with FTZ/DAZ on) to give me respectively an infinity and 0 and instead get a NaN (which i can't filter, you remember?) because of the NR round, that's pure madness.

So please, for the love of everything that's sacred, leave such stunts out of -ffast-math.

PS: and it's not like such reciprocals + NR couldn't be done with intrinsics or easily handle such a common case.
x86-64 -mcx16, picky __sync_val_compare_and_swap?
While doing (or trying to) some cleanup thanks to -mcx16, i've been a bit surprised that

-- cut --
typedef int TItype __attribute__ ((mode (TI)));

TItype m_128;

void test(TItype x_128)
{
  m_128 = __sync_val_compare_and_swap (&m_128, x_128, m_128);
}

#include <xmmintrin.h>
typedef __m128i foo_t;
//typedef TItype foo_t;
foo_t foo;
void test2(foo_t x_128)
{
  foo = __sync_val_compare_and_swap (&foo, x_128, foo);
}

int main() { return 0; }
-- cut --

# /usr/local/gcc-4.3-20070323/bin/gcc -O2 -mcx16 xchg16.c -o xchg16
xchg16.c: In function 'test2':
xchg16.c:16: error: incompatible type for argument 1 of '__sync_val_compare_and_swap'

# /usr/local/gcc-4.3-20070323/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr/local/gcc-4.3-20070323 --enable-languages=c++ --enable-threads=posix --with-system-zlib --enable-__cxa_atexit --disable-checking --disable-nls --disable-multilib --enable-bootstrap --with-gcc --with-gnu-as --with-gnu-ld
Thread model: posix
gcc version 4.3.0 20070323 (experimental)

Am i just wrong believing that ought to work?
Re: x86-64 -mcx16, picky __sync_val_compare_and_swap?
On 4/2/07, Richard Henderson [EMAIL PROTECTED] wrote:

> On Mon, Apr 02, 2007 at 04:23:21PM +0200, tbp wrote:
>> Am i just wrong believing that ought to work?
> Yes.

It's hard to argue with a terse compiler or maintainer. Perhaps i should have picked an easier target, like the documentation at http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html: "GCC will allow any integral scalar or pointer type that is 1, 2, 4 or 8 bytes in length", a description that the TItype from the testsuite testcase does not fit. In any case thanks for the clarification.
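For contrast, here is the builtin used on a type the quoted documentation does cover, an 8-byte integer. This is a minimal sketch: the wrapper name cas64 is hypothetical, and it says nothing about the 16-byte case discussed above.

```cpp
#include <cstdint>

// Atomically replace *slot with 'desired' if it still holds 'expected'.
// Returns true when the swap happened.
inline bool cas64(std::uint64_t *slot, std::uint64_t expected,
                  std::uint64_t desired) {
  // The builtin returns the value *slot held before the operation.
  return __sync_val_compare_and_swap(slot, expected, desired) == expected;
}
```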
Re: remarks about g++ 4.3 and some comparison to msvc & icc on ia32
On 1/29/07, Mark Mitchell [EMAIL PROTECTED] wrote:

> It doesn't need to be a small testcase. If you have a preprocessed source file and a command-line, I'm sure one of the GCC developers would be able to analyze the situation. We're all good at isolating problems, even starting with big complicated inputs.

This is now known as PR 30627, http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30627

PS: Thanks to Vladimir for his input.
remarks about g++ 4.3 and some comparison to msvc & icc on ia32
Let it be clear from the start this is a potshot and while those trends aren't exactly new or specific to my code, i haven't tried to provide anything but specific data from one of my apps, on win32/cygwin.

Primo, gcc getting much better wrt inlining exacerbates the fact that it's not as good as other compilers at shrinking the stack frame size, and perhaps, as was suggested by Uros when discussing that point, a pass to address that would make sense. As i'm too lazy to properly measure cruft across multiple compilers, i'll use my rtrt app where i mostly control large scale inlining by hand.

objdump -wdrfC --no-show-raw-insn $1|perl -pe 's/^\s+\w+:\s+//'|perl -ne 'printf "%4d\n", hex($1) if /sub\s+\$(0x\w+),%esp/'|sort -r| head -n 10

msvc: 2196 2100 1772 1692 1688 1444 1428 1312 1308 1160
icc:  2412 2280 2172 2044 1928 1848 1820 1588 1428 1396
gcc:  2604 2596 2412 2076 2028 1932 1900 1756 1720 1132

That's with msvc8 sp1, icc 9.1.033, g++ 4.3-20070119, each compiler being configured to optimize as much as possible for speed. That confirms what i see when checking codegen for specific functions.

Secundo, while i very much appreciate the brand new string ops, it seems that on ia32 some array initialization cases were left out, hence i still see oodles of 'movl $0x0' when generating code for k8. Also those zeroings get coalesced at the top of functions on ia32, and i have a function where there's 3 pages of those right after the prologue. See the attached 'grep movl $0x0' dump.

Attachment: movl0.S.bz2 (BZip2 compressed data)
Re: remarks about g++ 4.3 and some comparison to msvc & icc on ia32
On 1/28/07, Richard Guenther [EMAIL PROTECTED] wrote:

> On 1/28/07, tbp [EMAIL PROTECTED] wrote:
>> objdump -wdrfC --no-show-raw-insn $1|perl -pe 's/^\s+\w+:\s+//'|perl -ne 'printf "%4d\n", hex($1) if /sub\s+\$(0x\w+),%esp/'|sort -r| head -n 10
>>
>> msvc: 2196 2100 1772 1692 1688 1444 1428 1312 1308 1160
>> icc:  2412 2280 2172 2044 1928 1848 1820 1588 1428 1396
>> gcc:  2604 2596 2412 2076 2028 1932 1900 1756 1720 1132
>
> It would have been nice to tell us what the particular columns in this table mean - now we have to decrypt objdump params and perl postprocessing ourselves.

I should have known better than to post on a Sunday morning. Sorry.

That's the sorted 10 largest stack allocations, in bytes, in binaries produced by each compiler (presuming most everything falls in place). Each time i verify codegen for a function across all 3, gcc always has the largest frame by a substantial amount (on ia32). And that's what that rigorous table is trying to demonstrate ;)

Basically i'm wondering if a stack frame shrinking pass
[ ] is possible,
[ ] makes no sense,
[ ] has been done,
[ ] is planned
etc...

> (If you are interested in stack size related to inlining you may want to tune --param large-stack-frame and --param large-stack-frame-growth).

Recently g++ 4.3 has started to complain about
warning: inlining failed in call to 'xxx': --param large-stack-frame-growth limit reached [-Winline].
Bumping said large-function-growth by an ungodly amount did the trick. But it was the sure sign inlining was being fixed. There's much less need to babysit it, thanks a lot to whoever wrote those patches.
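For those who would rather not decrypt the perl, the post-processing boils down to: find every 'sub $0x...,%esp' in the disassembly, convert the immediate from hex, sort descending and keep the top entries. A rough C++ equivalent follows; it is a sketch only, the function name largest_frames is hypothetical, and it expects the objdump output already split into lines.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <regex>
#include <string>
#include <vector>

// Collect the immediates of "sub $0x...,%esp" (the stack allocation in a
// typical ia32 prologue) and return the n largest, in descending order.
std::vector<unsigned long> largest_frames(const std::vector<std::string> &disasm,
                                          std::size_t n) {
  static const std::regex sub_esp(R"(sub\s+\$(0x[0-9a-fA-F]+),%esp)");
  std::vector<unsigned long> sizes;
  for (const std::string &line : disasm) {
    std::smatch m;
    if (std::regex_search(line, m, sub_esp))
      sizes.push_back(std::stoul(m[1].str(), nullptr, 16));
  }
  std::sort(sizes.begin(), sizes.end(), std::greater<unsigned long>());
  if (sizes.size() > n) sizes.resize(n);
  return sizes;
}
```

Note that the shell version relies on 'sort -r' over numbers padded by "%4d", which only orders correctly while every value fits in 4 digits; the numeric sort here has no such limit.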
Re: remarks about g++ 4.3 and some comparison to msvc & icc on ia32
On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:

> Actually we do have one stack frame shrinking pass already. It depends on where the bloat is coming from - we can pack (with some limitations) memory used by structures/arrays used by different inline functions or lexical blocks. We don't do any packing of spilled registers nor shrink wrapping other compilers sometimes implement.

Ah. So there's already some shrinkage. I don't think i can blame spilling for all that waste, but then i also have no idea what that shrink wrapping involves. Also i think it's only a bit worse with C++ where some idioms appear to cause more trouble. It would be nice to have a cheat sheet of dos and don'ts :)

It seems my previous obese mail got axed a bit,
http://ompf.org/vault/frontend.ii.bz2
http://ompf.org/vault/rt_render_packet.ii.bz2
Re: remarks about g++ 4.3 and some comparison to msvc & icc on ia32
On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:

> Also having some testcases showing inlining defects in GCC would be very interesting for me. Now after IPA-SSA has been merged, I plan to do some retuning of inliner for 4.3 release since a lot has changed about properties of its input and it was originally designed to operate well on IL used by early tree-ssa.

Gcc, well g++ really, used to be so bad at the inlining game, ie single op functions/ctors suddenly left out, that there was no other option than to explicitly direct inlining if one cared about performance. So i don't have much to show, for what i monitored wasn't under g++ jurisdiction. Now i know it has improved (much) because obviously other parts are being stressed.

> Considering information about stack frame size in the inlining costs is one of things I believe we should do but it is also difficult to tune without interesting testcases for it.

I have no idea what would make such a testcase interesting to you. But i can try. You'll find 2 preprocessed GPLed sources attached, with

frontend.cc, app::frontend_loop() (i don't particularly care about that function, but on ia32 - x86-64 is immune - g++ is quite creative about it (large frame, oodles of upfront zeroing, even if it's a bit better with the gcc-4.3-20070119 snapshot))
  frame size: msvc 1152 bytes, icc 2108, g++ 2604

rt_render_packet.cc, horde::grunt_render_tiles_packet(...) (this one i care about, inlining is controlled)
  frame size: msvc 1688, icc 1804, gcc 1932
  Performance wise on that one msvc lags by 25% and gcc has a slight lead of a couple percent on icc.

note: take 2,
http://ompf.org/vault/frontend.ii.bz2
http://ompf.org/vault/rt_render_packet.ii.bz2
Re: remarks about g++ 4.3 and some comparison to msvc & icc on ia32
On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:

> I am not quite sure what you mean by direct inlining here. At -O2 G++

Decorating everything in sight with attribute always_inline/noinline (flatten wasn't an option because it used to be troublesome and not as 'portable' across compilers).

> I would be interested to know about obvious mistakes GCC do - GCC now has logic to set cost of inlining wrapper functions (ie functions doing just one extra call and casts) to at most 0. It might be interesting to know if some common scenarios are missed.

I guess i should remove those attributes and see what it looks like.

> Well, we are working on it ;) You can take a look at c++ benchmarks http://www.suse.de/~gcctest the work is ongoing since cgraph was implemented in 2003, another retuning happened at about 4.0 timeframe, 4.3 has the SSA based IPA that should be another improvement.

I'm aware of that progression and some of my code is already being tested http://www.suse.de/~gcctest/c++bench/raytracer/ ;) 4.2 made a substantial difference for me, and it seems 4.3 is well on its way (even if it's a bit chaotic at times); IPA when enabled used to ICE on me and recently started to work, but i've failed to notice a difference (efficiency wise) yet. I guess i should wait a bit more. I very much appreciate the string op stuff, and i'm eagerly waiting for the assume() directive (wink wink).

> Thanks, what is definitely most interesting for me is self contained testcase I can easily compile and run, like we have tramp3d. I will definitely take a look at your testcases, but perhaps only after returning from trip at next weekend since I am running out of time for all my TODOs today ;)

It's still very much in flux, but once it stabilizes a bit i'll dump everything into a self contained black box of doom.

> Concerning the frame sizes, we really need some kind of analysis from where it is coming - ie whether GCC simply inline too much together, or fail to pack well the structures using existing algorithm or it is register pressure problem.

I'm out of my league. I know the frontend_loop function isn't as horrible on x86-64, giving some credit to the register pressure hypothesis, but then that code isn't doing anything fancy. For the other function, which heavily uses SSE vector intrinsics, g++ is really doing a good job, if only for the, sometimes, duplicated structures here and there and the larger frame. But you can rule out g++'s inlining heuristic as it has no (or shouldn't have any) freedom.

If there's anything i can do, do not hesitate. And thanks for taking notice.
Re: remarks about g++ 4.3 and some comparison to msvc & icc on ia32
On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:

> BTW when inlining seems to make so noticeable difference, did you try to use profile feedback?

Once a year, i try. But then it boils down to the fact that as a programmer i have no way to express how/where i want gcc to put its nose into. And i get back to fixing branches, inlining and unrolling (wink) by hand.

>> I'm aware of that progression and some of my code is already being tested http://www.suse.de/~gcctest/c++bench/raytracer/ ;)
> I see, we didn't seem to make that much progress on this testcase performance wise yet ;)

It's a silly 100 LOC raytracer and historically g++ already did the Right Thing[tm] (inlining everything), there's not much left to be gained.

>> For the other function, which heavily uses SSE vector intrinsics, g++ is really doing a good job, if only for the, sometimes, duplicated structures here and there and the larger frame. But you can rule out g++'s inlining heuristic as it has no (or shouldn't have any) freedom.
> Hmm, so then it should be either structure packing or regalloc. I will be able to take a look only after returning from a course.
> Honza

Regalloc is a lost cause on ia32 :) Note that nowadays g++ is up to the point where, despite those wastes, it's still faster to inline it all in one rendering function than splitting. And i think you can also put gcse on the culprit list.
build failure, gcc-4.3-20070126 snapshot, cygwin
[ -f stage_final ] || echo stage3 stage_final make[1]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo' make[2]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo' make[3]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo' rm -f stage_current make[3]: Leaving directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo' make[2]: Leaving directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo' make[2]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo' make[3]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/libiberty' make[4]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/libiberty/testsuite' make[4]: Nothing to be done for `all'. make[4]: Leaving directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/libiberty/testsuite' make[3]: Leaving directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/libiberty' make[3]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/intl' make[3]: Nothing to be done for `all'. make[3]: Leaving directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/intl' make[3]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/build-i686-pc-cygwin/libiberty' make[4]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/build-i686-pc-cygwin/libiberty/testsuite' make[4]: Nothing to be done for `all'. make[4]: Leaving directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/build-i686-pc-cygwin/libiberty/testsuite' make[3]: Leaving directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/build-i686-pc-cygwin/libiberty' make[3]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/build-i686-pc-cygwin/fixincludes' make[3]: Nothing to be done for `all'. 
make[3]: Leaving directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/build-i686-pc-cygwin/fixincludes' make[3]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/libcpp' test -f config.h || (rm -f stamp-h1 make stamp-h1) make[3]: Leaving directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/libcpp' make[3]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/libdecnumber' make[3]: Nothing to be done for `all'. make[3]: Leaving directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/libdecnumber' make[3]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/gcc' make[3]: Leaving directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/gcc' Checking multilib configuration for libgcc... make[3]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/i686-pc-cygwin/libgcc' # If this is the top-level multilib, build all the other # multilibs. make[4]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/i686-pc-cygwin/libgcc' if [ -z ]; then \ true; \ else \ rootpre=`${PWDCMD-pwd}`/; export rootpre; \ srcrootpre=`cd ../../../libgcc; ${PWDCMD-pwd}`/; export srcrootpre; \ lib=`echo ${rootpre} | sed -e 's,^.*/\([^/][^/]*\)/$,\1,'`; \ compiler=/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/./gcc/xgcc -B/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/./gcc/ -B/usr/local/gcc-4.3-20070126/i686-pc-cygwin/bin/ -B/usr/local/gcc-4.3-20070126/i686-pc-cygwin/lib/ -isystem /usr/local/gcc-4.3-20070126/i686-pc-cygwin/include -isystem /usr/local/gcc-4.3-20070126/i686-pc-cygwin/sys-include; \ for i in `${compiler} --print-multi-lib 2/dev/null`; do \ dir=`echo $i | sed -e 's/;.*$//'`; \ if [ ${dir} = . 
]; then \ true; \ else \ if [ -d ../${dir}/${lib} ]; then \ flags=`echo $i | sed -e 's/^[^;]*;//' -e 's/@/ -/g'`; \ if (cd ../${dir}/${lib}; make AR=ar AR_FLAGS=rc CC=/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/./gcc/xgcc -B/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/./gcc/ -B/usr/local/gcc-4.3-20070126/i686-pc-cygwin/bin/ -B/usr/local/gcc-4.3-20070126/i686-pc-cygwin/lib/ -isystem /usr/local/gcc-4.3-20070126/i686-pc-cygwin/include -isystem /usr/local/gcc-4.3-20070126/i686-pc-cygwin/sys-include CFLAGS=-g -fkeep-inline-functions DESTDIR= EXTRA_OFILES= HDEFINES= INSTALL=/usr/bin/install -c INSTALL_DATA=/usr/bin/install -c -m 644 INSTALL_PROGRAM=/usr/bin/install -c LDFLAGS= LOADLIBES= RANLIB=ranlib SHELL=/bin/sh prefix=/usr/local/gcc-4.3-20070126 exec_prefix=/usr/local/gcc-4.3-20070126 libdir=/usr/local/gcc-4.3-20070126/lib libsubdir=/usr/local/gcc-4.3-20070126/lib/gcc/i686-pc-cygwin/4.3.0 tooldir=/usr/local/gcc-4.3-20070126/i686-pc-cygwin \ CFLAGS=-g -fkeep-inline-functions ${flags} \ CCASFLAGS= ${flags} \ FCFLAGS= ${flags} \ FFLAGS= ${flags} \ ADAFLAGS= ${flags} \ prefix=/usr/local/gcc-4.3-20070126 \ exec_prefix=/usr/local/gcc-4.3-20070126 \ GCJFLAGS= ${flags} \ CXXFLAGS=-g -O2 ${flags} \ LIBCFLAGS=-g -fkeep-inline-functions ${flags}
Re: fancy x87 ops, SSE and -mfpmath=sse,387 performance
On 8/6/06, Paolo Bonzini [EMAIL PROTECTED] wrote:

>> Is there a way to enable such exotic codegen for 32bit environments?
> With libgcc-math you didn't have exotic instructions, but you had transcendental operations compiled with -mfpmath=sse and with a special ABI. -mfpmath=sse won about 8% over -mfpmath=387 for tramp3d, which does have transcendental operations. Let's see what happens for 4.3.

I'm not sure i grokked the fuss about libgcc-math. What i know is that -mfpmath=sse in recent gcc does wonders, just like SSE implementations of such library calls as i can experience them in a sane environment like linux x86-64. But it's truly horrible in cygwin and off the mark by an order of magnitude.

My complaint is that atm the only stopgap on such a platform is to resort to -mfpmath=sse,387 which is not without drawbacks. I understand -march=k8 -mfpmath=sse -mfancy-math-387 is out of the question, but could you clarify what i should expect from 4.3?
fancy x87 ops, SSE and -mfpmath=sse,387 performance
Basically i'd like to have the cake and also eat it.

With g++-4.2-20060805/cygwin on a k8 box, on some software path with lots of sp float ops but no transcendentals or library calls:

-mfpmath=sse,387: 5.2 Mray/s
-mfpmath=sse:     6 Mray/s

That 15% performance difference is no surprise when you see things like

4037c8: flds 0x4(%esp)
4037cc: mulss %xmm5,%xmm2
4037d0: fsubrp %st,%st(1)
4037d2: movss %xmm1,0x4(%esp)
4037d8: addss 0x278(%esp,%ecx,4),%xmm0
4037e1: flds 0x4(%esp)
4037e5: fsubrp %st,%st(1)
4037e7: addss %xmm2,%xmm0
4037eb: movss %xmm0,0x4(%esp)
4037f1: flds 0x4(%esp)
4037f5: fdivrp %st,%st(1)
4037f7: fcomi %st(1),%st
4037f9: fldz
4037fb: setae %dl
4037fe: fcomip %st(1),%st
403800: seta %al
403803: or %al,%dl
403805: je 4036ca

Therefore -mfpmath=sse is the way to go and is in fact on par or better than what i get out of icc 9.1 for the same code.

Where it gets ugly is when, for example, you throw some cosf() into the same compilation unit, as with -mfpmath=sse you pay for some really really slow library function calls (at least on cygwin). Wishful thinking got me trying -march=k8 -mfpmath=sse -mfancy-math-387, to no avail :(

Is there a way to enable such exotic codegen for 32bit environments?
g++ 4.2, cygwin, NUMA awareness issue
As i don't know which party (g++, stdc++, cygwin) to put the blame on i'll start here. I've traced back a weird performance issue to a 'new' returning non cpu-local memory, but only when the binary is launched from the shell/console. That suggests some crt friction. (Threads where those allocations happen are properly bound to one cpu.) That's on xp sp2, on a bi-k8 box with:

  Using built-in specs.
  Target: i686-pc-cygwin
  Configured with: ../configure --prefix=/usr/local/gcc-4.2-20060624 --enable-languages=c,c++ --enable-threads=posix --with-system-zlib --disable-checking --disable-nls --disable-shared --disable-win32-registry --verbose --enable-bootstrap --with-gcc --with-gnu-ld --with-gnu-as --with-cpu=k8
  Thread model: posix
  gcc version 4.2.0 20060624 (experimental)

Does that ring a bell or shall i move along the chain? :)
Re: Optimize flag breaks code on many versions of gcc (not all)
On 6/19/06, Richard Guenther [EMAIL PROTECTED] wrote: Using -mfpmath=sse -msse2 is a workaround if you have a processor that supports SSE2 instructions. As opposed to -ffloat-store, it works reliably and with no performance impact.

Such a slab test can be turned into a branchless sequence of SSE min/max, even for filtering infinities around dir ~= 0; it's much simpler and more efficient to intersect 4 rays against one box at once though. Without intrinsics, a NaN-oblivious version would look like:

  static float minf(const float a, const float b) { return (a < b) ? a : b; }
  static float maxf(const float a, const float b) { return (a > b) ? a : b; }

  bool_t intersect_ray_box(const aabb_t &box, const rt::mono::ray_t &ray, float &lmin, float &lmax) {
    float
      l1 = (box.min.x - ray.pos.x) * ray.inv_dir.x,
      l2 = (box.max.x - ray.pos.x) * ray.inv_dir.x;
    lmin = minf(l1,l2);
    lmax = maxf(l1,l2);
    l1 = (box.min.y - ray.pos.y) * ray.inv_dir.y;
    l2 = (box.max.y - ray.pos.y) * ray.inv_dir.y;
    lmin = maxf(minf(l1,l2), lmin);
    lmax = minf(maxf(l1,l2), lmax);
    l1 = (box.min.z - ray.pos.z) * ray.inv_dir.z;
    l2 = (box.max.z - ray.pos.z) * ray.inv_dir.z;
    lmin = maxf(minf(l1,l2), lmin);
    lmax = minf(maxf(l1,l2), lmax);
    return (lmax >= lmin) & (lmax >= 0.f);
  }
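The slab test above relies on types that aren't shown in the post. As a rough, self-contained sketch, assuming hypothetical vec3_t/aabb_t/ray_t layouts (only the member names min, max, pos and inv_dir come from the quoted code) and plain bool instead of bool_t, it could be exercised like this:

```cpp
#include <cassert>

// Hypothetical stand-ins for the types used in the post.
struct vec3_t { float x, y, z; };
struct aabb_t { vec3_t min, max; };
struct ray_t  { vec3_t pos, inv_dir; };   // inv_dir = 1/direction, per component

static inline float minf(const float a, const float b) { return (a < b) ? a : b; }
static inline float maxf(const float a, const float b) { return (a > b) ? a : b; }

// Slab test: intersect the ray with the three pairs of axis-aligned planes
// and keep the overlap [lmin, lmax] of the three parametric intervals.
bool intersect_ray_box(const aabb_t &box, const ray_t &ray, float &lmin, float &lmax)
{
    float l1 = (box.min.x - ray.pos.x) * ray.inv_dir.x,
          l2 = (box.max.x - ray.pos.x) * ray.inv_dir.x;
    lmin = minf(l1, l2);
    lmax = maxf(l1, l2);
    l1 = (box.min.y - ray.pos.y) * ray.inv_dir.y;
    l2 = (box.max.y - ray.pos.y) * ray.inv_dir.y;
    lmin = maxf(minf(l1, l2), lmin);
    lmax = minf(maxf(l1, l2), lmax);
    l1 = (box.min.z - ray.pos.z) * ray.inv_dir.z;
    l2 = (box.max.z - ray.pos.z) * ray.inv_dir.z;
    lmin = maxf(minf(l1, l2), lmin);
    lmax = minf(maxf(l1, l2), lmax);
    // Hit if the interval is non-empty and not entirely behind the origin.
    return (lmax >= lmin) & (lmax >= 0.f);
}

// Hand-computed check: unit-ish box, ray starting at x = -5.
void slab_self_check()
{
    const aabb_t box  = { {-1.f, -1.f, -1.f}, {1.f, 1.f, 1.f} };
    const ray_t  hit  = { {-5.f, 0.f, 0.f}, {1.f, 8.f, 8.f} };
    const ray_t  miss = { {-5.f, 3.f, 0.f}, {1.f, 8.f, 8.f} };
    float lmin, lmax;
    assert(intersect_ray_box(box, hit, lmin, lmax));
    assert(lmin == 4.f && lmax == 6.f);
    assert(!intersect_ray_box(box, miss, lmin, lmax));
}
```

The values in slab_self_check() were picked so that every intermediate product is exactly representable; it is meant to be called from main() or a test harness.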
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, Paolo Bonzini [EMAIL PROTECTED] wrote: Wait wait. PR/21195 is about inlining the SSE builtins. No. PR/21195 was really about the inline heuristic going ballistic. Those intrinsics are thin wrappers around builtins, and ultimately resolve to a couple of operations. Typical C++ (accessors/ctors) also presents lots of such small functions. And guess what, same cause, same symptom. There's no sensible metric by which the code i've quoted in my previous mail makes sense. Size? Nope. Execution time? Certainly not. Again, whether or not SSE ops are involved was and still is irrelevant. Your case seems to be different, because it involves inlining user routines. Again, you need to give us the preprocessed source code for us to look at your bug effectively. Thanks for the tip, but i'll pass. I've done my duty already. Months ago there were 2 options for fixing PR/21195: a) Fix the inlining heuristic. b) Kludge all intrinsics with always_inline. I've tried to argue a bit but to no avail. So, while you remain convinced everything's fine with the inliner, i'll keep tagging every function in my code with always_inline/noinline where performance matters.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, Richard Guenther [EMAIL PROTECTED] wrote: Starting with gcc 4.1.0 we have inline heuristics in place that will _always_ inline such simple wrappers. So, if this still happens, there is a bug in the heuristics and that should be reported. Before 4.1.0 the heuristics were bogus and wrappers were not inlined all the time. So, can you verify you are happy with the heuristics in 4.1.0? No i'm not, and i've used a pristine 4.1.0 in http://gcc.gnu.org/ml/gcc/2006-03/msg00410.html I haven't tried that particular testcase on 4.2.x, but some weeks ago i had to go thru all my code again to put always_inline in some forgotten places because i was seeing even empty ctors not being inlined (to the effect of having a call to a ret). So in this regard, 4.1.0 and 4.2.x still exhibit that kind of behaviour. It seems to trigger when some particular threshold is met, either for a function or for a unit; then nothing at all gets inlined except functions tagged with always_inline, and of course a major performance regression ensues.
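For readers unfamiliar with the tagging being discussed, a minimal sketch of the always_inline/noinline kludge follows. The vec2_t type and both functions are invented for illustration; only the two GCC attributes themselves come from the thread.

```cpp
#include <cassert>

// Hypothetical small type standing in for the accessors/ctors mentioned above.
struct vec2_t { float x, y; };

// Trivial helper: ask GCC to inline it regardless of the unit's inline budget.
static inline __attribute__((always_inline))
float dot(const vec2_t &a, const vec2_t &b)
{
    return a.x * b.x + a.y * b.y;
}

// Cold path: keep it out of line so it doesn't eat into the caller's budget.
__attribute__((noinline))
float cold_path(const vec2_t &a, const vec2_t &b)
{
    return dot(a, b) * 2.f;   // dot() is expanded here, no call emitted
}

// Small check of the arithmetic; call it from main().
void inline_attr_self_check()
{
    const vec2_t a = {1.f, 2.f}, b = {3.f, 4.f};
    assert(cold_path(a, b) == 22.f);
}
```

Whether a call survives can be checked with objdump -d on the object file: with the attributes in place there should be a cold_path symbol and no separate dot symbol.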
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, Richard Guenther [EMAIL PROTECTED] wrote: Of course from 4.1.0 on you can easier stick an __attribute__((flatten)) on the function you want everything inlined to (finalblow) and get everything inlined into it. But that's not really what i'm after: i expect trivial functions to get inlined no matter what at a given -Ox. With always_inline on it, the wrappers are no longer inlined - this is a bug and should be reported. Can you report a bugzilla for the bad interaction between always_inline and inlining of simple wrappers? I will report it again then.
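As a rough illustration of the flatten attribute mentioned in the quote, here is a sketch with made-up helper names (square, length2, hot_loop). The attribute asks GCC to inline every call inside the tagged function's body where it can, without annotating each callee.

```cpp
#include <cassert>

// Hypothetical small helpers, deliberately left without any inline hints.
static float square(float v) { return v * v; }
static float length2(float x, float y) { return square(x) + square(y); }

// flatten: calls made inside hot_loop() are inlined into it when possible.
__attribute__((flatten))
float hot_loop(const float *xs, const float *ys, int n)
{
    float acc = 0.f;
    for (int i = 0; i < n; ++i)
        acc += length2(xs[i], ys[i]);
    return acc;
}

// Small check of the arithmetic; call it from main().
void flatten_self_check()
{
    const float xs[2] = {3.f, 1.f}, ys[2] = {4.f, 2.f};
    assert(hot_loop(xs, ys, 2) == 30.f);   // (9 + 16) + (1 + 4)
}
```

This helps one known hot spot, but as noted in the reply it does not make trivial functions inline everywhere else.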
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, Richard Guenther [EMAIL PROTECTED] wrote: I see the bug and will have a fix in a moment. You made my day. Or you're about to. Unless you're lying and i'll have to curse you for 7 generations.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, Richard Guenther [EMAIL PROTECTED] wrote: http://gcc.gnu.org/ml/gcc-patches/2006-03/msg00739.html /me ventilates. You're my hero.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, tbp [EMAIL PROTECTED] wrote: On 3/13/06, Richard Guenther [EMAIL PROTECTED] wrote: http://gcc.gnu.org/ml/gcc-patches/2006-03/msg00739.html /me ventilates. You're my hero. A double+ hero on top of that. http://gcc.gnu.org/ml/gcc-patches/2006-03/msg00737.html I think i've hit that one too; reported here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26650 Well, i can always dream.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, Richard Guenther [EMAIL PROTECTED] wrote: I don't think this is related, and a quick check with the patch shows still unaligned moves to the stack. Patience is a virtue i guess :) Is there a good chance your inlining fix will hit mainline soon?
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/12/06, Steven Bosscher [EMAIL PROTECTED] wrote: Yes, why is the benchmark not valid? It is valid. We should understand why this behavior has changed so drastically. This benchmark may be useless, but it still exposes a weakness of gcc4. At least it's not news to me: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21195 So that PR was closed when the gcc devs marked all those intrinsics as force_inline. That's also the kludge i use with my code. The real problem is that once you start marking some functions as force_inline, you upset the inlining heuristic even more, creating even more silly inlining misses; rinse, repeat. At the end of the day, everything is marked either force_inline or noinline and you'd be better off without a heuristic at all.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, Andrew Pinski [EMAIL PROTECTED] wrote: Actually the best way of improving the inline heuristics is to get a real testcase (and not some benchmark) where the inline heuristics is messed up.

Ah, you mean a brand new testcase because PR-21195 wasn't good enough?

  $ /usr/local/gcc-4.1.0/bin/g++ -v
  Using built-in specs.
  Target: i686-pc-cygwin
  Configured with: ../configure --prefix=/usr/local/gcc-4.1.0 --enable-languages=c,c++ --enable-threads=posix --with-system-zlib --disable-checking --disable-nls --disable-shared --disable-win32-registry --verbose --enable-bootstrap --with-gcc --with-gnu-ld --with-gnu-as --with-cpu=k8
  Thread model: posix
  gcc version 4.1.0

  /usr/local/gcc-4.1.0/bin/g++ -g -O3 -march=k8 -msse2 -o pr-inline.o pr-inline.cc

  #include <xmmintrin.h>

  static __m128 mm_max_ps(const __m128 a, const __m128 b) { return _mm_max_ps(a,b); }
  static __m128 mm_min_ps(const __m128 a, const __m128 b) { return _mm_min_ps(a,b); }
  static __m128 mm_mul_ps(const __m128 a, const __m128 b) { return _mm_mul_ps(a,b); }
  static __m128 mm_div_ps(const __m128 a, const __m128 b) { return _mm_div_ps(a,b); }
  static __m128 mm_or_ps(const __m128 a, const __m128 b) { return _mm_or_ps(a,b); }
  static int mm_movemask_ps(const __m128 a) { return _mm_movemask_ps(a); }

  static __attribute__ ((always_inline)) bool bloatit(const __m128 a, const __m128 b) {
    const __m128
      v0 = mm_max_ps(a,b),
      v1 = mm_min_ps(a,b),
      v2 = mm_mul_ps(a,b),
      v3 = mm_div_ps(a,b),
      g0 = mm_or_ps(_mm_or_ps(_mm_or_ps(v0,v1), v2), v3),
      v4 = mm_min_ps(mm_or_ps(a,b),mm_div_ps(b,a)),
      v5 = mm_max_ps(mm_min_ps(a,mm_div_ps(b,a)), mm_or_ps(b, mm_div_ps(b,g0))),
      g1 = mm_or_ps(g0,mm_or_ps(v4,v5));
    return mm_movemask_ps(g1);
  }

  bool finalblow(const __m128 a, const __m128 b, const __m128 c, const __m128 d, const __m128 e, const __m128 f) {
    return
      bloatit(a,b) & bloatit(c,d) & bloatit(e,f) & bloatit(a,c) & bloatit(b,d) & bloatit(c,e) & bloatit(d,f) &
      bloatit(b,a) & bloatit(d,c) & bloatit(f,e) & bloatit(c,a) & bloatit(d,b) & bloatit(e,c) & bloatit(f,d);
  }

  int main() { return 0; }

  00401080 <mm_mul_ps(float __vector, float __vector)>:
  401080: push   %ebp
  401081: mulps  %xmm1,%xmm0
  401084: mov    %esp,%ebp
  401086: sub    $0x8,%esp
  401089: leave
  40108a: ret
  40108b: nop
  40108c: lea    0x0(%esi),%esi

  00401090 <mm_or_ps(float __vector, float __vector)>:
  401090: push   %ebp
  401091: orps   %xmm1,%xmm0
  401094: mov    %esp,%ebp
  401096: sub    $0x8,%esp
  401099: leave
  40109a: ret
  40109b: nop
  40109c: lea    0x0(%esi),%esi

  004010a0 <mm_div_ps(float __vector, float __vector)>:
  4010a0: divps  %xmm1,%xmm0
  4010a3: push   %ebp
  4010a4: mov    %esp,%ebp
  4010a6: sub    $0x8,%esp
  4010a9: leave
  4010aa: ret
  4010ab: nop
  ...
  004010e0 <finalblow(float __vector, float __vector, float __vector, float __vector, float __vector, float __vector)>:
  ...
  401101: call   4010c0 <mm_max_ps(float __vector, float __vector)>
  401106: movaps %xmm0,0xf958(%ebp)
  40110d: movaps 0xf8f8(%ebp),%xmm1
  401114: movaps 0xf908(%ebp),%xmm0
  40111b: call   4010b0 <mm_min_ps(float __vector, float __vector)>
  401120: movaps 0xf8f8(%ebp),%xmm1
  401127: movaps %xmm0,0xf948(%ebp)
  40112e: movaps 0xf908(%ebp),%xmm0
  401135: call   401080 <mm_mul_ps(float __vector, float __vector)>
  40113a: movaps 0xf8f8(%ebp),%xmm1
  401141: movaps %xmm0,0xf938(%ebp)
  401148: movaps 0xf908(%ebp),%xmm0
  40114f: call   4010a0 <mm_div_ps(float __vector, float __vector)>
  401154: movaps 0xf958(%ebp),%xmm1
  40115b: orps   0xf948(%ebp),%xmm1
  401162: movaps %xmm1,0xf958(%ebp)
  401169: movaps %xmm0,%xmm1
  40116c: movaps 0xf958(%ebp),%xmm0
  401173: orps   0xf938(%ebp),%xmm0
  40117a: call   401090 <mm_or_ps(float __vector, float __vector)>
  40117f: movaps 0xf908(%ebp),%xmm1
  401186: movaps %xmm0,0xf928(%ebp)
  40118d: movaps 0xf8f8(%ebp),%xmm0
  401194: call   4010a0 <mm_div_ps(float __vector, float __vector)>
  401199: movaps 0xf8f8(%ebp),%xmm1
g++ 4.1.0/4.2.x, x86/x86-64, segfaults due to bogus SSE alignments
This bug is really transient, and AFAIK i only trigger it when using the cluebat on g++, that is, bloating every function in sight appropriately with always_inline/noinline attributes, in a unit that inflates much. I tracked one occurrence to something like this:

  union float4_t { float f[4]; __m128 v; ... };

  static void foobar() {
    float4_t __attribute__((aligned (16))) bar;
    ...
    __m128 foo;
    ...
    bar = foo;
  }

If i let g++ decide whether foobar() should be inlined or not, everything's fine (except performance of course). Then if i force_inline foobar() i may or may not get something to the effect of:

  40666a: movaps %xmm0,0x348(%esp)
  406672: mov    0x348(%esp),%eax
  406679: mov    %eax,0x310(%esp)
  406680: mov    0x34c(%esp),%eax
  406687: movaps 0x210(%esp),%xmm0
  40668f: mov    %eax,0x314(%esp)
  406696: mov    0x350(%esp),%eax
  40669d: movaps %xmm0,0x40(%esp)
  4066a2: mov    %eax,0x318(%esp)
  4066a9: mov    0x354(%esp),%eax

Why that value suddenly gets copied around, i don't know. It doesn't matter much anyway, as the program won't survive past the bogus store (movaps requires a 16-byte aligned address and 0x348 isn't one). It's not just related to that kind of mixed union either, and again it clearly depends on surrounding functions being force_inlined and noinlined and lots of stuff ending up on the stack. I can trigger it on cygwin and linux, with g++ 4.1.0 and various 4.2.x, and once triggered, using -Os or -Ox doesn't matter; it's been there for a long time but that's the first time i can track it down somehow (inlining heuristics being extremely anyway). I haven't made a bugreport yet, as that would require disclosing large amounts of code, but i'd like to know if it's a known issue by any chance. Regards, tbp.
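To make the failure mode a bit more concrete, here is a small sketch, assuming an x86 target with SSE, of the pattern involved: a union mixing float[4] and __m128, forced to 16-byte alignment on the stack, with an explicit address check before the vector store. The names sum4 and is_aligned16 are made up, and this does not reproduce the bug; it only shows what the compiler is supposed to guarantee.

```cpp
#include <cassert>
#include <stdint.h>
#include <xmmintrin.h>

// Same shape as the union in the post, minus the elided members.
union float4_t {
    float  f[4];
    __m128 v;
};

// movaps faults unless its memory operand is 16-byte aligned.
static bool is_aligned16(const void *p)
{
    return (reinterpret_cast<uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: store a vector through the union, read it back as floats.
static __attribute__((noinline)) float sum4(const __m128 foo)
{
    float4_t __attribute__((aligned(16))) bar;
    assert(is_aligned16(&bar));   // what the bogus codegen violated
    bar.v = foo;                  // an aligned 16-byte store is legal here
    // Reading f[] after writing v relies on GCC's union type-punning extension.
    return bar.f[0] + bar.f[1] + bar.f[2] + bar.f[3];
}

// Small check; call it from main().
void align_self_check()
{
    assert(sum4(_mm_set_ps(4.f, 3.f, 2.f, 1.f)) == 10.f);
}
```

An assert like the one in sum4() is a cheap way to catch a misplaced stack slot before the movaps does.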
Re: g++ 4.1.0/4.2.x, x86/x86-64, segfaults due to bogus SSE alignments
On 3/11/06, Daniel Jacobowitz [EMAIL PROTECTED] wrote: Unlikely, since you haven't described at all what the problem is. That's why we prefer bug reports with testcases. ...segfaults due to bogus SSE alignments 40666a: movaps %xmm0,0x348(%esp)
g++ 4.2.x and (auto) inlining
Hello, i've just experienced a 40%+ run-time performance drop that, in fine, was due to g++ refusing to auto-inline trivial ctors and the like in a cramped unit (featuring noinlines and forced inlines). That's not the first time i've met that snafu, but what kinda surprises me is the fact that i've recently removed more code than added (granted, that doesn't mean much) and that unit was already being compiled with: --param inline-unit-growth=1 --param max-inline-insns-recursive=1. I had to bump both by an order of magnitude to get things flying again. Even if it works, i'm a bit worried that in some not too distant future i may run out of digits. That was with gcc version 4.2.0 20060204 and i was wondering if, semi-recently, g++ behaviour regarding auto inlining had been tweaked or something. In any case, if there's a better stopgap that doesn't imply explicitly force inlining everything in sight, i'd like to know. Or if there's something in the works. Regards, tbp.
Re: x86-64, I definitely can't make sense out of that
On 2/4/06, Andrew Pinski [EMAIL PROTECTED] wrote: Dale Johannesen and I came up with a patch to the C++ front-end for this except it did not work with some C++ cases. Ah, so i'm not totally inane. Is there a PR i can track for this one?
x86-64, I definitely can't make sense out of that
As i couldn't understand why g++ insisted on spitting out movq $0, stack only to rewrite the same place a few cycles later (with a different width), i've made a testcase and now, 20 minutes later, i'm even more puzzled.

  #include <xmmintrin.h>
  #include <stdio.h>

  struct dir_t { __m128 x,y,z; };

  int creative_codegen(const struct dir_t *dir) {
    const int
      sx = _mm_movemask_ps(dir->x),
      sy = _mm_movemask_ps(dir->y),
      sz = _mm_movemask_ps(dir->z),
      signs_all[4] = { !(sx 0), !(sy 0), !(sz 0), 0 },
      coherent = (((sx == 0) | (sx == 15)) & ((sy == 0) | (sy == 15)) & ((sz == 0) | (sz == 15)));
    if (coherent) {
      int i;
      for (i=0; i<4; ++i) printf("%d",signs_all[i]);
    }
    return coherent;
  }

  int main(int argc, void **argv) { return creative_codegen((struct dir_t*)argv); }

with g++ -O2 (4.0.3, 4.2.0 20060121)

  [...]
  40056d: movq $0x0,0x10(%rsp)   # ?
  400576: movq $0x0,0x18(%rsp)   # ??
  [...]
  40058b: movq $0x0,(%rsp)       # ???
  400593: movq $0x0,0x8(%rsp)    # ok
  [...]
  40059f: mov  %eax,0x10(%rsp)   # ok
  [...]
  4005b1: mov  %eax,0x14(%rsp)   # ok
  [...]
  4005c4: mov  %eax,0x18(%rsp)   # ok

If compiled with gcc, there's no such preliminary movq. So the question is, what is so obviously flying way over my head?
Re: x86-64 linux, gomp ICE in trunk
On 1/25/06, Diego Novillo [EMAIL PROTECTED] wrote: You'll need to do what this message suggests. http://gcc.gnu.org/bugzilla/ Sorry for the lag. http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25983
x86-64 linux, gomp ICE in trunk
Hello, I wanted to play a bit with OpenMP after fighting a (long) while to get a 4.2 snapshot compiled on my debian64 box... alas... Fresh svn checkout, multilib disabled because it's a no go on my box.

  # /usr/local/gomp/bin/g++ -v
  Using built-in specs.
  Target: x86_64-unknown-linux-gnu
  Configured with: ../configure --prefix=/usr/local/gomp --enable-languages=c++ --enable-threads=posix --with-system-zlib --enable-__cxa_atexit --disable-multilib --enable-bootstrap --with-gcc --with-gnu-as --with-gnu-ld
  Thread model: posix
  gcc version 4.2.0 20060124 (experimental)

testcase:

  int toto() {
    int a=0;
    #pragma omp single
    {
      for (int i=0; i<10; ++i) a += i;
    }
    return a;
  }
  int main() { return toto(); }

  /usr/local/gomp/bin/g++ -fopenmp main.cc -o omp
  main.cc: In function 'int toto()':
  main.cc:5: internal compiler error: in cp_parser_pragma, at cp/parser.c:17629
  Please submit a full bug report, with preprocessed source if appropriate.
  See <URL:http://gcc.gnu.org/bugs.html> for instructions.

Command line options or the precise omp pragma used don't really matter, i get a crash on any valid omp directive; gcc-4.2-20060121 is ICE happy the same way. As a side note, while trying to get the compiler built with some debug info, i've hit a case where it couldn't find libgomp.spec once installed (a --disable-shared configuration). If there's a workaround that would make my day :)
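For reference, on a compiler where C++ OpenMP support is actually present, a variant of the testcase could look like the sketch below. The enclosing parallel region is an addition for illustration (the original uses an orphaned single, which is also valid); the single construct makes exactly one thread run the loop, and its implicit barrier ensures the sum is complete before the region ends. Built without -fopenmp the pragmas are simply ignored and the result is the same.

```cpp
#include <cassert>

// Variant of the testcase from the post, wrapped in a parallel region.
int toto()
{
    int a = 0;                        // shared between the threads of the region
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int i = 0; i < 10; ++i)
                a += i;               // executed by one thread only
        }                             // implicit barrier at the end of single
    }
    return a;
}

// Small check; call it from main().
void omp_self_check()
{
    assert(toto() == 45);             // 0 + 1 + ... + 9
}
```

Something like g++ -fopenmp main.cc -o omp would build it with OpenMP enabled.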
Re: x86-64 linux, gomp ICE in trunk
On 1/25/06, Richard Henderson [EMAIL PROTECTED] wrote: c++ gomp is not merged to mainline. Indeed, that makes up for a solid reason not to work. Should i hold my breath?
Re: x86-64 linux, gomp ICE in trunk
On 1/25/06, Diego Novillo [EMAIL PROTECTED] wrote: A couple more weeks, or you can try the gomp branch. Thanks, will do. Hopefully i won't fall for the ICE trick that easily next time.
Re: x86-64 linux, gomp ICE in trunk
On 1/25/06, Diego Novillo [EMAIL PROTECTED] wrote: Well, the compiler still shouldn't ICE. I'll send a fix shortly. I know i've exhausted my pseudo-ICE quota for the day, but i have another candidate knocking at the door with insistence:

  src/raytrace_packet.cpp: In member function 'void rt::raytracer_t::prender()':
  src/raytrace_packet.cpp:1411: internal compiler error: Segmentation fault
  Please submit a full bug report

  # /usr/local/gomp/bin/g++ -v
  Using built-in specs.
  Target: x86_64-unknown-linux-gnu
  Configured with: ../configure --prefix=/usr/local/gomp --enable-languages=c++ --enable-threads=posix --with-system-zlib --enable-__cxa_atexit --disable-multilib --enable-bootstrap --with-gcc --with-gnu-as --with-gnu-ld
  Thread model: posix
  gcc version 4.2.0-gomp-20050608-branch 20060119 (experimental) (merged 20060119)

While i'm sure i'm terribly wrong one way or another, i'd appreciate some pointers.
Re: Constant propagation and address arithmetic
On 5/8/05, Steven Bosscher [EMAIL PROTECTED] wrote: Hi, Hello, I have looked at the GCSE CPROP passes with CSE path following disabled (-O1 -fgcse --param max-cse-path-length=1). The input code are the cc1-i files from 20040726 (with checking enabled).

While that discussion flies way above my head, it seems to be about gcse and i have enough grievance about it to jump in. I've just pinged PR19680 (because it's still there) and just for the sake of it i've tried the newly reported PR21463 with -fno-gcse and it's quite interesting.

as reported, with gcse:

  00400610 <foo_t<float>::bar_ref(float, float)>:
  400610: ucomiss 0x4(%rdi),%xmm1
  400614: lea     0x4(%rdi),%rax
  400618: lea     0xfff8(%rsp),%rdx
  40061d: movss   %xmm0,0xfffc(%rsp)
  400623: movss   %xmm1,0xfff8(%rsp)
  400629: movaps  %xmm1,%xmm2
  40062c: cmova   %rdx,%rax
  400630: movss   (%rax),%xmm1
  400634: ucomiss %xmm1,%xmm0
  400637: ja      400641 <foo_t<float>::bar_ref(float, float)+0x31>
  400639: lea     0xfffc(%rsp),%rax
  40063e: movaps  %xmm0,%xmm1
  400641: ucomiss (%rdi),%xmm2
  400644: cmova   %rdi,%rdx
  400648: movss   (%rdx),%xmm0
  40064c: ucomiss %xmm0,%xmm1
  40064f: jbe     400655 <foo_t<float>::bar_ref(float, float)+0x45>
  400651: movss   (%rax),%xmm0
  400655: repz retq

without:

  00400610 <foo_t<float>::bar_ref(float, float)>:
  400610: movss   %xmm0,0xfffc(%rsp)
  400616: lea     0xfff8(%rsp),%rcx
  40061b: lea     0x4(%rdi),%rax
  40061f: movss   %xmm1,0xfff8(%rsp)
  400625: lea     0xfffc(%rsp),%rdx
  40062a: ucomiss 0x4(%rdi),%xmm1
  40062e: cmova   %rcx,%rax
  400632: ucomiss (%rax),%xmm0
  400635: cmovbe  %rdx,%rax
  400639: ucomiss (%rdi),%xmm1
  40063c: movss   (%rax),%xmm0
  400640: cmovbe  %rcx,%rdi
  400644: ucomiss (%rdi),%xmm0
  400647: cmova   %rax,%rdi
  40064b: movss   (%rdi),%xmm0
  40064f: retq

Again, sorry for hijacking that thread, but gcse is a convenient scapegoat for most of my performance/codegen problems and i'd like to know if there's mid-term hope. Regards, Thierry.
unexpected speedup from gcc-4.1-20050508
Hello, after setting up the latest snapshot, i was caught off guard as all my numbers were off (and usually it's better than a swiss clock). So, i've double checked, stripped some cruft from the compiler command line and pitted various snapshots (20050410, 20050424, 20050501) against 20050508 in my app. Now i can say without doubt that on x86-64 linux, on a k8, i reliably get between 3% (rendering path, mostly vectorized SSE) and 5% (kd-tree compiler, branchy memory heavy code) performance boost. Without touching a single line of code. I don't know, yet, who's the unsung hero i should thank or what he/she did, or if that result can be correlated with any other benchmark, but that won't stop me from sending my warmest kudos his/her way. Feel free to fill in the blanks :) Regards.
Re: GCC 4.0, Fast Math, and Acovea
On 5/3/05, Scott Robert Ladd [EMAIL PROTECTED] wrote: tbp wrote: Granted, POV-Ray may not be state-of-the-art, but then, I know quite a few people who say that (even legitimately) about just about every software product in existence. True. Still, POV has evolved from dkbtrace and it shows sometimes. If you have a suggestion for better benchmarks, I'm listening. Is your ray tracer available? It's way too rough for general consumption yet, and quite specialized anyway (very large geometry). With specific kludges for each compiler, here's the hierarchy for the hand vectorized rendering:

  ia32: icc8.1, gcc4.1 (-5% at least), msvc2k3 (-20%)
  x86-64: gcc4.1, icc9.0 (-7% at least)

It varies a bit, depending on features being hammered by specific scenes, but the order is unchanged (note that the x86-64 version has only been tested on k8 so far). GCC shows an edge in the SAH kdtree compiler part (branchy code) on x86-64, with a 40% improvement over the ia32 versions (and icc9.1 which definitely gets lost). That's more than welcome, given the time it takes to produce those freaking trees :) Anecdotally, gcc is the only one to get the parsing of large memory mapped files right (or put another way, the idiom used), being 2x faster than every other compiler on every platform.
Re: GCC 4.0, Fast Math, and Acovea
On 5/2/05, Scott Robert Ladd [EMAIL PROTECTED] wrote: You might want to a look at my just-published review of GCC 4.0, where I compare it's performance on some well-known applications, including LAME and POV-Ray, on Pentium 4 and Opteron. In terms of POV-Ray, 4.0 produced a smaller executable that was slightly slower than did 3.4.3. You can find the full review at: While POV has an impressive array of features and is quite valuable as a large FP intensive legacy standard for compiler writers (or raytracer writers :), i wouldn't consider it state of the art or a speed demon either; to put it bluntly, it's incredibly slow. For those reasons i don't consider it representative at all of the kind of computational performance gcc can extract from a modern CPU: again, in my own experience, gcc4.x is light years ahead of previous versions. Now i'm not familiar enough with the other cited sources to comment.
Re: GCC 4.0, Fast Math, and Acovea
On 4/29/05, Uros Bizjak [EMAIL PROTECTED] wrote: Hello Scott! Hello Scott & Uros, Specifically, the -funsafe-math-optimizations flag doesn't work correctly on AMD64 because the default on that platform is -mfpmath=sse. Without specifying -mfpmath=387, -funsafe-math-optimizations does not generate inline processor instructions for most floating-point functions. [snip] It was found that moving data from SSE registers to X87 registers (and back) only to call an x87 builtin degrades performance. Because of this, x87 builtins are disabled for -mfpmath=sse and a normal libcall is issued for sin(), etc functions. If someone wants to use x87 builtins, then _all_ math operations should be done in x87 registers to avoid costly SSE-x87 moves. Shameless plug with my own performance analysis regarding SSE on x86-64. I've ported my coherent raytracer, which mostly uses intrinsics in the hot path (and no transcendentals). While gcc4.x compiled binaries are ~5% slower than those compiled with icc8.1 on ia32 (best case), it's the other way around on x86-64, if not more (on my opteron with icc8.1 and beta 9.0). Obviously there's much less pressure on the (cough weak cough) register allocator and in the end the generated code is way leaner. My only gripe with fast-math is that it's the only way to enable some optimizations while making NaNs verboten; couple that with the lack of cross unit IPO and you're stuck with a kind of nasty global switch (unless you have room for some function calls).
Re: gcc4, static array, SSE alignement
On Apr 6, 2005 3:18 AM, James E Wilson [EMAIL PROTECTED] wrote: I would guess a limitation of cygwin binutils support, or perhaps of Windows itself. Binutils, perhaps. Windows certainly not, as msvc2k3 and icc8.1 don't have such an issue with the same code. This seems to work fine on linux. If I compile a simple example using __alignof__, I see that the compiler is assuming 16-byte alignment. If I compile with -S, I see that the compiler is giving them 32-byte alignment (probably for better cache alignment). If I run objdump -x on the a.out file, I see that .bss section has 2**5 (32-byte) alignment. All is as it should be.

  Sections:
  Idx Name      Size      VMA       LMA       File off  Algn
    0 .text     0003e754  00401000  00401000  0400      2**4
                CONTENTS, ALLOC, LOAD, READONLY, CODE, DATA
    1 .data     4634      0044      0044      0003ec00  2**4
                CONTENTS, ALLOC, LOAD, DATA
    2 .rdata    4884      00445000  00445000  00043400  2**4
                CONTENTS, ALLOC, LOAD, READONLY, DATA
    3 .bss      8fc0      0044a000  0044a000            2**4
                ALLOC
    4 .idata    1984      00453000  00453000  00047e00  2**2
                CONTENTS, ALLOC, LOAD, DATA
    5 .stab     00169908  00455000  00455000  00049800  2**2
                CONTENTS, READONLY, DEBUGGING, NEVER_LOAD, EXCLUDE
    6 .stabstr  001c39e1  005bf000  005bf000  001b3200  2**0
                CONTENTS, READONLY, DEBUGGING, NEVER_LOAD, EXCLUDE

Gcc and the toolchain used to have lots of issues wrt alignment, but i thought they were a thing of the past. As far as i can see, everything is fine regarding section alignments. A real bug report, as per http://gcc.gnu.org/bugs.html would be useful here, so we can check. In particular, a testcase to reproduce the problem, and exactly what you believe is wrong with the output. Yep, but i was testing the water. I mean i have lots of other 16-byte aligned data (static, extern, const or not and whatnot) in there and the only problematic one is this semi large static. So, because that's the largest, i thought i'd crossed some magic size threshold. I'll try to pinpoint the problem a bit better.
Re: gcc4, static array, SSE alignement
On Apr 6, 2005 2:08 PM, tbp [EMAIL PROTECTED] wrote: I'll try to pinpoint the problem a bit better. Alas, since the other day the code using that static array has changed a bit and i can't reproduce the bug. So, after all, it really was gcc's fault. I'll try to dig up the original version.
gcc4, namespace and template specialization problem
Hello, i'm a bit puzzled by the behaviour of gcc4 (old 4.0 and recent 4.1 snapshots) regarding how template specializations should be qualified wrt namespaces:

  namespace dummy {
    struct foo {
      template <int i> void f() {}
    };
  }
  template <> void dummy::foo::f<666>() {}

  testcase.cpp:30: error: specialization of 'template<int i> void dummy::foo::f()' in different namespace
  testcase.cpp:27: error: from definition of 'template<int i> void dummy::foo::f()'

It has to be written this way:

  namespace dummy {
    template <> void dummy::foo::f<666>() {}
    or
    template <> void foo::f<666>() {}
  }

Other compilers (gcc 3.4.x, msvc2k3, icc8.1) don't whine. Am i missing something obvious?
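A compilable sketch of the form gcc4 accepts follows, with the explicit specialization placed inside the namespace that encloses the class. To make the effect observable, f() returns an int here instead of void; that change and the function spec_self_check are additions for illustration only.

```cpp
#include <cassert>

namespace dummy {
    struct foo {
        // Primary member template: returns its template argument.
        template <int i> int f() { return i; }
    };

    // Explicit specialization, declared in the namespace of which the
    // enclosing class is a member.
    template <> int foo::f<666>() { return -1; }
}

// Small check; call it from main().
void spec_self_check()
{
    dummy::foo x;
    assert(x.f<1>() == 1);      // primary template
    assert(x.f<666>() == -1);   // explicit specialization
}
```

Writing the specialization this way is accepted by both the stricter and the more lenient compilers mentioned in the thread.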
Re: gcc4, namespace and template specialization problem
On Apr 4, 2005 11:54 AM, Nathan Sidwell [EMAIL PROTECTED] wrote: Am i missing something obvious? well, not 'obvious', but that is what [14.7.3]/2 says. I especially don't quite get why specializations have to be defined that way when non-specialized versions don't have to, i.e. this is legit:

  namespace dummy {
    struct foo {
      template <int i> void f();
    };
  }
  template <int i> void dummy::foo::f() { }

But if that's the law... Thanks for the clue.
Re: gcc4, namespace and template specialization problem
On Apr 4, 2005 12:21 PM, Nathan Sidwell [EMAIL PROTECTED] wrote: That's not a declaration, it's a definition of an already declared fn. the case you had was a definition that was _also_ a declaration. [...] See the difference? Yes, and i know about it... Although it is kind of quirky that you can declare member function specializations outside of the class, but not outside of the namespace. .. but that's that inconsistency that puzzled/confused me. Sorry for the noise, but i don't own a copy of that byzantine standard.
Re: gcc4, namespace and template specialization problem
On Apr 4, 2005 12:50 PM, Jonathan Wakely [EMAIL PROTECTED] wrote: GCC 3.4 *does* whine, and I think Intel will in strict mode. I can't get either gcc 3.4.1 (-Wall) or icc 8.1 with the highest warning level enabled to whine about it.