Re: g++, trunk, recent weird mismatch for arguments with forwarded declaration when attributes are involved

2010-09-14 Thread tbp
Hello,
i know it's no good form to reply to self, or be that insistent, but
i've been hit again.

In the bug report discussion, i've been told by A. Pinski that, as of
now, forward declarations shall have matching attributes. That's fine,
i suppose. What's not is that:
 . that new behavior, as far as i know, isn't documented anywhere.
 . there's no warning or error at the declaration/definition point.
 . it's not consistent (non-compliance only fails under some unknown conditions).
 . when you finally get an error, it will be about a vaguely related
prototype mismatch somewhere.

Would it be possible to have some clarifications? Shall i file a PR
for a warning? Sacrifice a goat?

PS: now i know better, but i can assure you, anyone running into that
issue is bound to waste tremendous amounts of time trying to figure
out what's wrong with their prototype.


-Wdouble-promotion noise

2010-09-14 Thread tbp
Hello,
I could really use -Wdouble-promotion but, atm, it appears quite impractical,
$ cat double.cc
#include <cstdio>
void foo(...);
int main() {
  float f = 1;
  foo(f);
  printf("%f", f);
}
$ /usr/local/gcc-4.6-20100913/bin/g++ -Wdouble-promotion double.cc
double.cc: In function 'int main()':
double.cc:5:7: warning: implicit conversion from 'float' to 'double'
when passing argument to function [-Wdouble-promotion]
double.cc:6:16: warning: implicit conversion from 'float' to 'double'
when passing argument to function [-Wdouble-promotion]

... and the interesting bits are lost in the noise. I can't think of a
workaround.
So i have to ask: Is that how it's meant to be, or simply a temporary
shortcoming? Have i missed an obvious kludge?
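
One possible kludge, offered as a sketch rather than a sanctioned answer: -Wdouble-promotion is about implicit float to double conversions, so spelling the promotion out with a cast at the variadic call site should keep those calls quiet. The helper name format_float is made up for illustration, and every variadic call site still has to be touched by hand.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats a float through a variadic function.
// The static_cast makes the float -> double promotion explicit in the source,
// so it is no longer an implicit conversion at the call site.
std::string format_float(float f) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%f", static_cast<double>(f));
    return std::string(buf);
}
```

The value passed to the variadic function is the same double either way; only the way the conversion is written changes.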


Re: g++, trunk, recent weird mismatch for arguments with forwarded declaration when attributes are involved

2010-09-14 Thread tbp
On Tue, Sep 14, 2010 at 4:51 PM, Ian Lance Taylor i...@google.com wrote:
> Please do file a PR if there isn't one already.  Thanks.
I have no idea if that could happen outside C++ and couldn't find
anything relevant, thus
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45668
That's the best i can do.
And thanks for your assistance.


Re: -Wdouble-promotion noise

2010-09-14 Thread tbp
On Tue, Sep 14, 2010 at 4:58 PM, Ian Lance Taylor i...@google.com wrote:
> This question is not appropriate for the mailing list g...@gcc.gnu.org.
> ...
> This is among the kinds of things which -Wdouble-promotion is documented
> to warn about, so, yes, this is how it's meant to be.
Honestly i've pondered not sending that previous mail, but, i guess i
just disagreed with the prescribed use-case; frankly i wonder how one
can expect end-users to forgo or patch each & every variadic function
to use it.
I'll get back to grepping for *pd/*sd and the likes as it is infinitely
more practical.
Sorry for the fuss.


Re: -Wdouble-promotion noise

2010-09-14 Thread tbp
On Tue, Sep 14, 2010 at 7:14 PM, Ian Lance Taylor i...@google.com wrote:
> What is it that you want?
I'd like to have a warning for when a value of type float is
implicitly promoted to double, for performance reasons (on x86). Note
that in that context, caring about variadic functions makes little
sense to begin with (by the time prologue is done, any notion of
performance is a fairy tale).
I can't use
  -Wconversion, way too much noise.
  -Wunsuffixed-float-constants, unavailable in C++.
  -fsingle-precision-constant, indiscriminate and wrong.
either, and, because there may be some debugging/pretty-printing code
around, -Wdouble-promotion is useless.
So, i'm back to grep.

Tho, i got to say, it was really evil to tease me like that with
-Wdouble-promotion :)


Re: -Wdouble-promotion noise

2010-09-14 Thread tbp
On Tue, Sep 14, 2010 at 8:44 PM, Ian Lance Taylor i...@google.com wrote:
> Let me put it a different way: what is it that you want, expressed in
> terms of C/C++ code?  What should the compiler be warning about?
Hmm. I think the provided example captures most of what i care about,
  float area(float radius) { return 3.14159 * radius * radius; }
and even forgotten/disguised uses of, say, M_PI, would fall in that
category after-all.

>>   -Wunsuffixed-float-constants, unavailable in C++.
> It seems that this could be added to C++ easily enough.
If i'm not mistaken (i've never put it to real use)
-Wunsuffixed-float-constants would handle that. So, it wouldn't be as
airtight as -Wdouble-promotion but still good enough and a fantastic
improvement.
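
To make the area() case concrete, here is a small sketch (the names area_double and area_float are illustrative only). With an unsuffixed constant the whole product is computed in double and narrowed on return; with an f suffix it stays in single precision. The static_asserts only check the expression types, they say nothing about which instructions a given compiler emits.

```cpp
#include <type_traits>

// Unsuffixed constant: radius is promoted to double, the multiplications are
// done in double, and the result is narrowed back to float on return.
inline float area_double(float radius) { return 3.14159 * radius * radius; }

// Suffixed constant: every operand is a float, so no promotion takes place.
inline float area_float(float radius) { return 3.14159f * radius * radius; }

// The type of the product shows where the promotion comes from.
static_assert(std::is_same<decltype(3.14159 * 1.0f), double>::value,
              "double constant times float yields double");
static_assert(std::is_same<decltype(3.14159f * 1.0f), float>::value,
              "float constant times float yields float");
```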


Re: -Wdouble-promotion noise

2010-09-14 Thread tbp
On Tue, Sep 14, 2010 at 9:47 PM, Ian Lance Taylor i...@google.com wrote:
> So far my best guess is that your definition is
> warn about implicit conversions from float to double except for those
> conversions caused by default argument promotion applied to arguments
> passed to unnamed parameters.  Is that what you want?
Hypothetically i'd grok what you just wrote and answer clearly, but
i'm no language lawyer and only use compilers: I don't know.
I know what i don't want: be surprised to find unwanted
non-single-precision instructions or data in critical parts of the
binary at the far end of the tool chain. That's my definition ;)


g++, trunk, recent weird mismatch for arguments with forwarded declaration when attributes are involved

2010-09-10 Thread tbp
Since about 2010/09/07 i've had a weird error with a mismatched
prototype involving an argument once forward declared as 'class foo;'
and later defined as 'class __attribute((aligned(16))) foo {...};', a
bit like
namespace n1 {
  class fwd;
  namespace n2 {
    class foo {
      void bar(fwd &);
    };
  }
  class __attribute((aligned(16))) fwd {};
}
// error prototype for... candidate is... would be here.
void n1::n2::foo::bar(n1::fwd &) {}

Except that's no testcase because it works: to kludge around, i have
to forward declare with matching attributes (or get an error).
I fail to reduce it, and the source code is large & fugly; i'd prefer
not to have to disclose, hence no formal bug report.
My hope being that will ring a bell for whoever's responsible :)

PS: x86-64/32, linux, -std=c++0x.
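
For reference, the kludge described above (forward declaring with matching attributes) would look roughly like this. It is only a sketch based on the reduced snippet, which by itself does not trigger the error; bar is made public here purely so it can be called from outside.

```cpp
namespace n1 {
  // The forward declaration repeats the attribute used by the definition.
  class __attribute__((aligned(16))) fwd;
  namespace n2 {
    class foo {
    public:
      void bar(fwd &);
    };
  }
  class __attribute__((aligned(16))) fwd {};
}

// Out-of-class definition whose prototype must match the declaration above.
void n1::n2::foo::bar(n1::fwd &) {}

static_assert(alignof(n1::fwd) == 16, "definition keeps its 16-byte alignment");
```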


Re: g++, trunk, recent weird mismatch for arguments with forwarded declaration when attributes are involved

2010-09-10 Thread tbp
On Fri, Sep 10, 2010 at 5:20 PM, Ian Lance Taylor i...@google.com wrote:
> Since you do have a test case, you could try using a tool like delta to reduce
> it to something that you can share.
My delta-fu is too weak to get anywhere with an error so easily
produced (mismatched prototype, plus g++ senseless diagnostic doesn't
help either).
I've given up and submitted http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45642.
But thanks for the input.


Re: g++ 4.5.0, end-user disappointment and interrogations

2010-04-22 Thread tbp
On Thu, Apr 22, 2010 at 6:36 AM, Paolo Carlini paolo.carl...@oracle.com wrote:
> In any case, keep in mind that constexpr are not available yet, maybe the
> parser can already recognize some uses but the semantics is not done yet.
Ah, so it was nothing but smoke & mirrors. Thanks for the clarification.


Re: g++ 4.5.0, end-user disappointment and interrogations

2010-04-22 Thread tbp
On Thu, Apr 22, 2010 at 7:23 AM, Xinliang David Li davi...@google.com wrote:
> The dead store problem seems to be a regression in SRA.
Thanks for looking into it.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43846


Re: g++ 4.5.0, end-user disappointment and interrogations

2010-04-22 Thread tbp
On Fri, Apr 23, 2010 at 5:48 AM, Dave Korn
dave.korn.cyg...@googlemail.com wrote:
> Dear tbp, please don't accuse people of being deceptive or fraudulent, it is
> not a nice thing to do.
Indeed. That wasn't the intent.
Seeing libstdc++ being combed over for constexpr, i've conveniently
fooled myself into believing my hopes were realised.
That i am ignorant, hermetic to facts and terminally clueless is, i
fear, congruent to my current condition of end-user.

PS: If not for the risk of further aggravation, i'd contend that i
have nowhere accused people; but that's best forgotten.


g++ 4.5.0, end-user disappointment and interrogations

2010-04-21 Thread tbp
Hello,

having finally built myself a 4.5.0 (linux x86-64), i've quickly tried
it on some of my code and it soon became apparent some things weren't
for the better.
Here's my febrile attempt to sum up what surprised me
$ cat huh.cc
#include <cmath>
#if __GNUC__ * 100 + __GNUC_MINOR__ < 405
#define constexpr
#endif
struct foo_t {
float x, y, z;
foo_t() {}
constexpr foo_t(float a, float b, float c) : x(a),y(b),z(c) {}
friend foo_t operator*(foo_t lhs, float s) { return foo_t(lhs.x*s,
lhs.y*s, lhs.z*s); }
friend float dot(foo_t lhs, foo_t rhs) { return lhs.x*rhs.x +
lhs.y*rhs.y + lhs.z*rhs.z; }
};
struct bar_t {
float m[3];
bar_t() {}
constexpr bar_t(float a, float b, float c) : m{a, b, c} {}
friend bar_t operator*(bar_t lhs, float s) { return bar_t(lhs.m[0]*s,
lhs.m[1]*s, lhs.m[2]*s); }
friend float dot(bar_t lhs, bar_t rhs) { return lhs.m[0]*rhs.m[0] +
lhs.m[1]*rhs.m[1] + lhs.m[2]*rhs.m[2]; }
};
namespace {
template<typename T> float magsqr(T v) { return dot(v, v); }
template<typename T> T norm(T v) { return v*(1/std::sqrt(magsqr(v))); }
constexpr foo_t foo(1, 2, 3);
constexpr bar_t bar(1, 2, 3);
}
void frob1(const foo_t &a, foo_t &b) { b = norm(a); }
void frob2(const bar_t &a, bar_t &b) { b = norm(a); }
int main() { return 0; }
$ g++ -std=c++0x -O3 -march=native -ffast-math -mno-recip huh.cc

a) Code produced for frob1 and frob2 differ (a dead store isn't
removed with the array variant), when they used not to (for example
with g++ 4.4.1); that's a really annoying regression (can't index
foo_t members etc...).
b) Note the rsqrtss in there: -ffast-math turns
-funsafe-math-optimizations on which, now, also turns on
-freciprocal-math; the old -m[no-]recip switch that used to direct the
emission of reciprocals is useless; no warnings of any sort emitted.
The only mention of the new behaviour is in the manual (nothing in
http://gcc.gnu.org/gcc-4.5/changes.html).
c) constexpr apparently makes no difference, stuff still gets
constructed/stored at runtime. Vectors aren't allowed either: error:
parameter '__vector(4) float v' is not of literal type; even if that's
what the standard say, it would have been handy.

Q:
Is the dead store removal/fuss with arrays a known/transient issue
soon to be fixed (again)?
Would it be possible to foolproof
-ffast-math/-freciprocal-math/-mrecip in some way?
What's the deal with constexpr (or what can i reasonably expect)?
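
For comparison, on a compiler where constexpr is fully implemented (not the 4.5.0 discussed here), one way to check that such an object really is a compile-time constant is to use it in a static_assert. A minimal sketch, with vec3, dot, unit_x and v123 as illustrative names:

```cpp
struct vec3 {
    float x, y, z;
    constexpr vec3(float a, float b, float c) : x(a), y(b), z(c) {}
};

// A single return statement keeps this valid under the C++11 constexpr rules.
constexpr float dot(const vec3 &l, const vec3 &r) {
    return l.x * r.x + l.y * r.y + l.z * r.z;
}

constexpr vec3 unit_x(1.0f, 0.0f, 0.0f);
constexpr vec3 v123(1.0f, 2.0f, 3.0f);

// This only compiles if construction and dot() are evaluated at compile time.
static_assert(dot(v123, v123) == 14.0f, "1 + 4 + 9, computed by the compiler");
```

If the static_assert is accepted, nothing about v123 needs to be constructed or stored at runtime for that expression.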


Re: __attribute__((optimize)) and fast-math related oddities

2009-10-19 Thread tbp
On Mon, Oct 19, 2009 at 7:34 PM, Ian Lance Taylor i...@google.com wrote:
> Please file a bug report.
> __attribute__((optimize())) is definitely only half-baked.
Apparently the code i've posted is just a variation around that 1 year
old PR 37565 and if that doesn't work, worrying about the rest is
entirely futile.
Half baked you say? It's comforting to see that much optimism but
couldn't the doc be adjusted a bit to reflect the fact that the baker
got hit by a bus or something?

PS: i'm sorry that i've missed that PR in my search, but i presumed
the issue was much more specific.


__attribute__((optimize)) and fast-math related oddities

2009-10-17 Thread tbp
Hang on while i put on my flame-proof suit. There.
Merrily trying to make a test-case showing how unmanageable it is to
try to override *math* flags per function, i soon had to stop
because...
$ cat amusing.cc
#include <cmath>
static __attribute__((optimize("-fno-associative-math"))) double
foo1(double x) { return (x + pow(2, 52)) - pow(2, 52); }
static __attribute__((noinline)) double bar1(double x) { return foo1(x); }
#ifdef HUH
static __attribute__((optimize("-fno-associative-math"))) double
foo2(double x) { return (x + pow(2, 52)) - pow(2, 52); }
static __attribute__((noinline, optimize("-ffast-math"))) double
bar2(double x) { return foo2(x); }
#endif
int main() {
double x = 1.1;
if (bar1(x) == x) return 1;
#ifdef HUH
if (bar2(x) == x) return 2;
#endif
return 0;
}
$ g++-4.4 -O2 amusing.cc -ffast-math && ./a.out; echo $?
0
$ g++-4.4 -O2 amusing.cc -ffast-math -DHUH && ./a.out; echo $?
1
$ g++-4.4 -O2 amusing.cc && ./a.out; echo $?
0
$ g++-4.4 -O2 amusing.cc -DHUH && ./a.out; echo $?
1
... made even less sense than expected.
I got that like for other 'incompatible' flags, conflicting math flags
should prevent inlining, only they don't. And it's all weird. But that
one takes the cake. Could someone tell me what the fuss is about?

$ g++-4.4 -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian
4.4.1-6' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--enable-shared --enable-multiarch --enable-linker-build-id
--with-system-zlib --libexecdir=/usr/lib --without-included-gettext
--enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4
--program-suffix=-4.4 --enable-nls --enable-clocale=gnu
--enable-libstdcxx-debug --enable-mpfr --enable-objc-gc
--with-arch-32=i486 --with-tune=generic --enable-checking=release
--build=x86_64-linux-gnu --host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 4.4.1 (Debian 4.4.1-6)
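
Some background on why the test compares against x: adding and then subtracting 2^52 rounds a small non-negative double to the nearest integer, because doubles of that magnitude are spaced exactly 1 apart. If the compiler reassociates the expression it collapses to plain x, which is what an exit status of 1 or 2 signals. The sketch below uses a volatile temporary, purely as an illustration, to keep the two operations apart regardless of flags; round_via_magic and kTwo52 are made-up names.

```cpp
// 2^52: at this magnitude consecutive doubles differ by exactly 1.
const double kTwo52 = 4503599627370496.0;

// Rounds a small non-negative double to the nearest integer under the default
// round-to-nearest mode. The volatile store forces the intermediate sum to be
// materialised as a double, so the add and subtract cannot be folded away.
double round_via_magic(double x) {
    volatile double t = x + kTwo52;
    return t - kTwo52;
}
```

With x = 1.1 the result is 1.0, so comparing it to x tells whether the rounding step survived optimization.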


How are we supposed to play along the autovectorizer in c++? (alignment issues)

2008-07-29 Thread tbp
Hello.
the autovectorizer is enabled by default in g++ 4.3 and does a fine
job most of the time. Except it gets mightily pissed off if you dare
to tweak the alignment and after much experimentation i haven't yet
devised how to plug all the holes.
This silly example shows where things start to get ugly
#  cat autovec.cc
enum { N = 4, align_to = 16/sizeof(char) };
typedef float scalar_type;
struct foo_t {
scalar_type m[N];
foo_t operator +(const foo_t &rhs) const { foo_t v(*this); for
(unsigned i=0; i<N; ++i) v.m[i] += rhs.m[i]; return v; }
};
struct  bar_t {
scalar_type  __attribute__((aligned(sizeof(char)*align_to))) m[N];
bar_t operator +(const bar_t &rhs) const { bar_t v(*this); for
(unsigned i=0; i<N; ++i) v.m[i] += rhs.m[i]; return v; }
};

template<typename T> __attribute__((noinline)) void foobar(T &dst,
const T *src) {
T v = {{ 0 }};
for (unsigned i=0; i<64; ++i) v = v + src[i];
dst = v;
}

int main(int argc, char *argv[]) {
foo_t *p((foo_t*) argv);
bar_t *q((bar_t*) argv);
foobar(*p, p + 1);
foobar(*q, q + 1);
return 0;
}
# g++ -O3 -march=native autovec.cc # g++ 4.3.1, x86_64

There's not much to say about foobar<foo_t> and the addition in
foobar<bar_t> gets somewhat vectorized but
  400620:   89 54 24 f4         mov    %edx,-0xc(%rsp)
  400624:   89 4c 24 f0         mov    %ecx,-0x10(%rsp)
  400628:   44 89 44 24 ec      mov    %r8d,-0x14(%rsp)
  40062d:   44 89 4c 24 e8      mov    %r9d,-0x18(%rsp)
  400632:   0f 28 c1            movaps %xmm1,%xmm0
  400635:   0f 12 04 06         movlps (%rsi,%rax,1),%xmm0
  400639:   0f 16 44 06 08      movhps 0x8(%rsi,%rax,1),%xmm0
  40063e:   48 83 c0 10         add    $0x10,%rax
  400642:   41 0f 58 02         addps  (%r10),%xmm0
  400646:   48 3d 00 04 00 00   cmp    $0x400,%rax
  40064c:   41 0f 29 02         movaps %xmm0,(%r10)
  400650:   8b 54 24 f4         mov    -0xc(%rsp),%edx
  400654:   8b 4c 24 f0         mov    -0x10(%rsp),%ecx
  400658:   44 8b 44 24 ec      mov    -0x14(%rsp),%r8d
  40065d:   44 8b 4c 24 e8      mov    -0x18(%rsp),%r9d
  400662:   75 bc               jne    400620 <void foobar<bar_t>(bar_t&, bar_t const*)+0x20>

as you can see there's a lot of undue load/store. And that's for a POD
(or something really looking like one).
So, you start fixing that with some looping copy ctor/operator (surely
losing the POD property in the process) and so on. Doing that i can
fix most reload issues, but stores are much more elusive (note that it
depends on the underlying type & its natural alignment).
Ideally i'd like PODs to remain PODs, and synthesized ctor/operators
to be efficient (ie not falling back to using gpr based memcpy when
everything is in an XMM register already), or at least a consistent
way how such ctor/operators can be written (and dead store removed).

Briefly: how am i supposed to decorate my structures with larger
alignment and not royally piss off the autovectorizer (and g++ in
general)?
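
As a point of reference for the "PODs should remain PODs" wish, the sketch below checks at compile time that raising the alignment of the member does not by itself cost the struct its trivial-copy and standard-layout properties. It swaps the GCC attribute for the standard C++11 alignas specifier, which is an assumption of this sketch and not what the original code uses, and it says nothing about the code the vectorizer ends up generating.

```cpp
#include <type_traits>

const unsigned N = 4;

struct bar_t {
    alignas(16) float m[N];  // stands in for __attribute__((aligned(16)))
};

static_assert(alignof(bar_t) == 16, "member alignment propagates to the struct");
static_assert(sizeof(bar_t) % 16 == 0, "size is padded to the alignment");
static_assert(std::is_trivially_copyable<bar_t>::value, "still trivially copyable");
static_assert(std::is_standard_layout<bar_t>::value, "still standard layout");

// Element-wise addition written as a free function, so bar_t keeps its
// implicitly generated copy constructor and assignment operator.
inline bar_t add(const bar_t &a, const bar_t &b) {
    bar_t v = a;
    for (unsigned i = 0; i < N; ++i) v.m[i] += b.m[i];
    return v;
}
```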


Re: censored naked SSE reciprocals, -mrecip

2007-12-29 Thread tbp
On Dec 29, 2007 4:35 PM, Uros Bizjak [EMAIL PROTECTED] wrote:
> Attached patch fixes these problems by using correct shortcuts when
> generating intrinsic functions.
>
> Patch was bootstrapped and regression tested with {,-m32} on
> x86_64-pc-linux-gnu. Patch is committed to SVN.
>
> Thanks a lot for your report,
Now that's blazing fast after-sales service. And i get no less than
two undocumented but functional builtins (as opposed to, say
__builtin_ia32_movddup, which is documented but dysfunctional) for the
same price.
As an extremely satisfied customer, i want to nominate you for the
2007 man of the year short list.


censored naked SSE reciprocals, -mrecip

2007-12-28 Thread tbp
Merry xmas,

i lately had some use for -mrecip but it turned out to come with all
sorts of strings attached and apparently no opt-out. Briefly, barring
inline asm, i can't get gcc to emit those ops without a NR fixup.

# cat src/pr-recip.c
#include <xmmintrin.h>
typedef float v4sf_t __attribute__ ((__vector_size__ (16)));

__m128 foo(__m128 a) { return _mm_sqrt_ps(a); }
__m128 bar(__m128 a) { return _mm_rsqrt_ps(a); }
__m128 baz(__m128 a) { return _mm_rcp_ps(a); }

v4sf_t nope1(v4sf_t a) { return __builtin_ia32_sqrtps(a); }
v4sf_t nope2(v4sf_t a) { return __builtin_ia32_rsqrtps(a); }
v4sf_t allright(v4sf_t a) { return __builtin_ia32_rcpps(a); }

int main() { return 0; }
# /usr/local/gcc-4.3-20071221/bin/gcc -march=native -ffast-math
-mrecip -O2 src/pr-recip.c
... and as can be witnessed in the attached asm dump foo, bar, nope1,
nope2 get mangled (at least on x86-64 linux).

While i can somehow understand the logic behind the automatic
transformation of _mm_sqrt_ps - it can be argued that's what the user
has asked for - there's no obvious way to opt out. But then i really
don't understand why gcc feels the urge to tinker when i specifically
ask for a rsqrt.
To add insult to injury -mrecip, unlike fast-math, doesn't set any
macro so kludging around is a cat / mouse game.

Questions:
  a) is that really by design?
  b) what's the official way to dodge fixups when -mrecip is active?
  c) any chance for -mrecip to set __FAST_MATH_NONE_SHALL_PASS__ or something?


dump.asm
Description: Binary data


Re: Function specific optimizations call for discussion

2007-11-29 Thread tbp
On Nov 29, 2007 9:29 PM, Weddington, Eric [EMAIL PROTECTED] wrote:
> and I would also postulate the general embedded community, would
> *really* like to have this functionality, especially your Stage 1. There
> are many AVR, or embedded, applications where they are generally
> optimized for size, but have a time-critical function that needs to be
> optimized for speed.
I would personally, and i think it hasn't been evoked yet, *really*
like to be able to toggle fast-math (or related flags) per function,
basically for the same reason.


Re: recent troubles with float vectors bitwise ops

2007-08-24 Thread tbp

Mark Mitchell wrote:

> One option is for the user to use intrinsics.  It's been claimed that
> results in worse code.  There doesn't seem any obvious reason for that,
> but, if true, we should try to fix it; we don't want to penalize people
> who are using the intrinsics.  So, let's assume using intrinsics is just
> as efficient, either because it already is, or because we make it so.
I maintain that empirical claim; if i compare what a simple SOA hybrid
3-coordinate something gives when implemented via intrinsics, builtins and
vector and used as the basic component for a raytracer kernel, i get as
many codegen variations: register allocations differ, stack footprints
differ, branches & code organization differ, etc... so it's not that
surprising performance also differs. It appears the vector & builtin
(which isn't using __m128 but straight v4sf) implementations are mostly
on par while the intrinsic-based version is slightly slower.
Then you factor in how convenient it is, well... was, to use that vector
extension to write such something...


Another issue is that for MSVC and ICC, __m128 is a class, but not for 
gcc so you need more wrapping in C++ but if you know you can let some 
naked v4sf escape because the compiler always does the right thing with 
them.


Now while there's some subtleties (and annoying 'features'), i should 
state that gcc4.3, if you're careful, generates mostly excellent SSE 
code (especially on x86-64, even more so if compared to icc).



> We still have the problem that users now can't write machine-independent
> code to do this operation.  Assuming the operations are useful for
That and writing, say, a generic int,float,double something takes much
much more work.



> What are these operations used for?  Can someone give an example of a
> kernel that benefits from this kind of thing?
There's of course what Paolo Bonzini described, but also all kinds of tricks
that knowing such operations are extremely efficient encourages.
While it would be nice to have such builtins also operate on vectors, if 
only because they are so common, it's not quite the same as having full 
freedom and hardware features exposed.





Re: recent troubles with float vectors bitwise ops

2007-08-24 Thread tbp

Paolo Bonzini wrote:

> I'm not sure that it is *so* useful for a user to have access to it,
> except for specialized cases:
As there's other means, it may not be that useful but for sure it's 
extremely convenient.



> 2) selection operations on vectors, kind of (v1 <= v2 ? v3 : v4).  These
> can be written for example like this:
>
>     cmpleps xmm1, xmm2   ; xmm1 = xmm1 <= xmm2 ? all-ones : 0
>     andnps  xmm4, xmm1   ; xmm4 = xmm1 <= xmm2 ? 0 : xmm4
>     andps   xmm1, xmm3   ; xmm1 = xmm1 <= xmm2 ? xmm3 : 0
>     orps    xmm1, xmm4   ; xmm1 = xmm1 <= xmm2 ? xmm3 : xmm4
I suppose you'll find such a variant of a conditional move pattern in
every piece of SSE code.


But you can't condense bitwise vs float usage to a few patterns because 
when writing SSE, the efficiency of those operations is taken for granted.
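
To spell out what that compare/and/andnot/or pattern computes, here is a portable scalar model operating on the bit patterns of single floats. It is only a sketch of the logic, assuming 32-bit IEEE floats; the names bits_of, float_of and select_le are illustrative, and real SSE code would do the same on four lanes at once.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// Reinterpret a float as its bit pattern and back, without aliasing issues.
inline std::uint32_t bits_of(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
inline float float_of(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Branch-free form of (v1 <= v2 ? v3 : v4) built from bitwise ops only.
inline float select_le(float v1, float v2, float v3, float v4) {
    const std::uint32_t mask = (v1 <= v2) ? 0xFFFFFFFFu : 0u;  // compare: all-ones or zero
    const std::uint32_t taken = mask & bits_of(v3);            // and
    const std::uint32_t other = ~mask & bits_of(v4);           // and-not
    return float_of(taken | other);                            // or
}
```

The bitwise operators here apply to float data throughout, which is why having them available directly on float vectors is convenient.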




> If we have a good extension for vector arithmetic, we should aim at
> improving it consistently rather than extending it in unpredictable
> ways.  For example, another useful extension would be the ability to
> access vectors by item using x[n] (at least with constant expressions).

Yes, yes and yes.





Re: recent troubles with float vectors bitwise ops

2007-08-23 Thread tbp

Ross Ridge wrote:

> If I were tbp, I'd just code all his vector operations using intrinsics.
> The other responses in this thread have made it clear that GCC's vector
> arithmetic operations are really only designed to be used with the Cell
> Broadband Engine and other Power PC processors.
Thing is my main use for that extension is for a specialization (made on 
a rainy day out of boredom) of a basic something re-used all over in my 
code; the default implementation uses intrinsics.
It turns out, when benchmarked, that i get better code with the 
specialization. So it's more convenient and faster, win/win.


I'm unsure why the code is better in the end, perhaps because the 
may_alias attribute of __m128, perhaps because some builtins which are 
used to implement those intrinsics are mistyped (ie v4si 
__builtin_ia32_cmpltps (v4sf, v4sf))... i don't know, i'd need to try a 
builtin based specialization.


In any case that vector extension is now totally useless on x86 and 
conflicts with the documentation.


Re: recent troubles with float vectors bitwise ops

2007-08-23 Thread tbp

Andrew Pinski wrote:

> Which hardware (remember GCC is a generic compiler)?  VMX/Altivec and
> SPU actually does not have different instructions for bitwise
> and/ior/xor for different vector types (it is all the same
> instruction).  I have ran into ICEs with even bitwise on vector
> float/double on x86 also in the past which is the other reason why I
> disabled them.  Since this is an extension, it would be nice if it was
> nicely defined extension which means disabling them for vector
> float/double.

It *was* neatly defined:
"The types defined in this manner can be used with a subset of normal
C operations. Currently, GCC will allow using the following operators on
these types: +, -, *, /, unary minus, ^, |, &, ~."



So can you, pretty please, also patch the documentation and maybe point 
to the Altivec spec as it's obviously the only one relevant no matter 
what platform you're on?


Re: recent troubles with float vectors bitwise ops

2007-08-23 Thread tbp

Paolo Bonzini wrote:
> To some extent I agree with Andrew Pinski here.  Saying that you need
> support in a generic vector extension for vector float | vector float
> in order to generate ANDPS and not PXOR, is just wrong.  That should be
> done by the back-end.
I guess i fail to grasp the logic mandating that the intended source
level, strictly typed, 'vector float | vector float' should be mangled
into an int op with frantic casts to magically emerge out from the
backend as the original 'vector float | vector float', but i'm not a
compiler maintainer: for me it smells like a regression.


Re: recent troubles with float vectors bitwise ops

2007-08-23 Thread tbp

Paolo Bonzini wrote:
> Because it's *not* strictly typed.  Strict typing means that you accept
> the same things accepted for the element type.  So it's not a
> regression, it's a bug fix.

# cat regressionorbugfix.cc
typedef float v4sf_t __attribute__ ((__vector_size__ (16)));
typedef int v4si_t __attribute__ ((__vector_size__ (16)));
v4sf_t foo(v4sf_t a, v4sf_t b, v4sf_t c) {
return a + (b | c);
}
v4sf_t bar(v4sf_t a, v4sf_t b, v4sf_t c) {
return a + (v4sf_t) ((v4si_t) b | (v4si_t) c);
}
int main() { return 0; }

00400a30 <foo(float __vector, float __vector, float __vector)>:
  400a30:   orps   %xmm2,%xmm1
  400a33:   addps  %xmm1,%xmm0
  400a36:   retq

00400a40 <bar(float __vector, float __vector, float __vector)>:
  400a40:   por    %xmm2,%xmm1
  400a44:   addps  %xmm1,%xmm0
  400a47:   retq

I'm surely not qualified to argue about typing, but you'd need a rather 
strong distortion field to not characterize that as a regression.




Re: recent troubles with float vectors bitwise ops

2007-08-23 Thread tbp
On 8/23/07, Paolo Bonzini [EMAIL PROTECTED] wrote:
> I've added 5 minutes ago an XFAILed test for exactly this code.  OTOH, I
> have also committed a fix that will avoid producing tons of shuffle and
> unpacking instructions when function bar is compiled with -msse but
> without -msse2.
Thanks.

> I'm also going to file a missed optimization bug soon.
Ditto.

> I'm curious, does ICC support vector arithmetic like this? Do both
> functions compile? What code does it produce for bar?
No, icc9/10 only provide basic support for that extension (and then
only on linux i think)
# /opt/intel/cce/9.1.051/bin/icpc regressionorbugfix.cc
regressionorbugfix.cc(5): error: no operator | matches these operands
operand types are: v4sf_t | v4sf_t
return a + (b | c);
  ^

regressionorbugfix.cc(8): error: no operator | matches these operands
operand types are: v4si_t | v4si_t
return a + (v4sf_t) ((v4si_t) b | (v4si_t) c);
^

but then it's more aggressive about intrinsics than gcc.
Like i said somewhere i got slightly better results when using that
extension than intrinsics with gcc 4.3 but haven't checked if i could
get the same result with builtins yet.


Re: recent troubles with float vectors bitwise ops

2007-08-23 Thread tbp
On 8/23/07, Tim Prince [EMAIL PROTECTED] wrote:
> The primary icc/icl use of SSE/SSE2 masking operations, of course, is in
> the auto-vectorization of fabs[f] and conditional operations:
>
>   sum = 0.f;
>   i__2 = *n;
>   for (i__ = 1; i__ <= i__2; ++i__)
>       if (a[i__] > 0.f)
>           sum += a[i__];
>  (Windows/intel asm syntax)
>     pxor      xmm2, xmm2
>     cmpltps   xmm2, xmm3
>     andps     xmm3, xmm2
>     addps     xmm0, xmm3
> ...
Note that icc9 has a strong bias for pentium4, which had no stall
penalty for mistyped fp vectors as for Intel it came with the pentium
M line, so you see a pxor even if generating code for the core2.
# cat autoicc.cc
float foo(const float *a, int n) {
  float sum = 0.f;
  for (int i = 0; i < n; ++i)
    if (a[i] > 0.f)
      sum += a[i];
  return sum;
}
int main() { return 0; }
# /opt/intel/cce/9.1.051/bin/icpc -O3 -xT autoicc.cc
autoicc.cc(3) : (col. 2) remark: LOOP WAS VECTORIZED.
  4007a9:   pxor   %xmm4,%xmm4
  4007ad:   cmpltps %xmm3,%xmm4
  4007b1:   andps  %xmm3,%xmm4
# /opt/intel/cce/10.0.023/bin/icpc -O3 -xT autoicc.cc
autoicc.cc(3): (col. 2) remark: LOOP WAS VECTORIZED.
  400b50:   xorps  %xmm3,%xmm3
  400b53:   cmpltps %xmm4,%xmm3
  400b57:   andps  %xmm3,%xmm4


Re: recent troubles with float vectors bitwise ops

2007-08-22 Thread tbp
On 8/22/07, Paolo Bonzini [EMAIL PROTECTED] wrote:
> I think you're running too far with your sarcasm. SSE's instructions do
> not go so far as to specify integer vs. floating point.  To me, ps
> means 32-bit SIMD, independent of integerness.
Excuse me if i'm amazed at being told "bitwise ops on floating values
make no sense" as the justification for breaking something that used to
work and matched hardware features. I naively thought that was the
purpose of that convenient extension.

>> So, that's what i feared... it was intentional. And now i guess the only
>> sanctioned access to those ops is via builtins/intrinsics.
> No, you can do so with casts.  Floating-point to integer vector casts
> preserve the bit pattern.  For example, you can do
Again SIMD ops (among them bitwise stuff) come in 3 mostly symmetric
flavors on x86, namely for int, float and double; casting isn't
innocuous because there's a penalty for type mismatch (1 cycle of
re-categorization if i remember for both k8 and core2), so it's either
that or some moving around.

Let me cite "Intel(r) 64 and IA-32 Architectures Optimization
Reference Manual", 5-1:
"When writing SIMD code that works for both integer and floating-point
data, use the subset of SIMD convert instructions or load/store
instructions to ensure that the input operands in XMM registers contain
data types that are properly defined to match the instruction.
Code sequences containing cross-typed usage produce the same result across
different implementations but incur a significant performance penalty. Using
SSE/SSE2/SSE3/SSSE3 instructions to operate on type-mismatched SIMD data
in the XMM register is strongly discouraged."

You could find a similar note in AMD's doc for the k8.


recent troubles with float vectors bitwise ops

2007-08-21 Thread tbp

Hello,
# cat vecop.cc
template<typename T> T foo() {
T
a = { 0, 1, 2, 3 }, b = { 4, 5, 6, 7 },
c = a | b,
d = c & b,
e = d ^ b;
return e;
}
int main() {
typedef float v4sf_t __attribute__ ((__vector_size__ (16)));
typedef int v4si_t __attribute__ ((__vector_size__ (16)));
foo<v4si_t>();
foo<v4sf_t>();
return 0;
}
# /usr/local/gcc-4.3-svn.old5/bin/g++ -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr/local/gcc-4.3-svn 
--enable-languages=c,c++ --enable-threads=posix --disable-checking 
--disable-nls --disable-shared --disable-win32-registry 
--with-system-zlib --disable-multilib --verbose --with-gcc=gcc-4.2 
--with-gnu-ld --with-gnu-as --enable-checking=none --disable-bootstrap

Thread model: posix
gcc version 4.3.0 20070808 (experimental)
# /usr/local/gcc-4.3-svn.old5/bin/g++ vecop.cc
# /usr/local/gcc-4.3-svn.old6/bin/g++ -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr/local/gcc-4.3-svn 
--enable-languages=c,c++ --enable-threads=posix --disable-checking 
--disable-nls --disable-shared --disable-win32-registry 
--with-system-zlib --disable-multilib --verbose --with-gcc=gcc-4.2 
--with-gnu-ld --with-gnu-as --enable-checking=none --disable-bootstrap

Thread model: posix
gcc version 4.3.0 20070819 (experimental)
# /usr/local/gcc-4.3-svn.old6/bin/g++ vecop.cc
vecop.cc: In function 'T foo() [with T = float __vector__]':
vecop.cc:13:   instantiated from here
vecop.cc:4: error: invalid operands of types 'float __vector__' and 
'float __vector__' to binary 'operator|'
vecop.cc:5: error: invalid operands of types 'float __vector__' and 
'float __vector__' to binary 'operator&'
vecop.cc:6: error: invalid operands of types 'float __vector__' and 
'float __vector__' to binary 'operator^'


Apparently it's still there as of right now, on x86-64 at least. I think 
this is not supposed to happen but i'm not sure, hence the mail.


Re: recent troubles with float vectors bitwise ops

2007-08-21 Thread tbp

Ian Lance Taylor wrote:

What does it mean to do a bitwise-or of a floating point value?
Apparently enough for a small vendor like Intel to propose such things 
as orps, andps, andnps, and xorps.
So, that's what i feared... it was intentional. And now i guess the only 
sanctioned access to those ops is via builtins/intrinsics. Great.
If only i could get the same quality of code when using intrinsics to 
begin with...
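
For what a bitwise op on a float can mean in practice: and-ing away the sign bit gives an absolute value, and or-ing a sign bit back in gives a copysign, which is the kind of thing andps/orps get used for. A minimal scalar sketch in portable C++, going through the integer representation with memcpy. The helper names are made up for illustration, and it assumes 32-bit IEEE-754 floats; it says nothing about what g++ generates for the vector case.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit floats");

// Reinterpret a float as its 32-bit pattern and back (assumes IEEE-754 binary32).
static std::uint32_t to_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
static float from_bits(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// "and" with everything but the sign bit: absolute value (hypothetical helper name).
static float abs_by_and(float x) {
    return from_bits(to_bits(x) & 0x7fffffffu);
}

// "and" to isolate the sign of s, "or" to paste it onto |x|: copysign.
static float copysign_by_or(float x, float s) {
    return from_bits((to_bits(x) & 0x7fffffffu) | (to_bits(s) & 0x80000000u));
}
```

The vector instructions do the same thing on four lanes at once, without the round trip through an integer type.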




Re: g++ 4.3, troubles with C++ indexing idioms

2007-07-24 Thread tbp

On 7/24/07, Richard Guenther [EMAIL PROTECTED] wrote:

For performance small arrays should be the same as individual members
(I can see the annoying fact that initialization is a headache - this has
annoyed me as well).  For larger arrays (> 4 members), aliasing will
make a difference possibly, making the array variant slower.  Any union
variant is expected to be slower for aliasing reasons (we do not do
field-sensitive aliasing for unions).

Confirmed :)
And thanks for the clue about the threshold.


In the end I would still recommend to go with array variants.

I guess wishful thinking, or heresy, got me asking for a sanctioned
address-this-as-an-array idiom; now i'll go with the flow and use
those 2nd class citizens of C++, aka arrays, even if i'm a bit sceptical
about the performance equivalence (granted it isn't as obvious as it
used to be, i need to investigate some more).
But for sure it can't be as terrible as unions...


Re: g++ 4.3, troubles with C++ indexing idioms

2007-07-21 Thread tbp

On 7/19/07, Richard Guenther [EMAIL PROTECTED] wrote:

Of course, if any then the array indexing variant is fixed.  It would be nice
to see a complete testcase with a pessimization, maybe you can file
a bugreport about this?

There's many issues for all alternatives and i'm not qualified to
pinpoint them further.
I've taken http://ompf.org/ray/sphereflake/ which is used as a
benchmark already here
http://www.suse.de/~gcctest/c++bench/raytracer/, because it's small,
self contained and has such a basic 3 component class that's used all
over.
It doesn't use any kind of array access operator, but it's good enough
to show the price one has to pay before even thinking of providing
some. It has been adjusted to use floats and access members through
accessors (to allow for a straighter comparison of all cases).

variation 0 is the reference, a mere struct { float x,y,z; ...}; it
performs as well as the original, but wouldn't allow for any 'valid'
indexing.
variation 1 is struct { float f[3]; ... }
variations 2,3,4,5 try to use some union

# /usr/local/gcc-4.3-20070720/bin/g++ -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr/local/gcc-4.3-20070720
--enable-languages=c,c++ --enable-threads=posix --disable-checking
--disable-nls --disable-shared --disable-win32-registry
--with-system-zlib --disable-multilib --verbose --with-gcc=gcc-4.2
--with-gnu-ld --with-gnu-as --enable-checking=none --disable-bootstrap
Thread model: posix
gcc version 4.3.0 20070720 (experimental)
# make bench
[snip]
sf.v0

real    0m3.963s
user    0m3.812s
sys     0m0.152s
sf.v1

real    0m3.972s
user    0m3.864s
sys     0m0.104s
sf.v2

real    0m10.384s
user    0m10.261s
sys     0m0.120s
sf.v3

real    0m10.390s
user    0m10.289s
sys     0m0.104s
sf.v4

real    0m10.388s
user    0m10.265s
sys     0m0.124s
sf.v5

real    0m10.399s
user    0m10.281s
sys     0m0.116s

There's some inlining difference between the union variations and the
first two, but at roughly 2.6x slower the unions clearly stand in their
own league anyway.
So we can only seriously consider the first two.
Variation #0 would ask for invalid C++ (pointer arithmetic abuse, not
an option anymore) or forbidding the array access operator and going to
set/get + memcpy, but it's pretty optimal.
Variation #1 (straight array) is quite annoying in C++ (no initializer
list, need to reformulate all accesses etc...) and already shows some
slight pessimization, but it's not easy to track. Apparently g++ got a
bit better lately in this regard, or it's only blatant on larger data
or more complex cases.

I hope this shows how problematic it is for the end user.
// sphere flake bvh raytracer (c) 2005, thierry berger-perrin [EMAIL PROTECTED]
// this code is released under the GNU Public License.
// see http://ompf.org/ray/sphereflake/
// compile with ie g++ -O2 -ffast-math sphereflake.cc
// usage: ./sphereflake [lvl=6] >pix.ppm
#include <cmath>
#include <iostream>
#include <cstdlib>
#include <limits>
#define GIMME_SHADOWS

enum { childs = 9, ss= 2, ss_sqr = ss*ss }; /* not really tweakable anymore */
static const float infinity = std::numeric_limits<float>::infinity(), epsilon = 1e-4f;


#if VARIATION == 5
union v_t {
	// straight union; array left unharmed; just as horrible as the others.
	struct { float _x, _y, _z; };
	float f[3];
	v_t(const float a, const float b, const float c) : _x(a), _y(b), _z(c) {}
	float x() const { return _x; }
	float &x() { return _x; }
	float y() const { return _y; }
	float &y() { return _y; }
	float z() const { return _z; }
	float &z() { return _z; }
	
#else
struct v_t {
#endif
#if VARIATION == 0
	// best of the breed, but doesn't give way for an 'array access' operator.
	float _x, _y, _z;
	v_t(const float a, const float b, const float c) : _x(a), _y(b), _z(c) {}
	float x() const { return _x; }
	float &x() { return _x; }
	float y() const { return _y; }
	float &y() { return _y; }
	float z() const { return _z; }
	float &z() { return _z; }
#elif VARIATION == 1
	// not as good, obvious 'array access' but forbids initializer lists
	float f[3];
	v_t(const float a, const float b, const float c) { f[0] = a; f[1] = b; f[2] = c; }
	float x() const { return f[0]; }
	float &x() { return f[0]; }
	float y() const { return f[1]; }
	float &y() { return f[1]; }
	float z() const { return f[2]; }
	float &z() { return f[2]; }
#elif VARIATION == 2
	// Richard Guenther's suggestion, worst of the worst.
	union {
		struct { float x, y, z; } a;
		float b[3];
	} u;
	v_t(const float i, const float j, const float k) { u.a.x = i; u.a.y = j; u.a.z = k; }
	float x() const { return u.a.x; }
	float &x() { return u.a.x; }
	float y() const { return u.a.y; }
	float &y() { return u.a.y; }
	float z() const { return u.a.z; }
	float &z() { return u.a.z; }
 
#elif VARIATION == 3
	// slightly better than variation #2, but still terrible.
	union {
		struct { float _x, _y, _z; };
		float f[3];
	};
	v_t(const float a, const float b, const float c) : _x(a), _y(b), _z(c) 

g++ 4.3, troubles with C++ indexing idioms

2007-07-19 Thread tbp

I have that usual heavy duty 3 fp components class that needs to be
reasonably efficient and takes this form for g++
struct vec_t {
float x,y,z;
const float &operator()(const uint_t i) const { return *(&x + i); }
float &operator()(const uint_t i) { return *(&x + i); } // <-- guilty
[snip ctors, operators & related cruft]
};

I use this notation because g++ does silly things with straight arrays
(and C++ gets in the way), doesn't like
union vec_t {
struct { float x,y,z; };
float f[3];
const float &operator()(const uint_t i) const { return f[i]; }
float &operator()(const uint_t i) { return f[i]; }
};
much either, and seems to enjoy the first form (+ ctors with
initializer lists) much. So far, so good.

Alas, somewhere between gcc-4.3-20070608 (ok) and gcc-4.3-20070707
(not ok ever since), the non const indexing started to trigger bogus
codegen with some skipped stores on x86-64, but of course only in
convoluted situations. So, i can't produce a simple testcase. I can
kludge around either by:
. marking it __attribute__((noinline))
. turning it into a set operation doing a std::memcpy(&x + i, &f,
sizeof(float))
. annoying the optimizer with the entertaining
union vec_t {
struct { float x,y,z; };
float f[3];
const float &operator()(const uint_t i) const { return *(&x + i); }
float &operator()(const uint_t i) { return *(&x + i); }
};

At this point i'd need some guidance from compiler developers because
the compiler itself provides none (no warning whatsoever in any of
those variations) and what i thought was acceptable apparently isn't
anymore.
What kind of idiom am i supposed to write such thing in to get back
efficient and correct code?
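
For reference, one way to index named members without doing arithmetic on the address of x is a small table of pointers to members. This is a minimal sketch and not something proposed in this thread; the helper name `member` is made up, and whether the optimizer produces code as good as the pointer arithmetic form is exactly the open question, so no claim is made about that here.

```cpp
struct vec_t {
    float x, y, z;

    // Table of pointers to members; indexing goes through it instead of
    // stepping a float* past x, so every access names a real member.
    // No bounds check: i is expected to be 0, 1 or 2.
    static float vec_t::* member(unsigned i) {
        static float vec_t::* const table[3] = { &vec_t::x, &vec_t::y, &vec_t::z };
        return table[i];
    }

    const float &operator()(unsigned i) const { return this->*member(i); }
    float &operator()(unsigned i) { return this->*member(i); }
};
```

The struct stays an aggregate, so `vec_t v = { 1.0f, 2.0f, 3.0f };` still works, and `v(1) = 5.0f;` writes to `v.y`.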


Re: g++ 4.3, troubles with C++ indexing idioms

2007-07-19 Thread tbp

On 7/19/07, Richard Guenther [EMAIL PROTECTED] wrote:

Well, I always used the array variant, but you should be able to do

[snip]

if you need to (why does the array form not work for you?)

Because if you bench those variations (struct { float x,y,z; }, struct
{ float f[3]; } and some additional union layer) in some non trivial
program, on x86/x86-64 at least, the last 2 consistently come out
slower. In the array case addressing seems to be the main issue
(redundant scaling etc...); for the union variant, it's less clear but
it seems it prohibits some copy/return value optimizations.
Plus gcc apparently likes (well, used to) the *(&x + i) idiom very
much; all in all i had something to work with.

Now i'm seeing *some* stores indexed in this way vanish, array
addressing is still as bad as it was, unions still get me some
pessimization and using the memcpy idiom asks me to give up on the
idea of an array access operator altogether.

So i'm asking: which one is going to be fixed in the foreseeable future?


Re: g++ 4.3, troubles with C++ indexing idioms

2007-07-19 Thread tbp

On 7/19/07, Richard Guenther [EMAIL PROTECTED] wrote:

Of course, if any then the array indexing variant is fixed.  It would be nice
to see a complete testcase with a pessimization, maybe you can file
a bugreport about this?

By essence they're hard to trigger in small testcases (that's not
where they matter anyway), and by my own previous experience large
hairy bug reports get forgotten on the side of the road.
But i'll see if i can make up something convincing, provided i got the
cause for the relative slowdown right.


Re: g++ 4.3, troubles with C++ indexing idioms

2007-07-19 Thread tbp

On 7/19/07, Dave Korn [EMAIL PROTECTED] wrote:

  Bogus codegen is the inevitable result of bogus code.  Garbage in, garbage
out.

  BTW, the const indexing is completely undefined too.

That's the kind of answer i'd get from gcc-help and at that point i'd
be none the wiser because i already know that. I also know that up to
gcc-4.3-20070608 it was provably giving correct results faster than
any other variant. Being no language lawyer, that's the only metric i
consider.
It's no portability issue either because every compiler asks for a
specific workaround, which is quite sad considering how mundane that
code is.


Re: Activate -mrecip with -ffast-math?

2007-06-18 Thread tbp

On 6/18/07, Richard Guenther [EMAIL PROTECTED] wrote:

No, that's not the contract with -ffast-math.  Note that -ffast-math
enables -funsafe-math-optimizations which is allowed to change results
(add/remove rounding operations, contract expressions, do transforms
like a/b to a * 1/b, do transformations that get you bigger errors than
0.5ulp, etc.)

I can't expect a division by a constant to survive -ffast-math
unscathed, but then that's a change in precision and manageable.
Being returned a NaN i'm not supposed to see for a common case,
depending on some transformation, is something else entirely.


 But if i can't expect a mere division by 0, or sqrt of 0 (quite common
 with FTZ/DAZ on) to give me respectively an infinite and 0 and instead
 get a NaN (which i can't filter, you remember?) because of the NR
 round, that's pure madness.

Hm, which particular case are you concerned about (maybe it was mentioned,
but I don't remember the details)?  Note that -ffast-math enables
-ffinite-math-only as well, so the compiler assumes nothing will result in
NaNs or Infs.

Yes and that's why it's such a pain to handle them correctly while in
-ffast-math. But if i generate some, then i get what i've asked for
(and i'm in for a local fix). Fair enough. I'm not going to give up ie
fast & robust SSE ray/aabb slab tests (or ray/plane or...) because of
some arbitrary rule; the hardware handles it just fine (yes there's a
penalty, but then it's way faster than branching).

For example, when doing 1/x and sqrt(x) via reciprocal + NR, you first
get an inf from said reciprocal, which then turns into a NaN in the NR
stage unless you correct it by, say, doing a comparison to 0 and an
'and'.
That's what ICC used to do behind your back. That's what you'll find on
page 151 of the amdfam10 optimization manual. Because that's a common case.
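
A minimal scalar sketch of that failure and of the compare + 'and' fix, in portable C++. It is an illustration under assumptions: `rsqrt_estimate` is a stand-in for the hardware estimate (like rsqrtss it returns +inf for 0, but it is exact rather than approximate), all function names are made up, and it has to be built without -ffast-math for the NaN to be observable.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Stand-in for the hardware reciprocal square root estimate: +inf for x == 0.
static float rsqrt_estimate(float x) { return 1.0f / std::sqrt(x); }

// One Newton-Raphson refinement step: y' = 0.5 * y * (3 - x * y * y).
// For x == 0, y is +inf and x * y is 0 * inf, i.e. NaN.
static float rsqrt_nr(float x) {
    const float y = rsqrt_estimate(x);
    return 0.5f * y * (3.0f - x * y * y);
}

// sqrt(x) as x * rsqrt(x), with no special casing: NaN for x == 0.
static float sqrt_naive(float x) { return x * rsqrt_nr(x); }

// Same, but the result is and-ed with a mask built from the comparison
// x != 0: all ones keeps the result, all zeros forces +0.0f.
static float sqrt_masked(float x) {
    const float r = x * rsqrt_nr(x);
    std::uint32_t bits;
    std::memcpy(&bits, &r, sizeof bits);
    bits &= (x != 0.0f) ? 0xffffffffu : 0u;
    float out;
    std::memcpy(&out, &bits, sizeof out);
    return out;
}
```

With SSE the mask would come from a packed compare and be applied with andps, so the fix costs two cheap instructions and no branch.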

As far as i can see, there's no such provision in the current patch.
At the very least provide a means to look after those NaNs without
losing sanity, like a way to enforce argument order of
min/max[ss|ps|pd] without resorting to inline asm.


Well - certainly another reason for the Math BOF ;)  We all expect very
different things from -ffast-math or -funsafe-math-optimizations.

You mean fast & unsafe?
I think there's quite a margin between letting someone shoot himself in
the foot and putting a gun to his head.


Re: Activate -mrecip with -ffast-math?

2007-06-18 Thread tbp

On 6/18/07, Giovanni Bajo [EMAIL PROTECTED] wrote:

I understand your problems, but let me state that your objections are
totally subjective. *You* need a specific behaviour from -ffast-math
(eg: keep NaN/Inf), but that's not what *I* need. So, we have different
goals.

No. My NaNs are my problem. Those generated by gcc aren't.
At the very least provide a canonical (efficient) way to filter them
(ie SSE min/max).


Re: Activate -mrecip with -ffast-math?

2007-06-18 Thread tbp

On 6/18/07, Uros Bizjak [EMAIL PROTECTED] wrote:

IMO, due to the limited range of operands for the -mrecip pass (inf, -inf),
where 0.0 is excluded, it should be kept out of -ffast-math. There is
no point in fixing reciprocals only for 0.0, we need to fix both
conversions for infinity and 0.0, even in -ffast-math.

Indeed there are holes in every direction when you pull in such
transformations, and the cost of plugging every one of them would be
prohibitive; the next batch of c2d supposedly will leave you with ~6
cycles to make it worthwhile for a sqrt.
Of course it only gets worse when you start composing.

My point merely was that, considering one operation, you'd introduce
a NaN for a not so special value (0) which, in a *fast* math scenario,
could be produced at any previous stage due to denormal clamping, with
no sane way to take care of it.
Again, if you look at prior art (icc, AMD's manual...), that's the
only special case they covered.
Admittedly that's a trade off, but not that unreasonable.

Now, an option to remove such transformations from -ffast-math
bag-o-tricks would be fine and would still buy gcc some Spec bragging
rights :)


Re: Activate -mrecip with -ffast-math?

2007-06-18 Thread tbp

On 6/18/07, Richard Guenther [EMAIL PROTECTED] wrote:

Of course there are cases with every optimization enabled by -ffast-math that
can break existing programs.  Just that we know of one case beforehand shouldn't
prevent us from enabling -mrecip at -ffast-math (provided -mno-recip
still works,
regardless if provided before or after -ffast-math).  [We'll at least
get some more
testing coverage this way]

Argh! Please do not make -ffast-math even more of a pain to work with
than it is already.
You have to enable it, on the whole compilation unit, to get anywhere
near decent performance; there's no escape: either you do not turn it
on and everything slows to a crawl, or you pay for not being able to
inline from another unit.

Until now, the contract was: you have to deal with (and contain) NaN
and infinities. Fair enough, even if tricky that remained manageable.
But if i can't expect a mere division by 0, or sqrt of 0 (quite common
with FTZ/DAZ on) to give me respectively an infinite and 0 and instead
get a NaN (which i can't filter, you remember?) because of the NR
round, that's pure madness.

So please, for the love of everything sacred, leave such stunts out
of -ffast-math.

PS: and it's not like such reciprocals + NR couldn't be done with
intrinsics, or couldn't easily handle such a common case.


x86-64 -mcx16, picky __sync_val_compare_and_swap?

2007-04-02 Thread tbp

While doing (or trying to) some cleanup thanks to -mcx16, i've been a
bit surprised that
-- cut --
typedef int TItype __attribute__ ((mode (TI)));
TItype m_128;

void test(TItype x_128)
{
m_128 = __sync_val_compare_and_swap (&m_128, x_128, m_128);
}

#include <xmmintrin.h>
typedef __m128i foo_t;
//typedef TItype foo_t;
foo_t foo;

void test2(foo_t x_128)
{
foo = __sync_val_compare_and_swap (&foo, x_128, foo);
}

int main() { return 0; }
-- cut --

# /usr/local/gcc-4.3-20070323/bin/gcc -O2 -mcx16 xchg16.c -o xchg16
xchg16.c: In function 'test2':
xchg16.c:16: error: incompatible type for argument 1 of
'__sync_val_compare_and_swap'

# /usr/local/gcc-4.3-20070323/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr/local/gcc-4.3-20070323
--enable-languages=c++ --enable-threads=posix --with-system-zlib
--enable-__cxa_atexit --disable-checking --disable-nls
--disable-multilib --enable-bootstrap --with-gcc --with-gnu-as
--with-gnu-ld
Thread model: posix
gcc version 4.3.0 20070323 (experimental)

Am i just wrong believing that ought to work?


Re: x86-64 -mcx16, picky __sync_val_compare_and_swap?

2007-04-02 Thread tbp

On 4/2/07, Richard Henderson [EMAIL PROTECTED] wrote:

On Mon, Apr 02, 2007 at 04:23:21PM +0200, tbp wrote:
 Am i just wrong believing that ought to work?

Yes.

It's hard to argue with a terse compiler or maintainer. Perhaps i
should have picked an easier target like
http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html: "GCC will
allow any integral scalar or pointer type that is 1, 2, 4 or 8 bytes
in length", which that TItype from the testsuite testcase is not.

In any case thanks for the clarification.
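
For comparison, a minimal sketch of the builtin used on a type the documentation quoted above does cover, an 8 byte integer. The function name `cas_increment` is made up for illustration; the only call it relies on is `__sync_val_compare_and_swap(ptr, oldval, newval)`, which returns the value found in `*ptr` before the operation.

```cpp
#include <cstdint>

// Atomically increment *p with a compare-and-swap retry loop and return
// the value that was there before the increment.
std::uint64_t cas_increment(std::uint64_t *p) {
    std::uint64_t seen = *p;  // first guess at the current value
    for (;;) {
        // Store seen + 1 only if *p still holds seen; the builtin returns
        // whatever *p held at that point.
        const std::uint64_t prev = __sync_val_compare_and_swap(p, seen, seen + 1);
        if (prev == seen)
            return prev;  // our store went through
        seen = prev;      // somebody else got there first, retry with their value
    }
}
```

Note the first argument is a pointer to an integral type; a vector type such as __m128i is neither integral nor a pointer, whatever its size.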


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-29 Thread tbp

On 1/29/07, Mark Mitchell [EMAIL PROTECTED] wrote:

It doesn't need to be a small testcase.  If you have a preprocessed
source file and a command-line, I'm sure one of the GCC developers would
be able to analyze the situation.  We're all good at isolating problems,
even starting with big complicated inputs.

This is now known as PR 30627
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30627

PS: Thanks to Vladimir for his input.


remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread tbp

Let it be clear from the start this is a potshot and while those
trends aren't exactly new or specific to my code, i haven't tried to
provide anything but specific data from one of my app, on
win32/cygwin.

Primo, gcc getting much better wrt inlining exacerbates the fact that
it's not as good as other compilers at shrinking the stack frame size,
and perhaps, as was suggested by Uros when discussing that point, a pass
to address that would make sense.
As i'm too lazy to properly measure cruft across multiple compilers,
i'll use my rtrt app where i mostly control large scale inlining by
hand.
objdump -wdrfC --no-show-raw-insn $1|perl -pe 's/^\s+\w+:\s+//'|perl
-ne 'printf "%4d\n", hex($1) if /sub\s+\$(0x\w+),%esp/'|sort -r| head
-n 10

msvc: 2196 2100 1772 1692 1688 1444 1428 1312 1308 1160
icc:  2412 2280 2172 2044 1928 1848 1820 1588 1428 1396
gcc:  2604 2596 2412 2076 2028 1932 1900 1756 1720 1132

That's with msvc8 sp1, icc 9.1.033, g++ 4.3-20070119, each compiler
being configured to optimize as much as possible for speed. That
confirms what i see when checking codegen for specific functions.

Secundo, while i very much appreciate the brand new string ops, it
seems that on ia32 some array initialization cases were left out,
hence i still see oodles of 'movl $0x0' when generating code for k8.
Also those zeroings get coalesced at the top of functions on ia32, and
i have a function where there are 3 pages of those right after the prologue.
See the attached grep 'movl   $0x0' dump.


movl0.S.bz2
Description: BZip2 compressed data


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread tbp

On 1/28/07, Richard Guenther [EMAIL PROTECTED] wrote:

On 1/28/07, tbp [EMAIL PROTECTED] wrote:
 objdump -wdrfC --no-show-raw-insn $1|perl -pe 's/^\s+\w+:\s+//'|perl
 -ne 'printf "%4d\n", hex($1) if /sub\s+\$(0x\w+),%esp/'|sort -r| head
 -n 10

 msvc:2196 2100 1772 1692 1688 1444 1428 1312 1308 1160
 icc: 2412 2280 2172 2044 1928 1848 1820 1588 1428 1396
 gcc: 2604 2596 2412 2076 2028 1932 1900 1756 1720 1132

It would have been nice to tell us what the particular columns in
this table mean - now we have to decrypt objdump params and
perl postprocessing ourselves.

I should have known better than to post on a sunday morning. Sorry.
That's the sorted 10 largest stack allocations in binaries produced by
each compiler (presuming most everything falls in place).
Each time i verify codegen for a function across all 3, gcc always has
the largest frame by a substantial amount (on ia32). And that's what
that rigorous table is trying to demonstrate ;)
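
To spell out what that pipeline computes, here is a minimal C++ sketch of the same extraction: scan disassembly text for `sub $0x...,%esp`, convert the immediate from hex and keep the n largest. The function name `largest_frames` is made up, and like the one-liner it assumes every frame is set up with a single `sub` from %esp in AT&T syntax.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Return the n largest stack allocations found in objdump-style text,
// largest first.
std::vector<unsigned long> largest_frames(const std::string &disasm, std::size_t n = 10) {
    // Same pattern as the perl filter: sub $0x<hex>,%esp
    static const std::regex sub_esp(R"(sub\s+\$(0x[0-9a-fA-F]+),%esp)");
    std::vector<unsigned long> sizes;
    std::istringstream in(disasm);
    std::string line;
    std::smatch m;
    while (std::getline(in, line))
        if (std::regex_search(line, m, sub_esp))
            sizes.push_back(std::stoul(m[1].str(), nullptr, 16));  // hex($1)
    std::sort(sizes.begin(), sizes.end(), std::greater<unsigned long>());
    if (sizes.size() > n)
        sizes.resize(n);  // head -n 10
    return sizes;
}
```

Feeding it the output of `objdump -d` for each binary gives one row of the table above.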

Basically i'm wondering if a stack frame shrinking pass [ ] is
possible, [ ] makes no sense, [ ] has been done, [ ] is planned etc...


(If you are interested in stack size related to inlining you may want
to tune --param large-stack-frame and --param large-stack-frame-growth).

Recently g++ 4.3 has started to complain about "warning: inlining
failed in call to 'xxx': --param large-stack-frame-growth limit
reached [-Winline]". Bumping said large-function-growth by an ungodly
amount did the trick. But it was the sure sign inlining was being
fixed.
There's much less need to babysit it, thanks a lot to whoever wrote
those patches.


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread tbp

On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:

Actually we do have one stack frame shrinking pass already.  It depends
on where the bloat is coming from - we can pack (with some limitations)
memory used by structures/arrays used by different inline functions or
lexical blocks.  We don't do any packing of spilled registers nor the
shrink wrapping other compilers sometimes implement.

Ah. So there's already some shrinkage.
I don't think i can blame spilling for all that waste, but then i also
have no idea what that shrink wrapping involves.
Also i think it's only a bit worse with C++ where some idioms appear
to cause more trouble.

It would be nice to have a cheat sheet of dos and don'ts :)

It seems my previous obese mail got axed a bit,
http://ompf.org/vault/frontend.ii.bz2
http://ompf.org/vault/rt_render_packet.ii.bz2


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread tbp

On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:

Also having some testcases showing inlining defects in GCC would be
very interesting for me.  Now after IPA-SSA has been merged, I plan to
do some retuning of the inliner for the 4.3 release since a lot has changed
about the properties of its input and it was originally designed to operate
well on the IL used by early tree-ssa.

Gcc, well g++ really, used to be so bad at the inlining game, ie
single op functions/ctors suddenly left out, that there was no other
option than to explicitly direct inlining if one cared about
performance. So i don't have much to show, for what i monitored wasn't
under g++'s jurisdiction.
Now i know it has improved (much) because obviously other parts are
being stressed.


Considering information about stack frame size in the inlining costs is
one of the things I believe we should do but it is also difficult to tune
without interesting testcases for it.

I have no idea what would make such testcase interesting to you.
But i can try.
You'll find 2 preprocessed GPLed sources attached with

frontend.cc, app::frontend_loop()
(i don't particularly care about that function, but on ia32 - x86-64
is immune - g++ is quite creative about it (large frame, oodles of
upfront zeroing, even if it's a bit better with the gcc-4.3-20070119
snapshot))
frame size, msvc 1152 bytes, icc 2108, g++ 2604

rt_render_packet.cc, horde::grunt_render_tiles_packet(...)
(this one i care about, inlining is controlled)
frame size, msvc 1688, icc 1804, gcc 1932
Performance wise on that one msvc lags by 25% and gcc has a slight
lead of a couple percent on icc.

note: take 2, http://ompf.org/vault/frontend.ii.bz2
http://ompf.org/vault/rt_render_packet.ii.bz2


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread tbp

On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:

I am not quite sure what you mean by direct inlining here.  At -O2 G++

Decorating everything in sight with attribute always_inline/noinline
(flatten wasn't an option because it used to be troublesome and not as
'portable' across compilers).
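
For readers wondering what that decoration looks like, a minimal sketch follows. The macro and function names are made up for illustration; the only compiler features assumed are GCC's `always_inline` and `noinline` attributes and their MSVC counterparts `__forceinline` and `__declspec(noinline)`.

```cpp
// Hypothetical portability macros for directing inlining by hand.
#if defined(__GNUC__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#  define NO_INLINE __attribute__((noinline))
#elif defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#  define NO_INLINE __declspec(noinline)
#else
#  define FORCE_INLINE inline
#  define NO_INLINE
#endif

// Single op helper: the kind of function that must never be left out of line.
FORCE_INLINE float madd(float a, float b, float c) { return a * b + c; }

// Deliberately kept out of line so it doesn't bloat the caller.
NO_INLINE float slow_path(float x) { return x < 0.0f ? 0.0f : x; }

float hot_loop(const float *v, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc = madd(v[i], 2.0f, acc) + slow_path(v[i] - 1.0f);
    return acc;
}
```

With everything tagged one way or the other, the compiler's own inlining heuristic has little left to decide, which is the point made above.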


I would be interested to know about obvious mistakes GCC makes - GCC now
has logic to set the cost of inlining wrapper functions (ie functions
doing just one extra call and casts) to at most 0. It might be
interesting to know if some common scenarios are missed.

I guess i should remove those attributes and see what it looks like.


Well, we are working on it ;)
You can take a look at the c++ benchmarks at http://www.suse.de/~gcctest; the
work has been ongoing since cgraph was implemented in 2003, another retuning
happened at about the 4.0 timeframe, and 4.3 has the SSA based IPA that should be
another improvement.

I'm aware of that progression and some of my code is already being tested
http://www.suse.de/~gcctest/c++bench/raytracer/ ;)

4.2 made a substantial difference for me, and it seems 4.3 is well on
its way (even if it's a bit chaotic at times); IPA when enabled used
to ICE on me and recently started to work, but i've failed to notice a
difference (efficiency wise) yet. I guess i should wait a bit more.

I very much appreciate the string op stuff, and i'm eagerly waiting
for the assume() directive (wink wink).


Thanks, what is definitely most interesting for me is a self contained
testcase I can easily compile and run, like we have tramp3d. I will
definitely take a look at your testcases, but perhaps only after returning
from a trip next weekend since I am running out of time for all my
TODOs today ;)

It's still very much in flux, but once it stabilizes a bit i'll dump
everything into a self contained black box of doom.


Concerning the frame sizes, we really need some kind of analysis of
where it is coming from - ie whether GCC simply inlines too much together, or
fails to pack the structures well using the existing algorithm, or it is
a register pressure problem.

I'm out of my league. I know the frontend_loop function isn't as
horrible on x86-64, giving some credit to the register pressure
hypothesis, but then that code isn't doing anything fancy.

For the other function, which heavily uses SSE vector intrinsics, g++
is really doing a good job, if only for the, sometimes, duplicated
structures here & there and the larger frame. But you can rule out
g++'s inlining heuristic as it has no (or shouldn't have) any freedom.

If there's anything i can do, do not hesitate.
And thanks for taking notice.


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread tbp

On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:

BTW when inlining seems to make so noticeable difference, did you try to
use profile feedback?

Once a year, i try.
But then it boils down to the fact that as a programmer i have no way
to express how/where i want gcc to put its nose into. And i get back
to fixing branches, inlining and unrolling (wink) by hand.


 I'm aware of that progression and some of my code is already being tested
 http://www.suse.de/~gcctest/c++bench/raytracer/ ;)

I see, we didn't seem to make that much progress on this testcase
performance wise yet ;)

It's a silly 100 LOC raytracer and historically g++ already did the
Right Thing[tm] (inlining everything), there's not much left to be
gained.


 For the other function, which heavily uses SSE vector intrinsics, g++
 is really doing a good job, if only for the, sometimes, duplicated
 structures here & there and the larger frame. But you can rule out
 g++'s inlining heuristic as it has no (or shouldn't have) any freedom.

Hmm, so then it should be either structure packing or regalloc. I will
be able to take a look only after returning from a course.
Honza

Regalloc is a lost cause on ia32 :)
Note that nowadays g++ is up to the point where despite those wastes,
it's still faster to inline it all in one rendering function than
splitting. And i think you can also put gcse on the culprit list.


build failure, gcc-4.3-20070126 snapshot, cygwin

2007-01-27 Thread tbp

[ -f stage_final ] || echo stage3 > stage_final
make[1]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo'
make[2]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo'
make[3]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo'
rm -f stage_current
make[3]: Leaving directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo'
make[2]: Leaving directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo'
make[2]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo'
make[3]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/libiberty'
make[4]: Entering directory
`/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/libiberty/testsuite'
make[4]: Nothing to be done for `all'.
make[4]: Leaving directory
`/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/libiberty/testsuite'
make[3]: Leaving directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/libiberty'
make[3]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/intl'
make[3]: Nothing to be done for `all'.
make[3]: Leaving directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/intl'
make[3]: Entering directory
`/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/build-i686-pc-cygwin/libiberty'
make[4]: Entering directory
`/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/build-i686-pc-cygwin/libiberty/testsuite'
make[4]: Nothing to be done for `all'.
make[4]: Leaving directory
`/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/build-i686-pc-cygwin/libiberty/testsuite'
make[3]: Leaving directory
`/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/build-i686-pc-cygwin/libiberty'
make[3]: Entering directory
`/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/build-i686-pc-cygwin/fixincludes'
make[3]: Nothing to be done for `all'.
make[3]: Leaving directory
`/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/build-i686-pc-cygwin/fixincludes'
make[3]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/libcpp'
test -f config.h || (rm -f stamp-h1 && make stamp-h1)
make[3]: Leaving directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/libcpp'
make[3]: Entering directory
`/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/libdecnumber'
make[3]: Nothing to be done for `all'.
make[3]: Leaving directory
`/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/libdecnumber'
make[3]: Entering directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/gcc'
make[3]: Leaving directory `/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/gcc'
Checking multilib configuration for libgcc...
make[3]: Entering directory
`/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/i686-pc-cygwin/libgcc'
# If this is the top-level multilib, build all the other
# multilibs.
make[4]: Entering directory
`/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/i686-pc-cygwin/libgcc'
if [ -z  ]; then \
  true; \
else \
  rootpre=`${PWDCMD-pwd}`/; export rootpre; \
  srcrootpre=`cd ../../../libgcc; ${PWDCMD-pwd}`/; export srcrootpre; \
  lib=`echo ${rootpre} | sed -e 's,^.*/\([^/][^/]*\)/$,\1,'`; \
  compiler=/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/./gcc/xgcc
-B/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/./gcc/
-B/usr/local/gcc-4.3-20070126/i686-pc-cygwin/bin/
-B/usr/local/gcc-4.3-20070126/i686-pc-cygwin/lib/ -isystem
/usr/local/gcc-4.3-20070126/i686-pc-cygwin/include -isystem
/usr/local/gcc-4.3-20070126/i686-pc-cygwin/sys-include; \
  for i in `${compiler} --print-multi-lib 2>/dev/null`; do \
dir=`echo $i | sed -e 's/;.*$//'`; \
if [ ${dir} = . ]; then \
  true; \
else \
  if [ -d ../${dir}/${lib} ]; then \
flags=`echo $i | sed -e 's/^[^;]*;//' -e 's/@/ -/g'`; \
if (cd ../${dir}/${lib}; make AR=ar AR_FLAGS=rc
CC=/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/./gcc/xgcc
-B/cygdrive/d/src/gcc/gcc-4.3-20070126/yo/./gcc/
-B/usr/local/gcc-4.3-20070126/i686-pc-cygwin/bin/
-B/usr/local/gcc-4.3-20070126/i686-pc-cygwin/lib/ -isystem
/usr/local/gcc-4.3-20070126/i686-pc-cygwin/include -isystem
/usr/local/gcc-4.3-20070126/i686-pc-cygwin/sys-include CFLAGS=-g
-fkeep-inline-functions DESTDIR= EXTRA_OFILES= HDEFINES=
INSTALL=/usr/bin/install -c INSTALL_DATA=/usr/bin/install -c -m
644 INSTALL_PROGRAM=/usr/bin/install -c LDFLAGS= LOADLIBES=
RANLIB=ranlib SHELL=/bin/sh prefix=/usr/local/gcc-4.3-20070126
exec_prefix=/usr/local/gcc-4.3-20070126
libdir=/usr/local/gcc-4.3-20070126/lib
libsubdir=/usr/local/gcc-4.3-20070126/lib/gcc/i686-pc-cygwin/4.3.0
tooldir=/usr/local/gcc-4.3-20070126/i686-pc-cygwin \
CFLAGS=-g -fkeep-inline-functions ${flags} \
CCASFLAGS= ${flags} \
FCFLAGS= ${flags} \
FFLAGS= ${flags} \
ADAFLAGS= ${flags} \
prefix=/usr/local/gcc-4.3-20070126 \
exec_prefix=/usr/local/gcc-4.3-20070126 \
GCJFLAGS= ${flags} \
CXXFLAGS=-g -O2  ${flags} \
LIBCFLAGS=-g -fkeep-inline-functions ${flags} 

Re: fancy x87 ops, SSE and -mfpmath=sse,387 performance

2006-08-07 Thread tbp

On 8/6/06, Paolo Bonzini [EMAIL PROTECTED] wrote:

 Is there a way to enable such exotic codegen for 32bit environments?

With libgcc-math you didn't have exotic instructions, but you had
transcendental operations compiled with -mfpmath=sse and with a special
ABI.  -mfpmath=sse won about 8% over -mfpmath=387 for tramp3d, which
does have transcendental operations.

Let's see what happens for 4.3.

I'm not sure i grokked the fuss about libgcc-math.
What i know is that -mfpmath=sse in recent gcc does wonders, just like
SSE implementations of such library calls as i can experience them in
a sane environment like linux x86-64. But it's truly horrible in
cygwin and off the mark by an order of magnitude.

My complaint is that atm the only stopgap on such platform is to
resort to -mfpmath=sse,387 which is not without drawbacks.

I understand -march=k8 -mfpmath=sse -mfancy-math-387 is out of
question, but could you clarify what i should expect from 4.3?


fancy x87 ops, SSE and -mfpmath=sse,387 performance

2006-08-06 Thread tbp

Basically i'd like to have the cake and also eat it.

With g++-4.2-20060805/cygwin on a k8 box on some software path with
lots of sp float ops but no transcendentals or library calls
-mfpmath=sse,387: 5.2 Mray/s
-mfpmath=sse: 6 Mray/s
That 15% performance difference is no surprise when you see things like
 4037c8:   flds   0x4(%esp)
 4037cc:   mulss  %xmm5,%xmm2
 4037d0:   fsubrp %st,%st(1)
 4037d2:   movss  %xmm1,0x4(%esp)
 4037d8:   addss  0x278(%esp,%ecx,4),%xmm0
 4037e1:   flds   0x4(%esp)
 4037e5:   fsubrp %st,%st(1)
 4037e7:   addss  %xmm2,%xmm0
 4037eb:   movss  %xmm0,0x4(%esp)
 4037f1:   flds   0x4(%esp)
 4037f5:   fdivrp %st,%st(1)
 4037f7:   fcomi  %st(1),%st
 4037f9:   fldz
 4037fb:   setae  %dl
 4037fe:   fcomip %st(1),%st
 403800:   seta   %al
 403803:   or     %al,%dl
 403805:   je     4036ca

Therefore -mfpmath=sse is the way to go and is in fact on par or
better than what i get out of icc 9.1 for the same code.
Where it gets ugly is when, for example, you throw some cosf() into
the same compilation unit: with -mfpmath=sse you then pay for some
really, really slow library function calls (at least on cygwin).
Wishful thinking got me trying -march=k8 -mfpmath=sse
-mfancy-math-387, to no avail :(
Is there a way to enable such exotic codegen for 32bit environments?


g++ 4.2, cygwin, NUMA awareness issue

2006-07-31 Thread tbp

As i don't know which party (g++, stdc++, cygwin) to put the blame on
i'll start here.
I've traced back a weird performance issue to a 'new' returning non
cpu-local memory but only when the binary is launched from the
shell/console. That suggests some crt friction.
(threads where those allocations happen are properly bound to one cpu)

That's on xp sp2, on a bi-k8 box with
Using built-in specs.
Target: i686-pc-cygwin
Configured with: ../configure --prefix=/usr/local/gcc-4.2-20060624
--enable-languages=c,c++ --enable-threads=posix --with-system-zlib
--disable-checking --disable-nls --disable-shared
--disable-win32-registry --verbose --enable-bootstrap --with-gcc
--with-gnu-ld --with-gnu-as --with-cpu=k8
Thread model: posix
gcc version 4.2.0 20060624 (experimental)

Does that ring a bell or shall i move along the chain? :)


Re: Optimize flag breaks code on many versions of gcc (not all)

2006-06-19 Thread tbp

On 6/19/06, Richard Guenther [EMAIL PROTECTED] wrote:

Using -mfpmath=sse -msse2 is a workaround if you have a processor that
supports SSE2 instructions.  As opposed to -ffloat-store, it works
reliably and with no performance impact.

Such a slab test can be turned into a branchless sequence of SSE
min/max, even for filtering infinities around dir ~= 0; it's much
simpler and more efficient to intersect 4 rays against one box at once
though.
Without intrinsics a NaN-oblivious version would be like:

static float minf(const float a, const float b) { return (a < b) ? a : b; }
static float maxf(const float a, const float b) { return (a > b) ? a : b; }

bool_t intersect_ray_box(const aabb_t &box, const rt::mono::ray_t &ray,
                         float &lmin, float &lmax)
{
    float
        l1 = (box.min.x - ray.pos.x) * ray.inv_dir.x,
        l2 = (box.max.x - ray.pos.x) * ray.inv_dir.x;
    lmin = minf(l1,l2);
    lmax = maxf(l1,l2);

    l1 = (box.min.y - ray.pos.y) * ray.inv_dir.y;
    l2 = (box.max.y - ray.pos.y) * ray.inv_dir.y;
    lmin = maxf(minf(l1,l2), lmin);
    lmax = minf(maxf(l1,l2), lmax);

    l1 = (box.min.z - ray.pos.z) * ray.inv_dir.z;
    l2 = (box.max.z - ray.pos.z) * ray.inv_dir.z;
    lmin = maxf(minf(l1,l2), lmin);
    lmax = minf(maxf(l1,l2), lmax);

    return (lmax >= lmin) & (lmax >= 0.f);
}
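
For anyone who wants to try the slab test in isolation, here is a minimal self-contained sketch. The vec3/aabb_t/ray_t layouts and the make_ray() helper are assumptions made up for this example (the real rt::mono::ray_t isn't shown in the mail); only the min/max sequence follows the routine above.

```cpp
// Stand-in types; field names mirror the ones used in the mail.
struct vec3   { float x, y, z; };
struct aabb_t { vec3 min, max; };
struct ray_t  { vec3 pos, inv_dir; };   // inv_dir = 1/direction, per component

static inline float minf(float a, float b) { return (a < b) ? a : b; }
static inline float maxf(float a, float b) { return (a > b) ? a : b; }

// Hypothetical helper: precompute the reciprocal direction once per ray.
// A zero component yields +inf under IEEE arithmetic, which the slab test
// tolerates as long as no 0 * inf (NaN) shows up.
static ray_t make_ray(vec3 pos, vec3 dir)
{
    ray_t r = { pos, { 1.f / dir.x, 1.f / dir.y, 1.f / dir.z } };
    return r;
}

// Slab test: intersect the per-axis [near, far] parametric intervals.
static bool intersect_ray_box(const aabb_t &box, const ray_t &ray,
                              float &lmin, float &lmax)
{
    float l1 = (box.min.x - ray.pos.x) * ray.inv_dir.x;
    float l2 = (box.max.x - ray.pos.x) * ray.inv_dir.x;
    lmin = minf(l1, l2);
    lmax = maxf(l1, l2);

    l1 = (box.min.y - ray.pos.y) * ray.inv_dir.y;
    l2 = (box.max.y - ray.pos.y) * ray.inv_dir.y;
    lmin = maxf(minf(l1, l2), lmin);
    lmax = minf(maxf(l1, l2), lmax);

    l1 = (box.min.z - ray.pos.z) * ray.inv_dir.z;
    l2 = (box.max.z - ray.pos.z) * ray.inv_dir.z;
    lmin = maxf(minf(l1, l2), lmin);
    lmax = minf(maxf(l1, l2), lmax);

    // Bitwise & keeps the final test free of a short-circuit branch.
    return (lmax >= lmin) & (lmax >= 0.f);
}
```

With a unit box and a ray starting at (-1,-1,-1) along (1,1,1), the call reports a hit with lmin = 1 and lmax = 2; a ray starting past the box and pointing away ends up with a negative lmax and is rejected.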


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread tbp
On 3/13/06, Paolo Bonzini [EMAIL PROTECTED] wrote:
 Wait wait.  PR/21195 is about inlining the SSE builtins.
No. PR/21195 was really about inline heuristic going ballistic.
Those intrinsics are thin wrappers around builtins, and ultimately
resolve to a couple of operations. Typical C++ (accessors/ctors) also
presents lots of such small functions.
And guess what, same cause same symptom.

There's no sensible metric by which code i've quoted in previous mail
makes sense. Size? Nope. Execution time? Certainly not.

Again whether or not SSE ops are involved was and is still irrelevant.

 Your case seems to be different, because it involves inlining user
 routines.  Again, you need to give us the preprocessed source code for
 us to look at your bug effectively.
Thanks for the tip, but i'll pass. I've done my duty already.
Months ago there were 2 options for fixing PR/21195:
a) Fix the inlining heuristic.
b) Kludge all intrinsics with always_inline.

I've tried to argue a bit but to no avail. So, while you remain
convinced everything's fine with the inliner, i'll keep tagging every
function in my code with always_inline/noinline where performance
matters.
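
For readers unfamiliar with that kind of tagging, a minimal sketch of what it looks like with the GCC attribute syntax. The FORCE_INLINE/NO_INLINE macro names and the vec2 type are made up for this example; always_inline and noinline are the actual GCC attributes.

```cpp
// GCC attribute spellings wrapped in (hypothetical) convenience macros.
#define FORCE_INLINE inline __attribute__((always_inline))
#define NO_INLINE    __attribute__((noinline))

struct vec2 {
    float x, y;
    // Trivial ctor and accessor: the kind of tiny function one expects
    // the inliner to handle without help.
    FORCE_INLINE vec2(float x_, float y_) : x(x_), y(y_) {}
    FORCE_INLINE float dot(const vec2 &o) const { return x * o.x + y * o.y; }
};

// Thin wrapper, forced inline.
FORCE_INLINE float length_sq(const vec2 &v) { return v.dot(v); }

// Rarely taken path kept out of line on purpose.
NO_INLINE float slow_path(float v) { return v * 0.5f; }

float hot_loop(const vec2 *pts, int n)
{
    float acc = 0.f;
    for (int i = 0; i < n; ++i)
        acc += length_sq(pts[i]);
    return acc > 100.f ? slow_path(acc) : acc;
}
```

The obvious downside, as argued in this thread, is that every such decision is then made by hand instead of by the heuristic.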


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread tbp
On 3/13/06, Richard Guenther [EMAIL PROTECTED] wrote:
 Starting with gcc 4.1.0 we have inline heuristics in place that will
 _always_ inline such simple wrappers.  So, if this still happens, there
 is a bug in the heuristics and that should be reported.  Before 4.1.0
 the heuristics were bogus and wrappers were not inlined all the time.
 So, can you verify you are happy with the heuristics in 4.1.0
No i'm not, and i've used a pristine 4.1.0 in
http://gcc.gnu.org/ml/gcc/2006-03/msg00410.html
I haven't tried that particular testcase on 4.2.x, but some weeks ago
i had to go thru all my code again to put always_inline in some
forgotten places because i was seeing even empty ctors not being
inlined (to the effect of having a call to a ret). So in this regard,
4.1.0 & 4.2.x still exhibit that kind of behaviour.

It seems to trigger when some particular threshold is met, either for
a function or unit, then nothing at all gets inlined but functions
tagged with always_inline; of course major performance regression
ensues.


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread tbp
On 3/13/06, Richard Guenther [EMAIL PROTECTED] wrote:
 Of course from 4.1.0 on you can easier stick an
 __attribute__((flatten)) on the function you want everything inlined to
 (finalblow) and get everything inlined into it.
But that's not really what i'm after: i expect trivial functions to
get inlined no matter what at a given -Ox.

 With always_inline on it, the wrappers are no longer inlined - this is
 a bug and should be reported.
 Can you report a bugzilla for the bad interaction between always_inline
 and inlining of simple wrappers?
I will report it again then.
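
For reference, a minimal sketch of the flatten attribute mentioned in the quote. The add/mul/madd/poly functions are made up for this example; flatten itself is the GCC attribute, which asks the compiler to inline, where possible, every call made inside the tagged function.

```cpp
// Small wrappers one would normally expect to be inlined anyway.
static float add(float a, float b) { return a + b; }
static float mul(float a, float b) { return a * b; }
static float madd(float a, float b, float c) { return add(mul(a, b), c); }

// With flatten, calls to madd/add/mul inside poly() are inlined into it
// if the compiler is able to; the wrappers themselves stay untagged.
__attribute__((flatten)) float poly(float x)
{
    // Horner form of 2*x*x + 3*x + 4
    return madd(madd(2.f, x, 3.f), x, 4.f);
}
```

That works from the caller's side only, which is why it doesn't address the complaint above: trivial functions would still not be inlined elsewhere unless every caller is tagged.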


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread tbp
On 3/13/06, Richard Guenther [EMAIL PROTECTED] wrote:
 I see the bug and will have a fix in a moment.
You made my day. Or you're about to. Unless you're lying and i'll have
to curse you for 7 generations.


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread tbp
On 3/13/06, Richard Guenther [EMAIL PROTECTED] wrote:
 http://gcc.gnu.org/ml/gcc-patches/2006-03/msg00739.html
/me ventilates.
You're my hero.


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread tbp
On 3/13/06, tbp [EMAIL PROTECTED] wrote:
 On 3/13/06, Richard Guenther [EMAIL PROTECTED] wrote:
  http://gcc.gnu.org/ml/gcc-patches/2006-03/msg00739.html
 /me ventilates.
 You're my hero.
A double+ hero on top of that.
http://gcc.gnu.org/ml/gcc-patches/2006-03/msg00737.html
I think i've hit that one too; reported here:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26650

Well, i can always dream.


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread tbp
On 3/13/06, Richard Guenther [EMAIL PROTECTED] wrote:
 I don't think this is related, and a quick check with the patch shows
 still unaligned
 moves to the stack.
Patience is a virtue i guess :)
Are there good chances your inlining fix will hit mainline soon?


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread tbp
On 3/12/06, Steven Bosscher [EMAIL PROTECTED] wrote:
  Yes, why is the benchmark not valid?

 It is valid.  We should understand why this behavior has changed so
 drastically.
This benchmark may be useless, but it still exposes a weakness of gcc4.
At least it's not news to me:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21195

So that PR was closed when gcc-devs marked all those intrinsics
as force_inline. That's also the kludge i use with my code. The real
problem is that once you start marking some functions as force_inline,
you upset the inlining heuristic even more, creating even more silly
inlining misses; rinse, repeat.
At the end of the day, everything is marked either force_inline or
noinline and you'd be better off without a heuristic at all.


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread tbp
On 3/13/06, Andrew Pinski [EMAIL PROTECTED] wrote:
 Actually the best way of improving the inline heuristics is to get
 a real testcase (and not some benchmark) where  the inline heuristics
 is messed up.
Ah, you mean a brand new testcase because PR-21195 wasn't good enough?

$ /usr/local/gcc-4.1.0/bin/g++ -v
Using built-in specs.
Target: i686-pc-cygwin
Configured with: ../configure --prefix=/usr/local/gcc-4.1.0
--enable-languages=c,c++ --enable-threads=posix --with-system-zlib
--disable-checking --disable-nls --disable-shared
--disable-win32-registry --verbose --enable-bootstrap --with-gcc
--with-gnu-ld --with-gnu-as --with-cpu=k8
Thread model: posix
gcc version 4.1.0

/usr/local/gcc-4.1.0/bin/g++ -g -O3 -march=k8 -msse2   -o pr-inline.o
pr-inline.cc

#include <xmmintrin.h>

static __m128 mm_max_ps(const __m128 a, const __m128 b) { return _mm_max_ps(a,b); }
static __m128 mm_min_ps(const __m128 a, const __m128 b) { return _mm_min_ps(a,b); }
static __m128 mm_mul_ps(const __m128 a, const __m128 b) { return _mm_mul_ps(a,b); }
static __m128 mm_div_ps(const __m128 a, const __m128 b) { return _mm_div_ps(a,b); }
static __m128 mm_or_ps(const __m128 a, const __m128 b) { return _mm_or_ps(a,b); }
static int mm_movemask_ps(const __m128 a) { return _mm_movemask_ps(a); }

static __attribute__ ((always_inline)) bool bloatit(const __m128 a, const __m128 b) {
    const __m128
        v0 = mm_max_ps(a,b),
        v1 = mm_min_ps(a,b),
        v2 = mm_mul_ps(a,b),
        v3 = mm_div_ps(a,b),
        g0 = mm_or_ps(_mm_or_ps(_mm_or_ps(v0,v1), v2), v3),
        v4 = mm_min_ps(mm_or_ps(a,b),mm_div_ps(b,a)),
        v5 = mm_max_ps(mm_min_ps(a,mm_div_ps(b,a)),
                       mm_or_ps(b, mm_div_ps(b,g0))),
        g1 = mm_or_ps(g0,mm_or_ps(v4,v5));
    return mm_movemask_ps(g1);
}

bool finalblow(const __m128 a, const __m128 b, const __m128 c, const __m128 d,
               const __m128 e, const __m128 f) {
    return
        bloatit(a,b) && bloatit(c,d) && bloatit(e,f) && bloatit(a,c) &&
        bloatit(b,d) && bloatit(c,e) && bloatit(d,f) &&
        bloatit(b,a) && bloatit(d,c) && bloatit(f,e) && bloatit(c,a) &&
        bloatit(d,b) && bloatit(e,c) && bloatit(f,d);
}

int main() { return 0; }

00401080 <mm_mul_ps(float __vector, float __vector)>:
  401080:   push   %ebp
  401081:   mulps  %xmm1,%xmm0
  401084:   mov    %esp,%ebp
  401086:   sub    $0x8,%esp
  401089:   leave
  40108a:   ret
  40108b:   nop
  40108c:   lea    0x0(%esi),%esi

00401090 <mm_or_ps(float __vector, float __vector)>:
  401090:   push   %ebp
  401091:   orps   %xmm1,%xmm0
  401094:   mov    %esp,%ebp
  401096:   sub    $0x8,%esp
  401099:   leave
  40109a:   ret
  40109b:   nop
  40109c:   lea    0x0(%esi),%esi

004010a0 <mm_div_ps(float __vector, float __vector)>:
  4010a0:   divps  %xmm1,%xmm0
  4010a3:   push   %ebp
  4010a4:   mov    %esp,%ebp
  4010a6:   sub    $0x8,%esp
  4010a9:   leave
  4010aa:   ret
  4010ab:   nop

...
004010e0 <finalblow(float __vector, float __vector, float __vector, float __vector, float __vector, float __vector)>:
...
  401101:   call   4010c0 <mm_max_ps(float __vector, float __vector)>
  401106:   movaps %xmm0,0xf958(%ebp)
  40110d:   movaps 0xf8f8(%ebp),%xmm1
  401114:   movaps 0xf908(%ebp),%xmm0
  40111b:   call   4010b0 <mm_min_ps(float __vector, float __vector)>
  401120:   movaps 0xf8f8(%ebp),%xmm1
  401127:   movaps %xmm0,0xf948(%ebp)
  40112e:   movaps 0xf908(%ebp),%xmm0
  401135:   call   401080 <mm_mul_ps(float __vector, float __vector)>
  40113a:   movaps 0xf8f8(%ebp),%xmm1
  401141:   movaps %xmm0,0xf938(%ebp)
  401148:   movaps 0xf908(%ebp),%xmm0
  40114f:   call   4010a0 <mm_div_ps(float __vector, float __vector)>
  401154:   movaps 0xf958(%ebp),%xmm1
  40115b:   orps   0xf948(%ebp),%xmm1
  401162:   movaps %xmm1,0xf958(%ebp)
  401169:   movaps %xmm0,%xmm1
  40116c:   movaps 0xf958(%ebp),%xmm0
  401173:   orps   0xf938(%ebp),%xmm0
  40117a:   call   401090 <mm_or_ps(float __vector, float __vector)>
  40117f:   movaps 0xf908(%ebp),%xmm1
  401186:   movaps %xmm0,0xf928(%ebp)
  40118d:   movaps 0xf8f8(%ebp),%xmm0
  401194:   call   4010a0 <mm_div_ps(float __vector, float __vector)>
  401199:   movaps 0xf8f8(%ebp),%xmm1


g++ 4.1.0/4.2.x, x86/x86-64, segfaults due to bogus SSE alignments

2006-03-10 Thread tbp
This bug is really transient, and AFAIK i only trigger it when using
the cluebat on g++, that is bloating every function in sight
appropriately with always_inline/noinline attributes, in a unit that
inflates much.

Tracked one occurrence to something like that:
union float4_t {
float   f[4];
__m128  v;
...
};

static void foobar() {
float4_t __attribute__((aligned (16))) bar;
...
__m128 foo;
...
bar = foo;
}

If i let g++ decide if foobar() should be inlined or not, everything's
fine (but performance of course). Then if i force_inline foobar() i
may or may not get something to the effect of:
  40666a:   movaps %xmm0,0x348(%esp)
  406672:   mov    0x348(%esp),%eax
  406679:   mov    %eax,0x310(%esp)
  406680:   mov    0x34c(%esp),%eax
  406687:   movaps 0x210(%esp),%xmm0
  40668f:   mov    %eax,0x314(%esp)
  406696:   mov    0x350(%esp),%eax
  40669d:   movaps %xmm0,0x40(%esp)
  4066a2:   mov    %eax,0x318(%esp)
  4066a9:   mov    0x354(%esp),%eax

Why that value gets suddenly copied around, i don't know. It doesn't
matter much anyway, as the program won't survive past the bogus store.
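
To spell out why that store is fatal: movaps faults unless its memory operand sits on a 16-byte boundary, and 0x348 is 8 modulo 16, so the slot is only usable if %esp itself is 8 modulo 16 at that point. A small sketch of the check follows. It is an illustration only: the __m128 member is swapped for an aligned float array so it builds without SSE headers, and is_aligned16/slot_ok/local_is_aligned are names made up here.

```cpp
#include <cstdint>

// Stand-in for the float4_t union above; the aligned float array plays
// the role of the __m128 member.
union float4_t {
    float f[4];
    float v[4] __attribute__((aligned(16)));
};

// True if p sits on a 16-byte boundary, i.e. movaps would be legal on it.
static bool is_aligned16(const void *p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Whether a stack slot at offset(%esp) is 16-byte aligned for a given %esp.
static bool slot_ok(std::uintptr_t esp, std::uintptr_t offset)
{
    return ((esp + offset) & 15u) == 0;
}

// What a correct compiler must guarantee for the local in foobar().
bool local_is_aligned()
{
    float4_t __attribute__((aligned(16))) bar;
    bar.f[0] = 1.f;
    return is_aligned16(&bar);
}
```

Dropping a check like is_aligned16() next to the suspect variable is a cheap way to tell a misplaced stack slot from a misaligned stack pointer.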

It's not just related to that kind of mixed unions either, and again
it clearly depends on surrounding functions being force_inlined and
noinlined and lots of stuff ending up on the stack.

I can trigger it on cygwin and linux, with g++ 4.1.0 and various 4.2.x
and once triggered using -Os or -Ox doesn't matter; it's been there
for a long time but that's the first time i can track it down somehow
(inlining heuristics being extremely unpredictable anyway).

I haven't made a bugreport yet, as that would require disclosing large
amounts of code, but i'd like to know if it's a known issue by any
chance.

Regards,
tbp.


Re: g++ 4.1.0/4.2.x, x86/x86-64, segfaults due to bogus SSE alignments

2006-03-10 Thread tbp
On 3/11/06, Daniel Jacobowitz [EMAIL PROTECTED] wrote:
 Unlikely, since you haven't described at all what the problem is.
 That's why we prefer bug reports with testcases.
...segfaults due to bogus SSE alignments
40666a:   movaps %xmm0,0x348(%esp)


g++ 4.2.x and (auto) inlining

2006-02-25 Thread tbp
Hello,

i've just experienced a 40%+ run-time performance drop that, in the end,
was due to g++ refusing to auto-inline trivial ctors and the like in a
cramped unit (featuring noinline and forced inlines). That's not the
first time i've met that snafu, but what kinda surprises me is the fact
that i've recently removed more code than added (granted, that doesn't
mean much) and that unit was already being compiled with: --param
inline-unit-growth=1 --param max-inline-insns-recursive=1.

I had to bump both by an order of magnitude to get things flying
again. Even if it works, i'm a bit worried that in some not too
distant future i may run out of digits.
That was with gcc version 4.2.0 20060204 and i was wondering if
semi-recently, g++ behaviour regarding auto inlining had been tweaked
or something.

In any case, if there's a better stopgap that doesn't imply
explicitly force inlining everything in sight, i'd like to know. Or
if there's something in the works.

Regards,
tbp.


Re: x86-64, I definitely can't make sense out of that

2006-02-04 Thread tbp
On 2/4/06, Andrew Pinski [EMAIL PROTECTED] wrote:
 Dale Johannesen and I came up with a patch to the C++ front-end
 for this except it did not work with some C++ cases.
Ah, so i'm not totally inane.
Is there a PR i can track for this one?


x86-64, I definitely can't make sense out of that

2006-02-03 Thread tbp
As i couldn't understand why g++ insisted on spitting movq $0, stack
only to rewrite the same place a few cycles later (with a different
width), i've made a testcase and now 20mn later i'm even more puzzled.

#include <xmmintrin.h>
#include <stdio.h>

struct dir_t { __m128 x,y,z; };

int creative_codegen(const struct dir_t *dir) {
const int
sx = _mm_movemask_ps(dir->x), sy = _mm_movemask_ps(dir->y), sz =
_mm_movemask_ps(dir->z),
signs_all[4] = { !(sx  0), !(sy  0), !(sz  0),  0 },
coherent = (((sx == 0) | (sx == 15)) & ((sy == 0) | (sy == 15)) &
((sz == 0) | (sz == 15)));

if (coherent) { int i; for (i=0; i<4; ++i) printf("%d",signs_all[i]); }
return coherent;
}

int main(int argc, void **argv) { return creative_codegen((struct
dir_t*)argv); }

with g++ -O2 (4.0.3, 4.2.0 20060121)
[...]
  40056d:   movq   $0x0,0x10(%rsp) # ?
  400576:   movq   $0x0,0x18(%rsp) # ??
[...]
  40058b:   movq   $0x0,(%rsp) # ???
  400593:   movq   $0x0,0x8(%rsp) # ok
[...]
  40059f:   mov    %eax,0x10(%rsp) # ok
[...]
  4005b1:   mov    %eax,0x14(%rsp) # ok
[...]
  4005c4:   mov    %eax,0x18(%rsp) # ok

If compiled with gcc, there's no such preliminary movq.

So the question is, what is so obviously flying way over my head?


Re: x86-64 linux, gomp ICE in trunk

2006-01-26 Thread tbp
On 1/25/06, Diego Novillo [EMAIL PROTECTED] wrote:
 You'll need to do what this message suggests.  http://gcc.gnu.org/bugzilla/
Sorry for the lag.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=25983


x86-64 linux, gomp ICE in trunk

2006-01-24 Thread tbp
Hello,

I wanted to play a bit with OpenMP after fighting a (long) while to
get a 4.2 snapshot compiled on my debian64 box... alas...

fresh svn checkout, multilib disabled because it's a no go on my box.
# /usr/local/gomp/bin/g++ -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr/local/gomp
--enable-languages=c++ --enable-threads=posix --with-system-zlib
--enable-__cxa_atexit --disable-multilib --enable-bootstrap --with-gcc
--with-gnu-as --with-gnu-ld
Thread model: posix
gcc version 4.2.0 20060124 (experimental)

testcase:
int toto() {
int a=0;
#pragma omp single
{
for (int i=0; i<10; ++i)
a += i;
}
return a;
}
int main() { return toto(); }

/usr/local/gomp/bin/g++ -fopenmp main.cc -o omp
main.cc: In function 'int toto()':
main.cc:5: internal compiler error: in cp_parser_pragma, at cp/parser.c:17629
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:http://gcc.gnu.org/bugs.html> for instructions.

Command line options or the precise omp pragma used don't really
matter, i get a crash on any valid omp directive; gcc-4.2-20060121
is ICE happy the same way.
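
For reference, the more usual shape of that construct, with the single block nested inside a parallel region. This is only a sketch of what a working compiler should accept; when built without -fopenmp the pragmas are simply ignored and the function runs serially.

```cpp
int toto()
{
    int a = 0;
    // A team of threads is created here (only when built with -fopenmp).
    #pragma omp parallel shared(a)
    {
        // Exactly one thread of the team runs the block; the others wait
        // at the implicit barrier that ends the single construct.
        #pragma omp single
        {
            for (int i = 0; i < 10; ++i)
                a += i;
        }
    }
    return a;   // sum of 0..9
}
```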

As a side note, while trying to get the compiler built with some debug
info, i've hit a case where it couldn't find libgomp.spec once installed
(a --disable-shared configuration).

If there's a workaround that would make my day :)


Re: x86-64 linux, gomp ICE in trunk

2006-01-24 Thread tbp
On 1/25/06, Richard Henderson [EMAIL PROTECTED] wrote:
 c++ gomp is not merged to mainline.
Indeed, that makes up for a solid reason not to work.

Should i hold my breath?


Re: x86-64 linux, gomp ICE in trunk

2006-01-24 Thread tbp
On 1/25/06, Diego Novillo [EMAIL PROTECTED] wrote:
 A couple more weeks, or you can try the gomp branch.
Thanks, will do.
Hopefully i won't fall for the ICE trick that easily next time.


Re: x86-64 linux, gomp ICE in trunk

2006-01-24 Thread tbp
On 1/25/06, Diego Novillo [EMAIL PROTECTED] wrote:
 Well, the compiler still shouldn't ICE.  I'll send a fix shortly.
I know i've exhausted my pseudo-ICE quota for the day, but i have
another candidate knocking at the door with insistence:
src/raytrace_packet.cpp: In member function 'void rt::raytracer_t::prender()':
src/raytrace_packet.cpp:1411: internal compiler error: Segmentation fault
Please submit a full bug report

# /usr/local/gomp/bin/g++ -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr/local/gomp
--enable-languages=c++ --enable-threads=posix --with-system-zlib
--enable-__cxa_atexit --disable-multilib --enable-bootstrap --with-gcc
--with-gnu-as --with-gnu-ld
Thread model: posix
gcc version 4.2.0-gomp-20050608-branch 20060119 (experimental) (merged 20060119)

While i'm sure i'm terribly wrong one way or another, i'd appreciate
some pointers.


Re: Constant propagation and address arithmetic

2005-05-09 Thread tbp
On 5/8/05, Steven Bosscher [EMAIL PROTECTED] wrote:
 Hi,
Hello,

 I have looked at the GCSE CPROP passes with CSE path following
 disabled (-O1 -fgcse --param max-cse-path-length=1).  The input
 code are the cc1-i files from 20040726 (with checking enabled).
While that discussion flies way above my head, it seems to be about
gcse and i have enough grievance about it to jump in.

I've just pinged PR19680 (because it's still there) and just for the
sake of it i've tried the newly reported PR21463 with -fno-gcse and
it's quite interesting.

as reported, with gcse:
00400610 <foo_t<float>::bar_ref(float, float)>:
  400610:   ucomiss 0x4(%rdi),%xmm1
  400614:   lea    0x4(%rdi),%rax
  400618:   lea    0xfff8(%rsp),%rdx
  40061d:   movss  %xmm0,0xfffc(%rsp)
  400623:   movss  %xmm1,0xfff8(%rsp)
  400629:   movaps %xmm1,%xmm2
  40062c:   cmova  %rdx,%rax
  400630:   movss  (%rax),%xmm1
  400634:   ucomiss %xmm1,%xmm0
  400637:   ja     400641 <foo_t<float>::bar_ref(float, float)+0x31>
  400639:   lea    0xfffc(%rsp),%rax
  40063e:   movaps %xmm0,%xmm1
  400641:   ucomiss (%rdi),%xmm2
  400644:   cmova  %rdi,%rdx
  400648:   movss  (%rdx),%xmm0
  40064c:   ucomiss %xmm0,%xmm1
  40064f:   jbe    400655 <foo_t<float>::bar_ref(float, float)+0x45>
  400651:   movss  (%rax),%xmm0
  400655:   repz retq

without:
00400610 <foo_t<float>::bar_ref(float, float)>:
  400610:   movss  %xmm0,0xfffc(%rsp)
  400616:   lea    0xfff8(%rsp),%rcx
  40061b:   lea    0x4(%rdi),%rax
  40061f:   movss  %xmm1,0xfff8(%rsp)
  400625:   lea    0xfffc(%rsp),%rdx
  40062a:   ucomiss 0x4(%rdi),%xmm1
  40062e:   cmova  %rcx,%rax
  400632:   ucomiss (%rax),%xmm0
  400635:   cmovbe %rdx,%rax
  400639:   ucomiss (%rdi),%xmm1
  40063c:   movss  (%rax),%xmm0
  400640:   cmovbe %rcx,%rdi
  400644:   ucomiss (%rdi),%xmm0
  400647:   cmova  %rax,%rdi
  40064b:   movss  (%rdi),%xmm0
  40064f:   retq

Again, sorry for hijacking that thread, but gcse is a convenient
scapegoat for most of my performance/codegen problems and i'd like to
know if there's mid-term hope.

Regards,
Thierry.


unexpected speedup from gcc-4.1-20050508

2005-05-09 Thread tbp
Hello,
after setting up the latest snapshot, i was caught off guard as all my
numbers were off (and usually it's better than a swiss clock).
So, i've double checked, stripped some cruft from compiler command
line and pitted various snapshots (20050410, 20050424, 20050501) vs
20050508 in my app.

Now i can say without doubt that on x86-64 linux, on a k8, i reliably
get between 3% (rendering path, mostly vectorized SSE) and 5%
(kd-tree compiler, branchy & memory heavy code) performance boost.
Without touching a single line of code.

I don't know, yet, who's the unsung hero i should thank or what he/she
did, or if that result can be correlated in any other benchmark, but
that won't stop me from sending my warmest kudos his/her way.

Feel free to fill in the blanks :)

Regards.


Re: GCC 4.0, Fast Math, and Acovea

2005-05-03 Thread tbp
On 5/3/05, Scott Robert Ladd [EMAIL PROTECTED] wrote:
 tbp wrote:
 Granted, POV-Ray may not be state-of-the-art, but then, I know quite a
 few people who say that (even legitimately) about just about every
 software product in existence.
True. Still, POV has evolved from dkbtrace and it shows sometimes.

 If you have a suggestion for better benchmarks, I'm listening. Is your
 ray tracer available?
It's way too rough for general consumption yet, and quite specialized
anyway (very large geometry).

With specific kludges for each compiler, here's the hierarchy for the
hand vectorized rendering:
ia32:   icc8.1, gcc4.1 (-5% at least), msvc2k3 (-20%)
x86-64: gcc4.1, icc9.0 (-7% at least)
It varies a bit, depending on features being hammered by specific
scenes, but the order is unchanged (note that the x86-64 version has
only been tested on k8 so far).

GCC shows an edge in the SAH kdtree compiler part (branchy code) on
x86-64, with a 40% improvement over the ia32 versions (and icc9.1
which definitely gets lost).
That's more than welcome, given the time it takes to produce those
freaking trees :)

Anecdotally, gcc is the only one to get the parsing of large memory
mapped files right (or put another way, the idiom used), being 2x
faster than every other compiler on every platform.


Re: GCC 4.0, Fast Math, and Acovea

2005-05-02 Thread tbp
On 5/2/05, Scott Robert Ladd [EMAIL PROTECTED] wrote:
 You might want to a look at my just-published review of GCC 4.0, where I
 compare its performance on some well-known applications, including LAME
 and POV-Ray, on Pentium 4 and Opteron. In terms of POV-Ray, 4.0 produced
 a smaller executable that was slightly slower than did 3.4.3. You can
 find the full review at:
While POV has an impressive array of features and is quite valuable as
a large FP intensive legacy standard for compiler writers (or
raytracer writers :), i wouldn't consider it state of the art or a
speed demon either; to put it bluntly it's incredibly slow.

For those reasons i consider it's not representative at all of the kind
of computational performance gcc can extract from a modern CPU:
again, in my own experience, gcc4.x is light years away from previous
versions.

Now i'm not familiar enough with the other cited sources to comment.


Re: GCC 4.0, Fast Math, and Acovea

2005-04-30 Thread tbp
On 4/29/05, Uros Bizjak [EMAIL PROTECTED] wrote:
 Hello Scott!
Hello Scott & Uros,
 
  Specifically, the -funsafe-math-optimizations flag doesn't work
  correctly on AMD64 because the default on that platform is
  -mfpmath=sse. Without specifying -mfpmath=387,
  -funsafe-math-optimizations does not generate inline processor
  instructions for most floating-point functions.
[snip]
 It was found that moving data from SSE registers to X87 registers (and
 back) only to call an x87 builtin degrades performance. Because of this,
 x87 builtins are disabled for -mfpmath=sse and a normal libcall is
 issued for sin(), etc functions. If someone wants to use x87 builtins,
 then _all_ math operations should be done in x87 registers to avoid
 costly SSE-x87 moves.

Shameless plug with my own performance analysis regarding SSE on x86-64.
I've ported my coherent raytracer which mostly uses intrinsics in the
hot path (and no transcendentals).
While gcc4.x compiled binaries are ~5% slower than those compiled with
icc8.1 on ia32 (best case), it's the other way around on x86-64 if not
more (on my opteron with icc8.1 and beta 9.0).
Obviously there's much less pressure on the (cough weak cough)
register allocator and in the end the generated code is way leaner.

My only gripe with fast-math is that it's the only way to enable some
optimizations while making NaNs verboten; couple that with the lack
of cross-unit IPO and you're stuck with a kind of nasty global
switch (unless you have room for some function calls).
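
To make the NaN point concrete, a small sketch of how a plain compare-and-select min treats NaNs versus a NaN-aware one. The function names are made up for this example. Note that -ffast-math implies -ffinite-math-only, under which the compiler is allowed to assume NaNs never occur, so the isnan() tests below may be optimised away in a unit built with that flag.

```cpp
#include <cmath>

// Plain select: any comparison involving a NaN is false, so the second
// operand is returned whenever either input is a NaN.
static inline float minf(float a, float b) { return (a < b) ? a : b; }

// NaN-aware variant: a NaN operand is ignored when the other is a number.
// Only meaningful in code compiled WITHOUT -ffast-math/-ffinite-math-only.
static inline float minf_nan_aware(float a, float b)
{
    if (std::isnan(a)) return b;
    if (std::isnan(b)) return a;
    return (a < b) ? a : b;
}
```

So minf(NaN, 1) gives 1 but minf(1, NaN) gives NaN: the result depends on operand order, which is exactly the kind of behaviour one has to design around when NaNs can show up.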


Re: gcc4, static array, SSE alignement

2005-04-06 Thread tbp
On Apr 6, 2005 3:18 AM, James E Wilson [EMAIL PROTECTED] wrote:
 I would guess a limitation of cygwin binutils support, or perhaps of
 Windows itself.
Binutils, perhaps. Windows certainly not, as msvc2k3 & icc8.1 don't
have such an issue with the same code.

 This seems to work fine on linux.  If I compile a simple example using
 __alignof__, I see that the compiler is assuming 16-byte alignment.  If
 I compile with -S, I see that the compiler is giving them 32-byte
 alignment (probably for better cache alignment).  If I run objdump -x on
 the a.out file, I see that .bss section has 2**5 (32-byte) alignment.
 All is as it should be.
Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         0003e754  00401000  00401000  00000400  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, CODE, DATA
  1 .data         00004634  00440000  00440000  0003ec00  2**4
                  CONTENTS, ALLOC, LOAD, DATA
  2 .rdata        00004884  00445000  00445000  00043400  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  3 .bss          00008fc0  0044a000  0044a000  00000000  2**4
                  ALLOC
  4 .idata        00001984  00453000  00453000  00047e00  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  5 .stab         00169908  00455000  00455000  00049800  2**2
                  CONTENTS, READONLY, DEBUGGING, NEVER_LOAD, EXCLUDE
  6 .stabstr      001c39e1  005bf000  005bf000  001b3200  2**0
                  CONTENTS, READONLY, DEBUGGING, NEVER_LOAD, EXCLUDE

Gcc & the toolchain used to have lots of issues wrt alignment, but i
thought they were a thing of the past.
As far as i can see, everything is fine regarding section alignments.

 A real bug report, as per
  http://gcc.gnu.org/bugs.html
 would be useful here, so we can check.  In particular, a testcase to
 reproduce the problem, and exactly what you believe is wrong with the
 output.
Yep, but i was testing the waters.
I mean i have lots of other 16-byte aligned data (static, extern,
const or not and whatnot) in there and the only problematic one is
this semi large static.
So, because that's the largest, i thought i'd crossed some magic size
threshold.

I'll try to pinpoint the problem a bit better.
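
For what it's worth, a minimal sketch of the kind of runtime check that
could help pinpoint such a problem. The names big_table and is_aligned and
the array size are hypothetical, not the original code, and the sketch uses
C++11 alignas where code of this era would rather use gcc's
__attribute__((aligned(16))).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the semi-large static array; the size is an
// assumption chosen only to be "large", it is not taken from the real code.
alignas(16) static float big_table[9000];

// Returns true when p is located on a multiple of `alignment` bytes.
static bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// SSE aligned loads and stores (movaps and friends) need 16-byte alignment.
static bool big_table_is_sse_aligned() {
    return is_aligned(big_table, 16);
}

// Convenience check that can be called once at startup.
static void check_big_table() {
    assert(big_table_is_sse_aligned());
}
```

Calling check_big_table() before the first SSE access should turn a
misplaced array into an assertion failure rather than a fault somewhere
inside the vector code.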


Re: gcc4, static array, SSE alignement

2005-04-06 Thread tbp
On Apr 6, 2005 2:08 PM, tbp [EMAIL PROTECTED] wrote:
 I'll try to pinpoint the problem a bit better.
Alas, since the other day the code using that static array has changed
a bit and i can't reproduce the bug.
So, after all, it really was gcc's fault.

I'll try to dig up the original version.


gcc4, namespace and template specialization problem

2005-04-04 Thread tbp
Hello,

i'm a bit puzzled by the behaviour of gcc4 (old 4.0 & recent 4.1
snapshots) regarding how template specialization should be qualified
wrt namespace:

namespace dummy {
struct foo {
template <int i> void f() {}
};
}
template<> void dummy::foo::f<666>() {}

testcase.cpp:30: error: specialization of 'template<int i> void
dummy::foo::f()' in different namespace
testcase.cpp:27: error:   from definition of 'template<int i> void
dummy::foo::f()'

It has to be written this way:
namespace dummy {
template<> void dummy::foo::f<666>() {}
or
template<> void foo::f<666>() {}
}

Other compilers (gcc 3.4.x, msvc2k3, icc8.1) don't whine.
Am i missing something obvious?
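
A minimal sketch of the accepted form as a complete test case, assuming
nothing beyond the snippet above. The int return type and the
specialization_is_selected helper are additions for illustration only, so
that the selected overload can be observed.

```cpp
#include <cassert>

namespace dummy {
struct foo {
    // Primary member template: returns its template argument.
    template <int i> int f() { return i; }
};

// Explicit specialization declared and defined inside namespace dummy,
// the namespace of which the enclosing class is a member.
template <> int foo::f<666>() { return -1; }
}

// Hypothetical helper: true when f<666> resolves to the specialization
// and any other argument still goes through the primary template.
static bool specialization_is_selected() {
    dummy::foo x;
    return x.f<666>() == -1 && x.f<1>() == 1;
}

// Convenience check built on the helper above.
static void check_specialization() {
    assert(specialization_is_selected());
}
```

Written this way the specialization sits in the same namespace as foo,
which is the form gcc4 accepts.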


Re: gcc4, namespace and template specialization problem

2005-04-04 Thread tbp
On Apr 4, 2005 11:54 AM, Nathan Sidwell [EMAIL PROTECTED] wrote:
  Am i missing something obvious?
 well, not 'obvious', but that is what [14.7.3]/2 says.
I especially don't quite get why specializations have to be defined
that way when the non-specialized version doesn't have to, i.e. this is legit:
namespace dummy {
struct foo {
template <int i> void f();
};
}
template<int i> void dummy::foo::f() { }

But if that's the law...

Thanks for the clue.


Re: gcc4, namespace and template specialization problem

2005-04-04 Thread tbp
On Apr 4, 2005 12:21 PM, Nathan Sidwell [EMAIL PROTECTED] wrote:
 That's not a declaration, it's a definition of an already declared fn.
 the case you had was a definition that was _also_ a declaration.
[...]
 See the difference?
Yes, and i know about it...

 Although it is kind of quirky that you can declare member function
 specializations outside of the class, but not outside of the namespace.
... but it's that inconsistency that puzzled/confused me.
Sorry for the noise, but i don't own a copy of that byzantine standard.
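
To make the declaration/definition distinction concrete, here is a sketch
under my reading of the explanation above: once the specialization has been
declared inside the namespace, its definition may follow outside, fully
qualified. The int return type and the check_definitions helper are
assumptions added for illustration.

```cpp
#include <cassert>

namespace dummy {
struct foo {
    // Member template, declared only.
    template <int i> int f();
};

// Explicit specialization declared (not defined) inside namespace dummy.
template <> int foo::f<666>();
}

// Definition of the already declared primary template, outside the namespace.
template <int i> int dummy::foo::f() { return i; }

// Definition of the already declared specialization, also outside.
template <> int dummy::foo::f<666>() { return -1; }

// Hypothetical helper exercising both definitions.
static void check_definitions() {
    dummy::foo x;
    assert(x.f<7>() == 7);
    assert(x.f<666>() == -1);
}
```

The only line that has to live inside namespace dummy here is the first
declaration of the specialization.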


Re: gcc4, namespace and template specialization problem

2005-04-04 Thread tbp
On Apr 4, 2005 12:50 PM, Jonathan Wakely [EMAIL PROTECTED] wrote:
 GCC 3.4 *does* whine, and I think Intel will in strict mode.
Can't get either gcc 3.4.1 (-Wall) or icc 8.1 with the highest warning
level enabled to whine about it.