Re: a small C (naive) program faster with clang than with gcc

2023-04-26 Thread Gabriel Paubert
On Tue, Apr 25, 2023 at 06:01:22PM +0200, Andy via Gcc wrote:
> I see it in godbolt
> GCC compiles to:
> movsx eax, BYTE PTR [rdi+2]
> cmp al, 9
> ja .L42
> Clang:
> movzx edx, byte ptr [rdi + 2]
> cmp edx, 9
> ja .LBB0_40
> 
> 
> GCC extends with sign, Clang with zero.
> cmp with a 32-bit register is apparently faster than with an 8-bit one

What happens if you compile with -funsigned-char?

There may also be some alignment issue; after all, cmp al,9 is 2 bytes
while cmp edx,9 is 6.

Gabriel

> 
> Mon, 24 Apr 2023 at 17:34 Basile Starynkevitch
>  wrote:
> >
> > Hello all,
> >
> >
> > Consider the naive program (GPLv3+) to solve the cryptaddition
> >
> > `NEUF` + `DEUX` = `ONZE`
> >
> > on https://github.com/bstarynk/misc-basile/blob/master/CryptArithm/neuf%2Bdeux%3Donze/naive0.c
> >   (commit 0d1bd0e)
> >
> >
> > On Linux/x86-64 that source code compiled with gcc-12 -O3 runs twice as
> > slow as with clang -O3
> >
> > (Debian/Sid or Ubuntu/22/10)
> >
> > Feel free to add it to some testsuite!
> >
> >
> > Thanks
> >
> >
> > --
> > Basile Starynkevitch
> > (only mine opinions / les opinions sont miennes uniquement)
> > 92340 Bourg-la-Reine, France
> > web page: starynkevitch.net/Basile/ & refpersys.org




Re: -fprofile-update=atomic vs. 32-bit architectures

2022-11-04 Thread Gabriel Paubert
On Fri, Nov 04, 2022 at 09:27:34AM +0100, Sebastian Huber wrote:
> Hello,
> 
> even recent 32-bit architectures such as RISC-V do not support 64-bit atomic
> operations.  Using -fprofile-update=atomic for the 32-bit RISC-V RV32GC ISA
> yields:
> 
> warning: target does not support atomic profile update, single mode is
> selected
> 
> For multi-threaded applications it is quite important to use atomic counter
> increments to get valid coverage data. I think this fallback is not really
> good. Maybe we should consider using this approach from Jakub Jelinek for
> 32-bit architectures lacking 64-bit atomic operations:
> 
>   if (__atomic_add_fetch_4 ((unsigned int *) , 1, __ATOMIC_RELAXED) == 0)
>     __atomic_fetch_add_4 (((unsigned int *) ) + 1, 1, __ATOMIC_RELAXED);
> 
> https://patchwork.ozlabs.org/project/gcc/patch/19c4a81d-6ecd-8c6e-b641-e257c1959...@suse.cz/#1447334
> 
> Last year I added the TARGET_GCOV_TYPE_SIZE target hook to optionally reduce
> the gcov type size to 32 bits. I am not really sure if this was a good idea.
> Longer running executables may observe counter overflows leading to invalid
> coverage data. If someone wants atomic updates, then the updates should be
> atomic even if this means to use a library implementation (libatomic).
> 
> What about the following approach if -fprofile-update=atomic is given:
> 
> 1. Use 64-bit atomics if available.
> 
> 2. Use
> 
>   if (__atomic_add_fetch_4 ((unsigned int *) , 1, __ATOMIC_RELAXED) == 0)
>     __atomic_fetch_add_4 (((unsigned int *) ) + 1, 1, __ATOMIC_RELAXED);
> 
> if 32-bit atomics are available.

This assumes little-endian byte order.

Cheers,
Gabriel

> 
> 3. Else use a library call (libatomic).
> 
> -- 
> embedded brains GmbH
> Herr Sebastian HUBER
> Dornierstr. 4
> 82178 Puchheim
> Germany
> email: sebastian.hu...@embedded-brains.de
> phone: +49-89-18 94 741 - 16
> fax:   +49-89-18 94 741 - 08
> 
> Registergericht: Amtsgericht München
> Registernummer: HRB 157899
> Vertretungsberechtigte Geschäftsführer: Peter Rasmussen, Thomas Dörfler
> Unsere Datenschutzerklärung finden Sie hier:
> https://embedded-brains.de/datenschutzerklaerung/




Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-06 Thread Gabriel Paubert
On Fri, Aug 06, 2021 at 02:43:34PM +0200, Stefan Kanthak wrote:
> Gabriel Paubert  wrote:
> 
> > Hi,
> > 
> > On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> >> Gabriel Paubert  wrote:
> >> 
> >> 
> >> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> 
> >> >>   .intel_syntax
> >> >>   .text
> >> >>    0:   f2 48 0f 2c c0    cvttsd2si rax, xmm0  # rax = trunc(argument)
> >> >>    5:   48 f7 d8          neg rax
> >> >>                        #  jz  .L0              # argument zero?
> >> >>    8:   70 16             jo  .L0              # argument indefinite?
> >> >>                                                # argument overflows 64-bit integer?
> >> >>    a:   48 f7 d8          neg rax
> >> >>    d:   f2 48 0f 2a c8    cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> >> >>   12:   66 0f 73 d0 3f    psrlq   xmm0, 63
> >> >>   17:   66 0f 73 f0 3f    psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >> >>   1c:   66 0f 56 c1       orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> >> >>   20:   c3          .L0:  ret
> >> >>   .end
> >> > 
> >> > There is one important difference, namely setting the invalid exception
> >> > flag when the parameter can't be represented in a signed integer.
> >> 
> >> Right, I overlooked this fault. Thanks for pointing out.
> >> 
> >> > So using your code may require some option (-fast-math comes to mind),
> >> > or you need at least a check on the exponent before cvttsd2si.
> >> 
> >> The whole idea behind these implementations is to get rid of loading
> >> floating-point constants to perform comparisions.
> > 
> > Indeed, but what I had in mind was something along the following lines:
> > 
> > movq rax,xmm0   # and copy rax to say rcx, if needed later
> > shrq rax,52 # move sign and exponent to 12 LSBs 
> > andl eax,0x7ff  # mask the sign
> > cmpl eax,0x434  # value to be checked
> > ja return   # exponent too large, we're done (what about NaNs?)
> > cvttsd2si rax,xmm0 # safe after exponent check
> > cvtsi2sd xmm0,rax  # conversion done
> > 
> > and a bit more to handle the corner cases (essentially preserve the
> > sign to be correct between -1 and -0.0).
> 
> The sign of -0.0 is the only corner case and already handled in my code.
> Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
> preserved, as in the code GCC generates as well as my code.

I don't know what the standard says about NaNs in this case; I seem to
remember that arithmetic instructions typically produce a QNaN when one of
the inputs is a NaN, whether signaling or not.

> 
> > But the CPU can (speculatively) start the conversions early, so the
> > dependency chain is rather short.
> 
> Correct.
>  
> > I don't know if it's faster than your new code,
> 
> It should be faster.
> 
> > I'm almost sure that it's shorter.
> 
> "neg rax; jo ...; neg rax" is 3+2+3=8 bytes, while the above sequence
> needs 5+4+5+5+2=21 bytes.
> 
> JFTR: better use "add rax,rax; shr rax,53" instead of
>   "shr rax,52; and eax,0x7ff" and save 2 bytes.

Indeed, I don't have the exact size of instructions in my head,
especially since I've not written x86 assembly since the mid 90s.

In any case, with your last improvement, the code is now down to a
single 32 bit immediate constant. And I don't see how to eliminate it...

> 
> Complete properly optimized code for __builtin_trunc is then as follows
> (11 instructions, 44 bytes):
> 
> .code64
> .intel_syntax
> .equ    BIAS, 1023
> .text
>     movq    rax, xmm0    # rax = argument
>     add     rax, rax
>     shr     rax, 53      # rax = exponent of |argument|
>     cmp     eax, BIAS + 53
>     jae     .Lexit       # argument indefinite?

Maybe s/.Lexit/.L0/

>                          # |argument| >= 0x1.0p53?
>     cvttsd2si rax, xmm0  # rax = trunc(argument)
>     cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>     psrlq   xmm0, 63
>     psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>     orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> .L0:    ret
> .end

Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-05 Thread Gabriel Paubert
Hi,

On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> Gabriel Paubert  wrote:
> 
> 
> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> >> Hi,
> >> 
> >> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> >> following code (13 instructions using 57 bytes, plus 4 quadwords
> >> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> >> 
> >> .text
> >>0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
> >> 4: R_X86_64_PC32.rdata
> >>8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
> >> c: R_X86_64_PC32.rdata
> >>   10:   66 0f 28 d8 movapd %xmm0, %xmm3
> >>   14:   66 0f 28 c8 movapd %xmm0, %xmm1
> >>   18:   66 0f 54 da andpd  %xmm2, %xmm3
> >>   1c:   66 0f 2e e3 ucomisd %xmm3, %xmm4
> >>   20:   76 16   jbe38 <_trunc+0x38>
> >>   22:   f2 48 0f 2c c0  cvttsd2si %xmm0, %rax
> >>   27:   66 0f ef c0 pxor   %xmm0, %xmm0
> >>   2b:   66 0f 55 d1 andnpd %xmm1, %xmm2
> >>   2f:   f2 48 0f 2a c0  cvtsi2sd %rax, %xmm0
> >>   34:   66 0f 56 c2 orpd   %xmm2, %xmm0
> >>   38:   c3  retq
> >> 
> >> .rdata
> >> .align 8
> >>0:   00 00 00 00 .LC0:   .quad  0x1.0p52
> >> 00 00 30 43
> >> 00 00 00 00
> >> 00 00 00 00
> >> .align 16
> >>   10:   ff ff ff ff .LC1:   .quad  ~(-0.0)
> >> ff ff ff 7f
> >>   18:   00 00 00 00 .quad  0.0
> >> 00 00 00 00
> >> .end
> >> 
> >> JFTR: in the best case, the memory accesses cost several cycles,
> >>   while in the worst case they yield a page fault!
> >> 
> >> 
> >> Properly optimized, shorter and faster code, using but only 9 instructions
> >> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory 
> >> accesses
> >> and saving at least 16 + 32 bytes, follows:
> >> 
> >>   .intel_syntax
> >>   .text
> >>    0:   f2 48 0f 2c c0    cvttsd2si rax, xmm0  # rax = trunc(argument)
> >>    5:   48 f7 d8          neg rax
> >>                        #  jz  .L0              # argument zero?
> >>    8:   70 16             jo  .L0              # argument indefinite?
> >>                                                # argument overflows 64-bit integer?
> >>    a:   48 f7 d8          neg rax
> >>    d:   f2 48 0f 2a c8    cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> >>   12:   66 0f 73 d0 3f    psrlq   xmm0, 63
> >>   17:   66 0f 73 f0 3f    psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >>   1c:   66 0f 56 c1       orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> >>   20:   c3          .L0:  ret
> >>   .end
> > 
> > There is one important difference, namely setting the invalid exception
> > flag when the parameter can't be represented in a signed integer.
> 
> Right, I overlooked this fault. Thanks for pointing out.
> 
> > So using your code may require some option (-fast-math comes to mind),
> > or you need at least a check on the exponent before cvttsd2si.
> 
> The whole idea behind these implementations is to get rid of loading
> floating-point constants to perform comparisions.

Indeed, but what I had in mind was something along the following lines:

movq rax,xmm0   # and copy rax to say rcx, if needed later
shrq rax,52 # move sign and exponent to 12 LSBs 
andl eax,0x7ff  # mask the sign
cmpl eax,0x434  # value to be checked
ja return   # exponent too large, we're done (what about NaNs?)
cvttsd2si rax,xmm0 # safe after exponent check
cvtsi2sd xmm0,rax  # conversion done

and a bit more to handle the corner cases (essentially preserve the
sign to be correct between -1 and -0.0). But the CPU can (speculatively) 
start the conversions early, so the dependency chain is rather short.

I don't know if it's faster than your new code, I'm almost sure that
it's shorter. Your new code also has a fairly long dependency chain.

> 
> >

Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

2021-08-05 Thread Gabriel Paubert
On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> Hi,
> 
> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> following code (13 instructions using 57 bytes, plus 4 quadwords
> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> 
> .text
>0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
> 4: R_X86_64_PC32.rdata
>8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
> c: R_X86_64_PC32.rdata
>   10:   66 0f 28 d8 movapd %xmm0, %xmm3
>   14:   66 0f 28 c8 movapd %xmm0, %xmm1
>   18:   66 0f 54 da andpd  %xmm2, %xmm3
>   1c:   66 0f 2e e3 ucomisd %xmm3, %xmm4
>   20:   76 16   jbe38 <_trunc+0x38>
>   22:   f2 48 0f 2c c0  cvttsd2si %xmm0, %rax
>   27:   66 0f ef c0 pxor   %xmm0, %xmm0
>   2b:   66 0f 55 d1 andnpd %xmm1, %xmm2
>   2f:   f2 48 0f 2a c0  cvtsi2sd %rax, %xmm0
>   34:   66 0f 56 c2 orpd   %xmm2, %xmm0
>   38:   c3  retq
> 
> .rdata
> .align 8
>0:   00 00 00 00 .LC0:   .quad  0x1.0p52
> 00 00 30 43
> 00 00 00 00
> 00 00 00 00
> .align 16
>   10:   ff ff ff ff .LC1:   .quad  ~(-0.0)
> ff ff ff 7f
>   18:   00 00 00 00 .quad  0.0
> 00 00 00 00
> .end
> 
> JFTR: in the best case, the memory accesses cost several cycles,
>   while in the worst case they yield a page fault!
> 
> 
> Properly optimized, shorter and faster code, using but only 9 instructions
> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory accesses
> and saving at least 16 + 32 bytes, follows:
> 
>   .intel_syntax
>   .text
>    0:   f2 48 0f 2c c0    cvttsd2si rax, xmm0  # rax = trunc(argument)
>    5:   48 f7 d8          neg rax
>                        #  jz  .L0              # argument zero?
>    8:   70 16             jo  .L0              # argument indefinite?
>                                                # argument overflows 64-bit integer?
>    a:   48 f7 d8          neg rax
>    d:   f2 48 0f 2a c8    cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>   12:   66 0f 73 d0 3f    psrlq   xmm0, 63
>   17:   66 0f 73 f0 3f    psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>   1c:   66 0f 56 c1       orpd    xmm0, xmm1   # xmm0 = trunc(argument)
>   20:   c3          .L0:  ret
>   .end

There is one important difference, namely setting the invalid exception
flag when the parameter can't be represented in a signed integer.  So
using your code may require some option (-fast-math comes to mind), or
you need at least a check on the exponent before cvttsd2si.

The last part of your code then goes to take into account the special
case of -0.0, which I most often don't care about (I'd like to have a
-fdont-split-hairs-about-the-sign-of-zero option).

Potentially generating spurious invalid operation and then carefully
taking into account the sign of zero does not seem very consistent.

Apart from this, in your code, after cvttsd2si I'd rather use:
mov rcx,rax # make a second copy to a scratch register
neg rcx
jo .L0
cvtsi2sd xmm1,rax

The reason is latency, in an OoO engine, splitting the two paths is
almost always a win.

With your patch:

cvttsd2si-->neg-?->neg-->cvtsi2sd
  
where the ? means that the following instructions are speculated.  

With an auxiliary register there are two dependency chains:

cvttsd2si-?->cvtsi2sd
 |->mov->neg->jump

Actually some OoO cores just eliminate register copies using register
renaming mechanism. But even this is probably completely irrelevant in
this case where the latency is dominated by the two conversion
instructions.

Regards,
Gabriel



> 
> regards
> Stefan
 



Re: Need help debugging possible 10.3 bad code generation regression from 10.2/9.3 on Mac OS 10.15.7 (Catalina)

2021-04-20 Thread Gabriel Paubert
On Tue, Apr 20, 2021 at 12:20:06PM +, Lucier, Bradley J via Gcc wrote:
> I’m seeing an “Illegal Instruction” fault and don’t quite know how to 
> generate a proper bug report yet.
> 
> This is the compiler:
> 
> [Bradleys-Mac-mini:~] lucier% /usr/local/gcc-10.3.0/bin/gcc -v
> Using built-in specs.
> COLLECT_GCC=/usr/local/gcc-10.3.0/bin/gcc
> COLLECT_LTO_WRAPPER=/usr/local/gcc-10.3.0/libexec/gcc/x86_64-apple-darwin19.6.0/10.3.0/lto-wrapper
> Target: x86_64-apple-darwin19.6.0
> Configured with: ../../gcc-10.3.0/configure --prefix=/usr/local/gcc-10.3.0 
> --enable-languages=c --disable-multilib --enable-checking=release 
> --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk
> Thread model: posix
> Supported LTO compression algorithms: zlib
> gcc version 10.3.0 (GCC) 
> 
> This is the crash report from the console:
> 
> Exception Type:EXC_BAD_INSTRUCTION (SIGILL)
> Exception Codes:   0x000c, 0x
> Exception Note:EXC_CORPSE_NOTIFY
> 
> Termination Signal:Illegal instruction: 4
> Termination Reason:Namespace SIGNAL, Code 0x4
> Terminating Process:   exc handler [98080]
> 
> Application Specific Information:
> dyld2 mode
> 
> Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
> 0   libgambit.dylib   0x00010dfaf010 
> ___SCMOBJ_to_NONNULLSTRING + 1520 (c_intf.c:3280)
> 
> This is the disassembled code (arrow points to crash point):
> 
> (lldb) di -s 0x000103d6 -c 10
> libgambit.dylib`___SCMOBJ_to_NONNULLSTRING:
> 0x103d6 <+1504>: jl 0x103d60026   ; <+1542> at 
> c_intf.c:3282:9
> 0x103d60002 <+1506>: orb%al, 0x31(%rbp)
> 0x103d60005 <+1509>: shlb   %cl, 0x2e(%rsi)

Does GCC ever generate this last instruction (a variable shift of a
byte in memory!)? Even the next to last (register to memory) is only
generated infrequently.

First thing to do would be to start the disassembly earlier, or even at
the beginning of the function, because I believe that the address you
gave is not an instruction boundary, and in this case the output of the
disassembler is nonsense until it resynchronizes on a real boundary.

Regards,
Gabriel


> 0x103d60008 <+1512>: nopl   (%rax,%rax)
> ->  0x103d60010 <+1520>: movl   (%rbp,%r10,4), %esi
> 0x103d60015 <+1525>: callq  0x103fba9a0   ; symbol stub for: 
> ___UTF_8_put
> 0x103d6001a <+1530>: movq   %r10, %rax
> 0x103d6001d <+1533>: addq   $0x1, %r10
> 0x103d60021 <+1537>: cmpq   %r12, %rax
> 0x103d60024 <+1540>: jne0x103d60010   ; <+1520> at 
> c_intf.c:3280:173
> 
> I don’t know why that particular instruction is “Illegal”.
> 
> Can someone suggest a way forward?
> 
> Thanks.
> 
> Brad




Re: RFC: allowing compound assignment operators with designated initializers

2018-10-15 Thread Gabriel Paubert
On Mon, Oct 15, 2018 at 08:13:19PM +0100, Jonathan Wakely wrote:
> On Mon, 15 Oct 2018 at 20:08, Gabriel Paubert  wrote:
> >
> > On Mon, Oct 15, 2018 at 08:11:42PM +0200, Florian Weimer wrote:
> > > * Jonathan Wakely:
> > >
> > > > On Sun, 14 Oct 2018 at 20:46, Florian Weimer  wrote:
> > > >>
> > > >> * Rasmus Villemoes:
> > > >>
> > > >> > This is something I've sometimes found myself wishing was supported. 
> > > >> > The
> > > >> > idea being that one can say
> > > >> >
> > > >> > unsigned a[] = { [0] = 1, [1] = 3, [0] |= 4, ...}
> > > >> >
> > > >> > which would end up initializing a[0] to 5. As a somewhat realistic
> > > >> > example, suppose one is trying to build a bitmap at compile time, but
> > > >> > the bits to set are not really known in the sense that one can group
> > > >> > those belonging to each index in a usual | expression. Something like
> > > >> >
> > > >> > #define _(e) [e / 8] |= 1 << (e % 8)
> > > >> > const u8 error_bitmap[] = { _(EINVAL), _(ENAMETOOLONG), _(EBUSY), 
> > > >> > ... }
> > > >>
> > > >> I think it wouldn't be too hard to extend std::bitset with more
> > > >> compile-time operations to support this, if that's what you need.
> > > >
> > > > It's already doable using C++17:
> > >
> > > I didn't doubt that, it's just that I'd expect to be able to use
> > > std::bitset for this.
> > >
> > > > template
> > > > constexpr auto
> > > > make_error_bitmap()
> > > > {
> > > >   using std::uint8_t;
> > > >   using std::array;
> > > >   constexpr auto max_index = std::max_element({N...}) / 8;
> > > >   array a;
> > > >   [[maybe_unused]] uint8_t sink[] = { a[N/8] |= (1 << (N%8)), ... };
> > > >   return a;
> > > > }
> > > >
> > > > constexpr uint8_t error_bitmap = make_error_bitmap > > > ENAMETOOLONG, EBUSY>();
> > > >
> > > > (This won't compile in C++14 because std::array can't be modified in a
> > > > constant expression until C++17).
> > >
> > > You wrote that without testing it?  I'm impressed.  It's really close.
> > >
> > > template
> > > constexpr auto
> > > make_error_bitmap()
> > > {
> > >   using std::uint8_t;
> > >   using std::array;
> > >   constexpr auto max_index = std::max({ N... });
> > >   array a{};
> > >   [[maybe_unused]] uint8_t sink[] = { a[N/8] |= (1 << (N%8)) ... };
> > >   return a;
> > > }
> > >
> >
> > Hmm, isn't the array roughly 8 times too large?
> 
> Yes, it looks like I pasted the wrong version of  the code, which is
> why  it had the stray comma that Florian corrected. The final version
> of the code I actually wrote is:
> 
> #include 
> #include 
> #include 
> 
> template
> constexpr auto
> make_error_bitmap()
> {
>   using std::uint8_t;
>   using std::array;
>   constexpr auto max_index = std::max({N...}) / 8;
>   array a{};
>   [[maybe_unused]] uint8_t sink[] = { a[N/8] |= (1 << (N%8)) ... };
>   return a;
> }
> 

Looks like it no longer wastes memory.

> constexpr auto error_bitmap = make_error_bitmap EBUSY>();
> 
> (Note that max_index has the division by 8)
> 
> 
> >
> > IOW, shouldn't you declare "array a{};" ?
> >

And this expression had an off-by-one error, sorry.

> > > constexpr auto error_bitmap = make_error_bitmap > > EBUSY>();
> > >
> > > It seems to produce the intended bit pattern.
> >
> > Did you think of big-endian machines (just curious)?
> 
> The code does what the OP asked for, I didn't try to figure out if
> what it did made sense.

Ok.

Gabriel



Re: RFC: allowing compound assignment operators with designated initializers

2018-10-15 Thread Gabriel Paubert
On Mon, Oct 15, 2018 at 08:11:42PM +0200, Florian Weimer wrote:
> * Jonathan Wakely:
> 
> > On Sun, 14 Oct 2018 at 20:46, Florian Weimer  wrote:
> >>
> >> * Rasmus Villemoes:
> >>
> >> > This is something I've sometimes found myself wishing was supported. The
> >> > idea being that one can say
> >> >
> >> > unsigned a[] = { [0] = 1, [1] = 3, [0] |= 4, ...}
> >> >
> >> > which would end up initializing a[0] to 5. As a somewhat realistic
> >> > example, suppose one is trying to build a bitmap at compile time, but
> >> > the bits to set are not really known in the sense that one can group
> >> > those belonging to each index in a usual | expression. Something like
> >> >
> >> > #define _(e) [e / 8] |= 1 << (e % 8)
> >> > const u8 error_bitmap[] = { _(EINVAL), _(ENAMETOOLONG), _(EBUSY), ... }
> >>
> >> I think it wouldn't be too hard to extend std::bitset with more
> >> compile-time operations to support this, if that's what you need.
> >
> > It's already doable using C++17:
> 
> I didn't doubt that, it's just that I'd expect to be able to use
> std::bitset for this.
> 
> > template
> > constexpr auto
> > make_error_bitmap()
> > {
> >   using std::uint8_t;
> >   using std::array;
> >   constexpr auto max_index = std::max_element({N...}) / 8;
> >   array a;
> >   [[maybe_unused]] uint8_t sink[] = { a[N/8] |= (1 << (N%8)), ... };
> >   return a;
> > }
> >
> > constexpr uint8_t error_bitmap = make_error_bitmap > ENAMETOOLONG, EBUSY>();
> >
> > (This won't compile in C++14 because std::array can't be modified in a
> > constant expression until C++17).
> 
> You wrote that without testing it?  I'm impressed.  It's really close.
> 
> template
> constexpr auto
> make_error_bitmap()
> {
>   using std::uint8_t;
>   using std::array;
>   constexpr auto max_index = std::max({ N... });
>   array a{};
>   [[maybe_unused]] uint8_t sink[] = { a[N/8] |= (1 << (N%8)) ... };
>   return a;
> }
> 

Hmm, isn't the array roughly 8 times too large?

IOW, shouldn't you declare "array a{};" ?

> constexpr auto error_bitmap = make_error_bitmap EBUSY>();
> 
> It seems to produce the intended bit pattern.

Did you think of big-endian machines (just curious)? 

Gabriel
> 
> > Of course the response will be "but I don't want to use C++" ...
> 
> Indeed.


[OT] Re: Bit-field struct member sign extension pattern results in redundant

2017-08-18 Thread Gabriel Paubert
On Fri, Aug 18, 2017 at 10:56:10PM +1200, Michael Clark wrote:
> 
> > On 18 Aug 2017, at 10:41 PM, Gabriel Paubert <paub...@iram.es> wrote:
> > 
> > On Fri, Aug 18, 2017 at 10:29:04AM +1200, Michael Clark wrote:
> >> Sorry I had to send again as my Apple mailer is munging emails. I’ve 
> >> disabled RTF.
> >> 
> >> 
> >> This one is quite interesting:
> >> 
> >> - https://cx.rv8.io/g/WXWMTG
> >> 
> >> It’s another target independent bug. x86 is using some LEA followed by SAR 
> >> trick with a 3 bit shift. Surely SHL 27, SAR 27 would suffice. In any case 
> >> RISC-V seems like a nice target to try to fix this codegen for, as its 
> >> less risk than attempting a fix in x86 ;-)
> >> 
> >> - https://github.com/riscv/riscv-gcc/issues/89
> >> 
> >> code:
> >> 
> >>template 
> >>inline T signextend(const T x)
> >>{
> >>struct {T x:B;} s;
> >>return s.x = x;
> >>}
> >> 
> >>int sx5(int x) {
> >>return signextend(x);
> >>}
> >> 
> >> riscv asm:
> >> 
> >>sx5(int):
> >>  slliw a0,a0,3
> >>  slliw a0,a0,24
> >>  sraiw a0,a0,24
> >>  sraiw a0,a0,3
> >>  ret
> >> 
> >> hand coded riscv asm
> >> 
> >>sx5(int):
> >>  slliw a0,a0,27
> >>  sraiw a0,a0,27
> >>  ret
> >> 
> >> x86 asm:
> >> 
> >>sx5(int):
> >>  lea eax, [0+rdi*8]
> >>  sar al, 3
> >>  movsx eax, al
> >>  ret
> >> 
> >> hand coded x86 asm (no worse because the sar depends on the lea)
> >> 
> >>sx5(int):
> >>  shl edi, 27
> >>  sar edi, 27
> >>  movsx eax, dl
> > 
> > Huh? dl is not a subreg of edi!
> > 
> > s/edi/edx/ and it may work.
> > 
> > dil can also be used, but only on 64 bit.
> 
> Sorry I meant dil on x86-64. I was sure that it was possible to extend into 
> another register. 

It is, but only from the first 4 registers on 32 bit: AL, BL, CL, and
DL can be accessed as 8 bit registers, as well as the high part of the
16 bit registers: AH, BH, CH and DH.

This is essentially a left-over of the 16 bit processors (8086 to
80286), carried over without change in the 32 bit processors.

64 bit extensions made registers more orthogonal in this respect, but
there is still some weirdness due to historical reasons.

> I have not done much i386 asm so I am unaware of the constraints. 

Fortunate you are, especially since you hopefully did not have to live
through the 16 bit nightmare of segments and near and far addresses.

> Can the source and dest registers for movsx not differ on i386? I thought
> they could.

They can, but only from the first 4 registers. On the other hand
movsx eax,dh (in Intel's syntax) is perfectly valid.

> In any case, the plot thickens…
> 
> I believe we have bugs on both RISC-V and Aarch64.

Maybe, I don't know either arch well enough, but RISC-V looks intuitive
enough to understand that it is bad both in terms of performance and
code size.

Another arch you could try is Power.

> I found that it at least appears like it is transitioning to a char or short 
> as the break is at 24 and 16 depending on the width, and values over 16 work 
> as one would expect.
> 
> Here is an updated test program: https://cx.rv8.io/g/M9ewNf
> 
>   template 
>   inline T signextend(const T x)
>   {
> struct {T x:B;} s;
> return s.x = x;
>   }
> 
>   int sx3(int x) { return signextend(x); }
>   int sx5(int x) { return signextend(x); }
>   int sx11(int x) { return signextend(x); }
>   int sx14(int x) { return signextend(x); }
>   int sx19(int x) { return signextend(x); }
> 
> I filed a bug on riscv-gcc but I think it is target independent code given 
> there appears to be an issue on Aarch64. AFAICT, Aarch64 should generate a 
> single sbfx for all of the test functions.
> 
> - https://github.com/riscv/riscv-gcc/issues/89
> 
> Should I file a bug on GCC bugzilla given it looks to be target independent?

I think so, but you should get confirmation from someone else.

> 
> On RISC-V, the codegen is much more obviously wrong, but essentially the same 
> thing is happening on Aarch64 but there is only one additional instruction 
> instead of two.
> 
>   sx3(int):
> slliw a0,a0,5
> slliw a0,a0,24
> sraiw a0,a0,24
> sraiw a0,a0,5
> ret
>   sx5(int):
> slliw a0,a0,3
> slliw a0,a0,24
> sraiw a0,a0,24
> sraiw a0,a0,3
> ret
>   sx11(int):
> slliw a0,a0,5
> slliw a0,a0,16
> sraiw a0,a0,16
> sraiw a0,a0,5
> ret
>   sx14(int):
> slliw a0,a0,2
> slliw a0,a0,16
> sraiw a0,a0,16
> sraiw a0,a0,2
> ret
>   sx19(int):
> slliw a0,a0,13
> sraiw a0,a0,13
> ret
> 
> 


Re: Bit-field struct member sign extension pattern results in redundant

2017-08-18 Thread Gabriel Paubert
On Fri, Aug 18, 2017 at 10:29:04AM +1200, Michael Clark wrote:
> Sorry I had to send again as my Apple mailer is munging emails. I’ve disabled 
> RTF.
> 
> 
> This one is quite interesting:
> 
> - https://cx.rv8.io/g/WXWMTG
> 
> It’s another target independent bug. x86 is using some LEA followed by SAR 
> trick with a 3 bit shift. Surely SHL 27, SAR 27 would suffice. In any case 
> RISC-V seems like a nice target to try to fix this codegen for, as it's less
> risk than attempting a fix in x86 ;-)
> 
> - https://github.com/riscv/riscv-gcc/issues/89
> 
> code:
> 
>   template 
>   inline T signextend(const T x)
>   {
>   struct {T x:B;} s;
>   return s.x = x;
>   }
> 
>   int sx5(int x) {
>   return signextend(x);
>   }
> 
> riscv asm:
> 
>   sx5(int):
> slliw a0,a0,3
> slliw a0,a0,24
> sraiw a0,a0,24
> sraiw a0,a0,3
> ret
> 
> hand coded riscv asm
> 
>   sx5(int):
> slliw a0,a0,27
> sraiw a0,a0,27
> ret
> 
> x86 asm:
> 
>   sx5(int):
> lea eax, [0+rdi*8]
> sar al, 3
> movsx eax, al
> ret
> 
> hand coded x86 asm (no worse because the sar depends on the lea)
> 
>   sx5(int):
> shl edi, 27
> sar edi, 27
> movsx eax, dl

Huh? dl is not a subreg of edi!

s/edi/edx/ and it may work.

dil can also be used, but only on 64 bit.

Gabriel


Re: Complex multiplication in gcc

2017-07-17 Thread Gabriel Paubert
On Mon, Jul 17, 2017 at 10:51:21AM -0600, Sean McAllister wrote:
> When generating code for a simple inner loop (instantiated with
> std::complex)
> 
> template 
> void __attribute__((noinline)) benchcore(const cx* __restrict__ aa,
> const cx* __restrict__ bb, const cx* __restrict__ cc, cx* __restrict__
> dd, cx uu, cx vv, size_t nn) {
> for (ssize_t ii=0; ii < nn; ii++) {
> dd[ii] = (
> aa[ii]*uu +
> bb[ii]*vv +
> cc[ii]
> );
> }
> }
> 
> g++ generates the following assembly code (g++ 7.1.0) (compiled with:
> g++ -I. test.cc -O3 -ggdb3 -o test)

[snipped]
> 
> The interesting part is the two calls to __mulsc3, which the docs
> indicate computes complex multiplication according to Annex G of the
> C99 standard.  This leads me to two questions.
> 
> First, disassembling __mulsc3 doesn't seem to contain anything:
> 
> (gdb) disassemble __mulsc3
> Dump of assembler code for function __mulsc3@plt:
>0x00400aa0 <+0>: jmpq   *0x2035d2(%rip)# 0x604078
>0x00400aa6 <+6>: pushq  $0xc
>0x00400aab <+11>: jmpq   0x4009d0
> End of assembler dump.
> 
> What's the cause of this?

That you are disassembling the PLT (note __mulsc3@plt), which redirects
to the real function which is provided by libgcc (on my computer the
exact location is /lib/x86_64-linux-gnu/libgcc_s.so.1).

> 
> Second, since I don't think I'll convince anyone to generate
> non-standard conforming code by default, could the default performance
> of complex multiplication be enhanced significantly by performing the
> isnan() checks required by Annex G and only calling the function to
> fix the results if they fail?  That would move the function call
> overhead out of the critical path at least.

Gabriel


Re: Translation breaks IDE

2017-03-17 Thread Gabriel Paubert
On Fri, Mar 17, 2017 at 12:28:48PM +, Jonathan Wakely wrote:
> On 17 March 2017 at 12:17, Frédéric Marchal wrote:
> > On Friday 17 March 2017 13:32:17 Janne Blomqvist wrote:
> >> Not my area of expertise, but it seems the Glorious Future (TM) in
> >> this area is something called the "language server protocol", see
> >> http://langserver.org/ . Though AFAIK nobody is working on GCC
> >> integration so far.
> >
> > I was looking for a short-term solution. Not something that might possibly
> > be available in 20 years :-)
> 
> Changing GCC's output and getting IDEs to support it isn't exactly
> short term either (and the suggested (E) additions look ugly IMHO).
> 
> > Translations are unusable from within an IDE until gcc offers some solution
> > to let the IDE detect errors and warnings irrespective of the selected
> > language.
> >
> > Currently, every single translated gcc*.po file is affected (Spanish and
> > Indonesian users would still see errors as "error" apparently translates to
> > "error" in those languages).
> 
> Or the translators decided not to translate those words, maybe for this 
> reason.

For Spanish at least, "error" is the correct translation, although the
pronunciation is vastly different.

Gabriel


Re: GCC libatomic ABI specification draft

2016-12-02 Thread Gabriel Paubert
On Thu, Dec 01, 2016 at 11:13:37AM -0800, Bin Fan at Work wrote:
> Hi Szabolcs,
> 
> > On Nov 29, 2016, at 3:11 AM, Szabolcs Nagy  wrote:
> > 
> > On 17/11/16 20:12, Bin Fan wrote:
> >> 
> >> Although this ABI specification specifies that 16-byte properly aligned 
> >> atomics are inlineable on platforms
> >> supporting cmpxchg16b, we document the caveats here for further 
> >> discussion. If we decide to change the
> >> inlineable attribute for those atomics, then this ABI, the compiler and 
> >> the runtime implementation should be
> >> updated together at the same time.
> >> 
> >> 
> >> The compiler and runtime need to check the availability of cmpxchg16b to 
> >> implement this ABI specification.
> >> Here is how it would work: The compiler can get the information either 
> >> from the compiler flags or by
> >> inquiring the hardware capabilities. When the information is not 
> >> available, the compiler should assume that
> >> cmpxchg16b instruction is not supported. The runtime library 
> >> implementation can also query the hardware
> >> compatibility and choose the implementation at runtime. Assuming the user 
> >> provides correct compiler options
> > 
> > with this abi the runtime implementation *must* query the hardware
> > (because there might be inlined cmpxchg16b in use in another module
> > on a hardware that supports it and the runtime must be able to sync
> > with it).
> 
> Thanks for the comment. Yes, the ABI requires libatomic must query the 
> hardware. This is 
> necessary if we want the compiler to generate inlined code for 16-byte 
> atomics. Note that 
> this particular issue only affects x86. 

Why? Power (at least recent ones) has 128 bit atomic instructions
(lqarx/stqcx.) and Z has 128 bit compare and swap. 

Gabriel


Re: Debugger support for __float128 type?

2015-10-01 Thread Gabriel Paubert
On Thu, Oct 01, 2015 at 12:42:05AM +0200, Mark Kettenis wrote:
> > Date: Wed, 30 Sep 2015 19:33:44 +0200 (CEST)
> > From: "Ulrich Weigand" 
> > 
> > Hello,
> > 
> > I've been looking into supporting __float128 in the debugger, since we're
> > now introducing this type on PowerPC.  Initially, I simply wanted to do
> > whatever GDB does on Intel, but it turns out debugging __float128 doesn't
> > work on Intel either ...
> > 
> > The most obvious question is, how should the type be represented in
> > DWARF debug info in the first place?  Currently, GCC generates on i386:
> > 
> > .uleb128 0x3# (DIE (0x2d) DW_TAG_base_type)
> > .byte   0xc # DW_AT_byte_size
> > .byte   0x4 # DW_AT_encoding
> > .long   .LASF0  # DW_AT_name: "long double"
> > 
> > and
> > 
> > .uleb128 0x3# (DIE (0x4c) DW_TAG_base_type)
> > .byte   0x10# DW_AT_byte_size
> > .byte   0x4 # DW_AT_encoding
> > .long   .LASF1  # DW_AT_name: "__float128"
> > 
> > On x86_64, __float128 is encoded the same way, but long double is:
> > 
> > .uleb128 0x3# (DIE (0x31) DW_TAG_base_type)
> > .byte   0x10# DW_AT_byte_size
> > .byte   0x4 # DW_AT_encoding
> > .long   .LASF0  # DW_AT_name: "long double"
> > 
> > Now, GDB doesn't recognize __float128 on either platform, but on i386
> > it could at least in theory distinguish the two via DW_AT_byte_size.
> > 
> > But on x86_64 (and also on powerpc), long double and __float128 have
> > the identical DWARF encoding, except for the name.
> > 
> > Looking at the current DWARF standard, it's not really clear how to
> > make a distinction, either.  The standard has no way to specifiy any
> > particular floating-point format; the only attributes for a base type
> > of DW_ATE_float encoding are related to the size.
> > 
> > (For the Intel case, one option might be to represent the fact that
> > for long double, there only 80 data bits and the rest is padding, via
> > some combination of the DW_AT_bit_size and DW_AT_bit_offset or
> > DW_AT_data_bit_offset attributes.  But that wouldn't help for PowerPC
> > since both long double and __float128 really use 128 data bits,
> > just different encodings.)
> > 
> > Some options might be:
> > 
> > - Extend the official DWARF standard in some way
> > 
> > - Use a private extension (e.g. from the platform-reserved
> >   DW_AT_encoding value range)
> > 
> > - Have the debugger just hard-code a special case based
> >   on the __float128 name 
> > 
> > Am I missing something here?  Any suggestions welcome ...
> > 
> > B.t.w. is there interest in fixing this problem for Intel?  I notice
> > there is a GDB bug open on the issue, but nothing seems to have happened
> > so far: https://sourceware.org/bugzilla/show_bug.cgi?id=14857
> 
> Perhaps you should start with explaining what __float128 actually is
> on your specific platform?  And what long double actually is.
> 
> I'm guessing long double is a what we sometimes call an IBM long
> double, which is essentially two IEEE double-precision floating point
> numbers packed together and that __float128 is an attempt to fix
> history and have a proper IEEE quad-precision floating point type ;).
> And that __float128 isn't actually implemented in hardware.

An IBM mainframe might want to discuss this point with you :-).

See pages 24-25 of http://arith22.gforge.inria.fr/slides/s1-schwarz.pdf

Latencies are decent, not extremely low, but we are speaking of a
processor clocked at 5GHz, so the latencies are 2.2ns for add/subtract, 
4.6ns for multiplications, and ~10ns for division. 

To put things in perspective, how many cycles is a memory access which 
misses in both L1 and L2 caches these days?

> The reason people haven't bothered to fix this, is probably because
> nobody actually implements quad-precision floating point in hardware.
> And software implementations are so slow that people don't really use
> them unless they need to.  Like I did to numerically calculate some
> asymptotic expansions for my Thesis work...

Which would probably run much faster if ported to a z13.

Gabriel


Re: ppc eabi float arguments

2015-09-23 Thread Gabriel Paubert
On Wed, Sep 23, 2015 at 07:09:43PM -0400, Michael Meissner wrote:
> On Tue, Sep 22, 2015 at 01:43:55PM -0400, David Edelsohn wrote:
> > On Tue, Sep 22, 2015 at 1:39 PM, Bernhard Schommer
> >  wrote:
> > > Hi,
> > >
> > > if been working with the windriver Diab c compiler for 32bit ppc for  and
> > > encountered an incompatibly with the eabi version of the gcc 4.83. When
> > > calling functions with more than 8 float arguments the gcc stores the 9th
> > > float argument (and so on) as a float where as the diab compiler stores 
> > > the
> > > argument as a double using 8 byte.
> > >
> > > I checked the EABI document and it seems to support the way the diab
> > > compiler passes the arguments:
> > >
> > > "Arguments not otherwise handled above [i.e. not passed in registers]
> > > are passed in the parameter words of the caller's stack frame. [...]
> > > float, long long (where implemented), and double arguments are
> > > considered to have 8-byte size and alignment, *with float arguments
> > > converted to double representation*. "
> > >
> > > Does anyone know the reason why the gcc passes the argument as single 
> > > float?
> > 
> > Hi, Bernhard
> > 
> > First, are you certain that you have the final version of the 32 bit
> > PPC eABI? There were a few versions in circulation.
> > 
> > Mike may remember the history of this.
> 
> Well I worked on it around 1980 or so. 

You worked on PPC 10 years before the first Power systems were
announced?

Amazing foresight :-)

> I don't remember the details (nor do I
> have the original manuals I was working from).  From this distance, it sure
> looks like a bug, but I'm not sure whether it should be fixed or 
> grand-fathered
> in (and updating the stdargs.h support, if this is the offical calling
> sequence).

Gabriel


Re: [RFC] Design for flag bit outputs from asms

2015-05-05 Thread Gabriel Paubert
On Mon, May 04, 2015 at 12:33:38PM -0700, Richard Henderson wrote:
[snipped]
 (3) Note that ppc is both easier and more complicated.
 
   There we have 8 4-bit registers, although most of the integer
   non-comparisons only write to CR0.  And the vector non-comparisons
   only write to CR1, though of course that's of less interest in the
   context of kernel code.

Actually vector (Altivec) instructions write to CR6. The standard FPU optionally writes to
CR1, but the written value does not exactly depend on the result of the last
instruction; it is instead an accrued exception status.

 
   For the purposes of cr0, the same scheme could certainly work, although
   the hook would not insert a hard register use, but rather a pseudo to
   be allocated to cr0 (constaint x).

Yes, but we might also want to leave the choice of a cr register to the 
compiler.

 
   That said, it's my understanding that dot insns, setting cr0 are
   expensive in current processor generations.  

Not that much if I understand properly power7.md and power8.md: 
no (P7) or one (P8) additional clock for common instructions 
(add/sub and logical), but nothing else, so they are likely a win. 

Shift/rotate/sign extensions seem to have more decoding restrictions: 
the recording (dot) forms are cracked and use 2 integer units.

   There's also a lot less
   of the x86-style operate and set a flag based on something useful.
 

But there is at least an important one, which I occasionally wished I had: 
the conditional stores. 

The overflow bit might also be useful, not really 
for the kernel, but for applications (and mfxer is slow).

Regards,
Gabriel


Re: designated initializers extension and sparc

2013-06-17 Thread Gabriel Paubert
On Mon, Jun 17, 2013 at 01:28:56AM +0300, Sergey Kljopov wrote:
 Hi,
 
 Reading the text
 -
 In a structure initializer, specify the name of a field to
 initialize with `.fieldname =' before the element value. For
 example, given the following structure,
  struct point { int x, y; };
 the following initialization
  struct point p = { .y = yvalue, .x = xvalue };
 is equivalent to
  struct point p = { xvalue, yvalue };
 Another syntax which has the same meaning, obsolete since GCC 2.5,
 is `fieldname:', as shown here:
  struct point p = { y: yvalue, x: xvalue };
 The `[index]' or `.fieldname' is known as a designator. You can also
 use a designator (or the obsolete colon syntax) when initializing a
 union, to specify which element of the union should be used. For
 example,
  union foo { int i; double d; };
  union foo f = { .d = 4 };
 will convert 4 to a double to store it in the union using the second
 element. By contrast, casting 4 to type union foo would store it
 into the union as the integer i, since it is an integer. (See Cast
 to Union.)
 -
 I wrote the following test:
 
 union foo { int i; double d; };
 
 int main(int argc, char **argv)
 {
 union foo f = { .d = 4 };
 
 ASSERT_EQ(0, f.i);
 ASSERT_FEQ(4.0, f.d);
 
 return 0;
 }
 
 ASSERT_EQ and ASSERT_FEQ are some macros which checks the falue and
 gives some error messages.
 
 It seems that this extension should be bi-endian, 

It is not. But this is off-topic on this mailing list, which is about
the development of the compiler, not using it. 

Please try gcc-help instead.

Gabriel


Re: Conditional sibcalls?

2013-02-21 Thread Gabriel Paubert
On Thu, Feb 21, 2013 at 03:57:05PM +0400, Konstantin Vladimirov wrote:
 Hi,
 
 Discovered this optimization possibilty on private backend, but can
 easily reproduce on x86
 
 Consider code, say test.c:
 
 static __attribute__((noinline)) unsigned int*
 proxy1( unsigned int* codeBuffer, unsigned int oper, unsigned int a, unsigned 
 in
 {
 return codeBuffer;
 }
 
 static __attribute__((noinline)) unsigned int*
 proxy2( unsigned int* codeBuffer, unsigned int oper, unsigned int a, unsigned 
 in
 {
 return codeBuffer;
 }
 
 __attribute__((noinline)) unsigned int*
 myFunc( unsigned int* codeBuffer, unsigned int oper)
 {
 if( (oper & 0xF) == 14)
 {
 return proxy1( codeBuffer, oper, 0x22, 0x2102400b);
 }
 else
 {
 return proxy2( codeBuffer, oper, 0x22, 0x1102400b);
 }
 }
 
 With ~/x86-toolchain-4.7.2/bin/gcc -O2 -fomit-frame-pointer -S test.c,
 gcc yields:
 
 myFunc:
 .LFB2:
   .cfi_startproc
   andl  $15, %esi
   cmpl  $14, %esi
   je  .L6
   jmp proxy2.isra.1
   .p2align 4,,10
   .p2align 3
 .L6:
   jmp proxy1.isra.0
 
 Which can be simplified to:
 
 myFunc:
 .LFB2:
   .cfi_startproc
   andl  $15, %esi
   cmpl  $14, %esi
   je  proxy2.isra.1  // --- conditional sibling call here
   .p2align 4,,10
   .p2align 3
   jmp proxy1.isra.0

Apart from the je/jne thinko you mentioned, the .p2align directives
are completely useless in this case.

Regards,
Gabriel


Re: SPEC2000 comparison of LLVM-3.2 and coming GCC4.8 on x86/x86-64

2013-02-08 Thread Gabriel Paubert
On Thu, Feb 07, 2013 at 11:46:04AM -0500, Vladimir Makarov wrote:
 On 02/07/2013 11:09 AM, Richard Biener wrote:
 On Thu, Feb 7, 2013 at 4:26 PM, Vladimir Makarov vmaka...@redhat.com wrote:
 I've add pages comparing LLVM-3.2 and coming GCC 4.8 on
 http://vmakarov.fedorapeople.org/spec/.
 
 The pages are accessible by links named GCC-LLVM comparison, 2013, x86 and
 x86-64 SPEC2000 under link named 2013. You can find these links at the
 bottom of the left frame.
 
 If you prefer email for reading the comparison, here is the copy of page
 accessible by link named 2013:
 
 
 Comparison of GCC and LLVM in 2013.
 
 This year the comparison is done on coming *GCC 4.8* and *LLVM 3.2*
 which was released at the very end of 2012.
 
 As usually I am focused mostly on the compiler comparison as
 *optimizing* compilers on major platform x86/x86-64.  I don't consider
 other aspects of the compilers as quality of debug information
 (especially in optimizations modes), supported languages, standards
 and extensions (e.g. OMP), supported targets and ABI, support of
 just-in-time compilation etc.
 
 This year I did the comparison using following major options
 equivalent with my point of view:
 
 o *-O0 -g, -Os, -O1, -O2, -O3, -O4* for LLVM3.2
 o *-O0 -g, -Os, -O1, -O2, -O3, -Ofast -flto* for GCC4.8
 On the web-page you say that you use -Ofast -fno-fast-math (because
 that is what LLVM does with -O4).  For GCC that's equivalent to -O3
 (well, apart from that you enable -flto).  So you can as well say you
 tested -O3 -flto.
 I guess -Ofast -fno-fast-math is not just -O3 but you are right it
 is pretty close.
 For 32bit you used -mtune=corei7 -march=i686 - did you disable
 CPU features like SSE on purpose?  Vectorization at -O3+ should
 have used those (though without -ffast-math FP vectorization is
 seriously restricted).
 Yes, I did it on purpose.  Some 32-bit Linux distributions (e.g.
 Fedora) uses this architecture.  Another reason is that I'd like to
 see how good compilers work with fp stack (I got an impression that
 LLVM generates less fp stack regs shuffles, so I think we could
 improve regstack.c).  Although it is a dying architecture and
 probably we should pay more attention to SSE architectures

The FP stack is fortunately being put to rest in the museum of horrors.

This said for processors starting from the Pentium and until SSE2 (the 
one with double precision) became available, the FXCH instruction was 
made essentially free from an execution unit and timing point of view 
(only taking a decoder slot). Stack shuffles implemented with the FXCH 
instruction were the only way to extract parallelism from floating point 
code. So as long as the stack shuffles are using FXCH, they are a valid
optimization, and only hint at some historical burden that LLVM never
has had to bear with.

Regards,
Gabriel


Re: not-a-number's

2013-01-17 Thread Gabriel Paubert
On Thu, Jan 17, 2013 at 12:21:04PM +0100, Vincent Lefevre wrote:
 On 2013-01-17 06:53:45 +0100, Mischa Baars wrote:
  Also this was not what I intended to do, I was trying to work with quiet
  not-a-numbers explicitly to avoid the 'invalid operation' exception to be
  triggered, so that my program can work it's way through the calculations
  without being terminated by the intervention of a signal handler. When any
  not-a-number shows up in results after execution, then you know something
  went wrong. When the program does what it is supposed to do, you could
  remove any remaining debug code, such as the boundary check in the
  trigonometric functions.
 
 Checking whether a (final) result is NaN is not always the correct way
 to detect whether something was wrong, because a NaN can disappear.
 For instance, pow(NaN,0) gives 0, not NaN. 

Actually 1, but that's a detail. NaNs typically propagate, but there
are a few exceptions.

 You need to check the
 invalid operation exception flag (if you don't want a trap). And
 this flag will also be set by the <=, <, >, >= comparisons when one
 of the operands is NaN, which should be fine for you.

Hmm, are you sure that this part is properly implemented in gcc, using
unordered compare versus ordered compare instructions depending on 
the operator you use?

At least on PPC, floating point compares always use the unordered 
instruction (fcmpu) and never the ordered one (fcmpo), so exception
flags are never set. 

Regards,
Gabriel


Re: Deprecate i386 for GCC 4.8?

2012-12-20 Thread Gabriel Paubert
On Thu, Dec 13, 2012 at 12:51:29PM +0100, David Brown wrote:
 Is there much to be gained from keeping 486 support - or
 alternatively, is there much to be gained by dropping it at the same
 time? 

In practice, there is very little difference between the 486 and the Pentium
in the code that a compiler will generate (tuning is different, 
especially for the FPU). 

After digging the instruction set differences, the following instructions
were implemented on the 486 but not on the 386:
1) bswap: ntohl and htonl directly map to it, heavily used in the
   networking code, but not earth shattering.
2) xadd and cmpxchg: basic building blocks for atomic operations, 
   the main reason to drop i386 IMHO.
3) cache management instructions: invd and wbinvd
4) tlb management: invlpg

3) and 4) don't affect the compiler, only the kernel. However, the main
reason to drop the 386 in the Linux kernel is the fundamentally broken
behaviour of accessing user space from the kernel, fixed in the 486 
and later by the WP bit. This caused memory management nightmares in
the kernel (actually, I'm not even convinced that there were no
security holes).

Between the 486 and the Pentium, the differences are:
+1) cmpxchg8b: may be useful for some atomic operations, but not 
all 32 bit architectures have 64 bit atomic operations (to
my knowledge, only x86, m68k and s390), so arm/mips/ppc and
others have to deal with this limitation.
   
+2) cpuid: its results may be used when selecting libraries and 
code paths, but hopefully from variables set after executing
it once at program/library initialization.

+3) rdtsc: maybe useful for performance measurements but apart
from this abstracted in gettimeofday() and more modern
equivalents 

+4) rdmsr/wrmsr: only kernel code

-5) mov to/from test registers: removed on Pentium, only kernel code

+6) rsm (exit the bloody SMM mode): never used by a compiler (without 
considering the fact that SMM is a potential latency nightmare 
if used on embedded systems)

Actually, I believe that some (late) 486 implemented cpuid and rdtsc.

So only point 1) from the above list is relevant.

 If 586 has been the standard configuration for the last two
 releases of gcc, and 686 has been the standard for most 32-bit x86
 Linux distributions for a number of years, perhaps it is worth
 deprecating 486 (and maybe even 586) at the same time.  After all,
 deprecating targets does not mean that they are dead - users can
 always use older versions of gcc, and can argue against the
 deprecation if it is affecting them.

Either drop both 486 and 586 or none. There is nothing significant
to be gained from dropping only the former. PPro has cmov/fcmov,
floating point compares directly setting flags, but still only the
x87 FPU. After this the next significant step is SSE2 in Pentium-III
(SSE is single precision only, not a general purpose FPU).

Gabriel


Re: C++98/C++11 ABI compatibility for gcc-4.7

2012-06-15 Thread Gabriel Paubert
On Fri, Jun 15, 2012 at 10:52:27PM +0200, Paolo Carlini wrote:
 Hi,
 
  On Fri, Jun 15, 2012 at 3:12 PM, James Y Knight f...@fuhm.net wrote:
  
  IMO, at the /very least/, libstdc++ should go ahead and change std::string
  to be the new implementation. Once std::string is ABI-incompatible between
  the modes, there's basically no chance that anyone would think that
  linking things together from the two modes is a good thing to try, for
  more than a couple minutes.
  
  
  Agreed.
 
 Seconded. I find the idea simple but cute.

I strongly agree.

 
 Paolo
 
  
 
 Ps: I'm not volunteering to do the actual work ;) kidding, sooner or later 
 have to do it anyway. Just not right *now*. I could do it in a couple of 
 weeks, definitely in time for 4.8.

Does this basically mean that compiling C++ code with GCC4.7 will be playing 
Russian roulette?

Gabriel




Re: %pc relative addressing of string literals/const data

2010-10-06 Thread Gabriel Paubert
On Tue, Oct 05, 2010 at 10:55:36PM +0200, Joakim Tjernlund wrote:
 Richard Henderson r...@redhat.com wrote on 2010/10/05 20:56:55:
 
  On 10/05/2010 06:54 AM, Joakim Tjernlund wrote:
   Ian Lance Taylor i...@google.com wrote on 2010/10/05 15:47:38:
   Joakim Tjernlund joakim.tjernl...@transmode.se writes:
   While doing relocation work on u-boot I often whish for strings/const 
   data
   to be accessible through %pc relative address rather than and ABS 
   address
   or through GOT. Has this feature ever been considered by gcc?
  
   The feature can only be supported on processors which are reasonably
   able to support it.  So, what target are you asking about?
  
   In my case PowerPC but I think most arch's would want this feature.
   Is there arch's where this cannot be support at all or just within
   some limits? I think within limits could work for small apps
   like u-boot.
 
  PowerPC doesn't really have the relocations for pc-relative offsets
  within code -- primarily because it doesn't have pc-relative insns
  to match.  Nor, unfortunately, does it have got-relative relocations,
  like some other targets.  These are normally how we get around not
  having pc-relative relocations and avoiding tables in memory.  C.f.
 
#pragma GCC visibility push(hidden)
extern int x, y;
int foo(void) { return x + y; }
 
  Without pragma (-O2 -fpic):
  i386:
  movl y@GOT(%ecx), %eax
  movl x@GOT(%ecx), %edx
  movl (%eax), %eax
  addl (%edx), %eax
 
  alpha:
  ldq $1,y($29)   !literal
  ldl $0,0($1)
  ldq $1,x($29)   !literal
  ldl $1,0($1)
  addl $0,$1,$0
 
  In both cases here, we have load the address from memory, from
  the GOT table, then load the data (X or Y) from memory and
  perform the addition.
 
 
  With pragma:
  i386:
  movl y@GOTOFF(%ecx), %eax
  addl x@GOTOFF(%ecx), %eax
 
  alpha (-fpic):
  ldl $1,x($29)   !gprel
  ldl $0,y($29)   !gprel
  addl $0,$1,$0
 
  alpha (-fPIC):
  ldah $1,y($29)  !gprelhigh
  ldl $0,y($1)!gprellow
  ldah $1,x($29)  !gprelhigh
  ldl $1,x($1)!gprellow
  addl $0,$1,$0
 
  In all cases here, we've replaced the load from the GOT table
  with arithmetic.  In the case of i386 this gets folded into the
  memory reference.  The alpha cases are essentially the same as
  what ppc could generate if it had the appropriate relocations.
 
 I don't do x86 or alpha so let me ask: If you run the code on an address
 != link address, will it do the right thing?
 
 I tested the #pragma/no #pragma on PPC and the resulting code
 was the same:
 /* with #pragma, -fpic -O2 -mregnames
 foo:
   stwu %r1,-16(%r1)
   mflr %r12
   bcl 20,31,.LCF0
 .LCF0:
   stw %r30,8(%r1)
   mflr %r30
   addis %r30,%r30,_GLOBAL_OFFSET_TABLE_-.LCF0@ha
   addi %r30,%r30,_GLOBAL_OFFSET_TABLE_-.LCF0@l
   mtlr %r12
   lwz %r9,y@got(%r30)
   lwz %r3,0(%r9)
   lwz %r9,x@got(%r30)
   lwz %r30,8(%r1)
   addi %r1,%r1,16
   lwz %r0,0(%r9)
   add %r3,%r3,%r0
   blr
  */
 
 
 You can get at the GOT table using PC relative addressing so why not
 strings or data in a similar fashion?

Because you access the GOT with a single (16 bit offset) instruction.
If you add .rodata, .data and .bss, you will likely overflow it 
quite rapidly.

Did you look at -mrelocatable? 

I don't know whether it can solve all your problems (probably not),
and there are regular threats of removing it, or at least there
were, since I see that the documentation was much improved in August. 

Anyway, I used it a decade ago to write a bootloader that:

a) was loaded at an unpredictable and unconfigurable address
depending on the mood of the firmware (really, it depended at 
least on the media from which you booted, even exactly the 
same binary image)

b) the first thing the bootloader did was to run the relocation code
for the address at which it had been loaded by the firmware and 
find the free memory areas from tables provided by the firmware.
The code in this part could not use any global pointer variable, 
AFAIR, but it was short.

c) the bootloader then moved itself where it could, re-ran
the relocation code, and then could finally interact a bit with
the user to change kernel options, uncompress the kernel 
and give it control.

I have not touched the code in 8 or 9 years, but it is still
the bootloader used here by ~20 MVME machines.

Regards,
Gabriel


Re: porting GCC to a micro with a very limited addressing mode --- what to write in LEGITIMATE_ADDRESS, LEGITIMIZE_ADDRESS and micro.md ?!

2010-01-25 Thread Gabriel Paubert
On Mon, Jan 25, 2010 at 01:34:09PM +0100, Sergio Ruocco wrote:
 
 Hi everyone,
 
 I am porting GCC to a custom 16-bit microcontroller with very limited
 addressing modes. Basically, it can only load/store using a (general
 purpose) register as the address, without any offset:
 
   LOAD (R2) R1; load R1 from memory at address (R2)
   STORE R1 (R2)   ; store R1 to memory at address (R2)
 
 As far as I can understand, this is more limited than the current
 architectures supported by GCC that I found in the current gcc/config/*.

The Itanium (ia64) has the same limited choice of addressing modes. 

Gabriel


Re: [PATCH] Re: PowerPC : GCC2 optimises better than GCC4???

2010-01-07 Thread Gabriel Paubert
On Wed, Jan 06, 2010 at 04:18:06PM +0100, Jakub Jelinek wrote:
 On Wed, Jan 06, 2010 at 10:15:58AM +, Andrew Haley wrote:
  On 01/06/2010 09:59 AM, Mark Colby wrote:
   Yabbut, how come RTL cse can handle it in x86_64, but PPC not?
  
   Probably because the RTL on x86_64 uses and's and ior's, but PPC uses
   set's of zero_extract's (insvsi).
  
   Aha!  Yes, that'll probably be it.  It should be easy to fix cse to
   recognize those too.
  
   I'm not familiar with the gcc source yet, but just in case I get the
   time to look at this, could anyone give me a file/line ref to dive
   into and examine?
  
  Would you believe cse.c?  :-)
  
  I can't find the line without investigating further.
  
  Andrew.
  
  P.S.  This is a nontrivial task if you don't know gcc, but might be a
  good place for a beginner to start.  OTOH, might be hard: no way to
  know without digging.
 
 I've digged a little bit and this optimizes the testcase on PowerPC 32-bit.
 The patch is completely untested though.
 
 On PowerPC 64-bit which apparently doesn't use ZERO_EXTRACT in this case I
 see a different issue.  It generates
 li 3,0
 ori 3,3,32820
 sldi 3,3,16
 while IMHO 2 insns to load the constant would be completely sufficient,

Indeed.

 apparently rs6000_emit_set_long_const needs work.
   lis 3,0x8034
   extsw 3,3
 or
   li 3,0x401a
   sldi 3,3,17
 etc. do IMHO the same.

Huh? I don't think so:

- first one loads 0xffff_ffff_8034_0000 in r3, and the extsw looks redundant

- second one ends up with 0x0000_0000_8034_0000 in r3, and looks optimal. 

Gabriel


Re: Why no strings in error messages?

2009-09-01 Thread Gabriel Paubert
On Wed, Aug 26, 2009 at 03:02:44PM -0400, Bradley Lucier wrote:
 On Wed, 2009-08-26 at 20:38 +0200, Paolo Bonzini wrote:
  
   When I worked at AMD, I was starting to suspect that it may be more 
   beneficial
   to re-enable the first schedule insns pass if you were compiling in 64-bit
   mode, since you have more registers available, and the new registers do 
   not
   have hard wired uses, which in the past always meant a lot of spills 
   (also, the
   default floating point unit is SSE instead of the x87 stack).  I never got
   around to testing this before AMD and I parted company.
  
  Unfortunately, hardwired use of %ecx for shifts is still enough to kill 
  -fschedule-insns on AMD64.
 
 The AMD64 Architecture manual I found said that various combinations of
 the RSI, RDI, and RCX registers are used implicitly by ten instructions
 or prefixes, and RBX is used by XLAT, XLATB.  So it appears that there
 are 12 general-purpose registers available for allocation.

XLATB is essentially useless (well maybe had some uses back in 16 bit days, 
when only a few registers could be used for addressing) and never generated
by GCC. 

However %ebx is used for PIC addressing in 32 bit mode so it is not 
always free either (I don't know about PIE code).

In 64 bit mode, PIC/PIE use PC relative addressing, so this gives 
you actually 9 more free registers than in 32 bit mode.

However for some reason you glossed over the case of integer division
which always uses %edx and %eax. This is true even when dividing by a 
constant (non power of 2) in which case gcc will often use a widening 
multiply instead, whose results are in %edx:%eax, so it's almost a wash 
in terms of fixed register usage (not exactly, the divisions use %edx:%eax 
as dividends and need the divisor somewhere else, while the widening
multiply use %eax as one input but %edx can be used for the other).

(As a side note, %edx and %eax are also special with regard to I/O port
accesses but this is only of interest in device drivers).

 Are 12 registers not enough, in principle, to do scheduling before
 register allocation? 

I don't know, but I would say that you have about 14 registers
for address computations/indexing since you seem to be interested
in FP code. I would think that it is sufficient for many inner
loops (but not all, it really depends on the number of arrays
that you access and the number of independent indexes that
you have to keep).

 I was getting a 15% speedup on some numerical
 codes, as pre-scheduling spaced out the vector loads among the
 floating-point computations.

Well vector loads and floating point computations do not have anything 
to do with integer register choices. The 16 FP registers are 
nicely orthogonal (compared to the real nightmare that the x87 stack was).
In practice you schedule on 16 FP registers and 14 (15 if you omit
the frame pointer) addressing/indexing/counting registers.

In this type of code there are typically very few instructions with
fixed register constraints, and the less likely are the string
instructions. Shifts of variable amount and integer divides
are still possible, but unlikely.

Gabriel


Re: As-if Infinitely Ranged Integer Model

2009-07-27 Thread Gabriel Paubert
On Fri, Jul 24, 2009 at 06:25:12PM +0200, Laurent GUERBY wrote:
 On Fri, 2009-07-24 at 12:03 -0400, Robert Dewar wrote:
  Indeed an alternative approach to handling this problem in GCC would
  be to adapt the Ada model for C and C++ which would not be too hard
  to do I suspect. Then gcc could be improved to handle this model
  better and more effectively with respect to optimization, and both
  C/C++ and Ada would benefit.
 
 IIRC the Ada rules allows to raise Constraint_Error earlier
 rather than later which is interesting for check removal in loops for
 very common patterns:
 
for I in T'First .. Dynamic_N loop
   T (I) := 0.0; -- generate check I in T'First .. T'Last
end loop;
 
 =>
 
if Dynamic_N >= T'First and Dynamic_N > T'Last then

Huh? I can't understand the first comparison. 

Actually Ada is not Fortran-66 and allows empty loops, no? 

   raise Constraint_Error;
end if;
for I in T'First .. Dynamic_N loop
   T (I) := 0.0; -- no generated check
end loop; 
 
 But I might be wrong in my recollection (and I don't think GNAT
 takes advantage of that), Robert?

I don't think so. And the code quality when checking for
overflows was abysmal last time I tried.

Regards,
Gabriel


Re: As-if Infinitely Ranged Integer Model

2009-07-27 Thread Gabriel Paubert
On Mon, Jul 27, 2009 at 10:46:53AM +0200, Laurent GUERBY wrote:
 On Mon, 2009-07-27 at 09:34 +0200, Gabriel Paubert wrote:
  On Fri, Jul 24, 2009 at 06:25:12PM +0200, Laurent GUERBY wrote:
  for I in T'First .. Dynamic_N loop
 T (I) := 0.0; -- generate check I in T'First .. T'Last
  end loop;
   
  =>
   
  if Dynamic_N >= T'First and Dynamic_N > T'Last then
  
  Huh? I can't understand the first comparison. 
  
  Actually Ada is not Fortran-66 and allows empty loops, no? 
 
 Ada for loop over A .. B will be empty if A > B and we obviously 
 must not raise an exception if the loop is empty hence the first
 comparison is equivalent to not (T'First > Dynamic_N) which is true
 when the loop is not empty.

Thanks, this makes sense. I don't understand how I did not see it
when first reading the code. Sorry for the noise.
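A C sketch of that hoisting (names and the return-code convention are mine): the per-iteration range check becomes a single test taken only when the loop would be non-empty.

```c
/* fill the slice corresponding to first..n with zeroes; the bound is
   validated once up front (raising would become returning -1 here)
   instead of checking i <= last on every store */
static int fill_checked(double *t, int first, int last, int n)
{
    if (n >= first && n > last)
        return -1;                 /* would raise Constraint_Error */
    for (int i = first; i <= n; i++)
        t[i - first] = 0.0;        /* no per-iteration check needed */
    return 0;
}
```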

 
  I don't think so. And the code quality when checking for
  overflows was abysmal last time I tried.
 
 We're talking about range checking here, not arithmetic overflow
 checking (which is another topic where GCC infrastructure change
 could help Ada of course).

That would be a huge undertaking.

Regards,
Gabriel


Re: help for arm avr bfin cris frv h8300 m68k mcore mmix pdp11 rs6000 sh vax

2009-03-17 Thread Gabriel Paubert
On Fri, Mar 13, 2009 at 06:06:41PM +0100, Paolo Bonzini wrote:
 
  Hm.  In fold-const.c we try to make sure to produce the same result
  as the target would for constant-folding shifts.  Thus, Paolo, I think
  what fold-const.c does is what we should assume for
  !SHIFT_COUNT_TRUNCATED.  No?
  Unfortunately it is not so simple.  fold-const.c is actually wrong, as
  witnessed by this program
 
   static inline int f (int s) { return 2 << s; }
   int main () { printf ("%d\n", f(33)); }
 
  which prints 4 at -O0 and 0 at -O2 on i686-pc-linux-gnu.
  
  But this is because i?86 doesn't define SHIFT_COUNT_TRUNCATED, no?
 
 Yes, so fold-const.c is *not* modeling the target in this case.
 
 But on the other hand, this means we can get by with documenting the
 effect of a conservative truncation mask: no wrong code bugs, just
 differences between optimization levels for undefined programs.  I'll
 check that the optimizations done based on the truncation mask are all
 conservative or can be made so.
 
 So, I'd still need the information for arm and m68k, because that
 information is about the bitfield instructions.  For rs6000 it would be
 nice to see what they do for 64-bits (for 32-bit I know that PowerPCs
 truncate to 6 bits, not 5).  

For 64 bit variable rotate/shifts, the count is truncated to 7 bits.
That's consistent on rs6000 in order to simplify the implementation 
of double precision shifts (the example code is in some appendices 
of the architecture books).
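The convenience of 7-bit truncation can be modelled in C (a sketch of the semantics, not actual Power code): a shift whose count is taken modulo 128 and yields zero for counts 64..127 makes the double-word shift sequence branch-free even for a count of 0.

```c
#include <stdint.h>

/* model of a 64-bit shift whose count is truncated to 7 bits: counts
   64..127 give a zero result, as sld/srd do */
static uint64_t shl7(uint64_t x, unsigned n)
{
    n &= 127;
    return n >= 64 ? 0 : x << n;
}

static uint64_t shr7(uint64_t x, unsigned n)
{
    n &= 127;
    return n >= 64 ? 0 : x >> n;
}

/* high word of a 128-bit left shift of hi:lo by n (0..63): no special
   case for n == 0, because shr7(lo, 64) is simply 0 */
static uint64_t shl128_hi(uint64_t hi, uint64_t lo, unsigned n)
{
    return shl7(hi, n) | shr7(lo, 64 - n);
}
```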

Gabriel


Re: PowerPC lwsync Instruction

2008-06-23 Thread Gabriel Paubert
On Thu, Jun 19, 2008 at 03:50:34PM -0500, Joel Sherrill wrote:
 Andrew Pinski wrote:
 On Thu, Jun 19, 2008 at 1:36 PM, Joel Sherrill
 [EMAIL PROTECTED] wrote:
   
 Hi,
 
 I ran into something tracking down a test
 failure on psim and now thing there is a
 target specific issue that needs addressing.
 
 
 lwsync is sync with the bit 9 set.  So it should be ok as it was a
 reserved field and was supposed to be ignored on the hardware which
 did not implement those bits and have it as a sync (but I could be
 wrong).
   
 I don't have access to a real 603e of this vintage but
 my Sept 1995 603e User's Manual shows the sync
 instruction has having:
 
 0-5  - all 1's (value in table is 31)
 6-20   - all 0's (dark grey indicating not implemented)
 21-30 - 598
 31   - 0
 

I have 6 PPC603ev (5 at revision 2.1 and one at 18.1
according to /proc/cpuinfo, some of them running almost 
nonstop for 11 years, all of them for 7 years at least) 
and they all accept an: 

asm volatile("lwsync" : : : "memory");

between two printf() without trapping  (I see the output of
the second printf).

I also tried with ptesync, which is also accepted.

Regards,
Gabriel


Re: [PATCH][4.3] Deprecate -ftrapv

2008-03-03 Thread Gabriel Paubert
On Mon, Mar 03, 2008 at 01:38:01AM +0100, Andi Kleen wrote:
 [EMAIL PROTECTED] (Ross Ridge) writes:
 
  Robert Dewar writes:
  Yes, and that is what we would want for Ada, so I am puzzled by your
  sigh. All Ada needs to do is to issue a constraint_error exception,
  it does not need to know where the exception came from or why except
  in very broad detail.
  
  Unless printing This application has requested the Runtime to terminate
  it in an unusual way. counts an issuing a contraint_error in Ada,
  it seems to me that -ftrapv and Ada have differing requirements.
  How can you portabilty and correctly generate a constraint_error if
  the code generated by -ftrapv calls the C runtime function abort()?
  On Unix-like systems you can catch SIGABRT, but even there how do you
  tell that it didn't come from CTRL-\, a signal sent from a different
  process, or abort() called fom some other context?  With INTO I don't
  see any way distignuish the SIGSEGV it generates on Linux from any of
  the myriad other ways a SIGSEGV can be generated.
 
 Easy: The signal frame that is passed as an argument to the signal handler 
 has a trapno member than will contain 4 for INTO. The only other case where 
 it would contain 4 would be a explicit int 4

Yes, but it seems that INTO has been removed for 64 bit mode. In this
case the best solution is probably to insert conditional jumps (jo)
both in 32 and 64 bit mode for uniformity. PPC can only use conditional
jumps too, although the easily testable overflow bit is sticky so you 
don't need a test after every instruction. 

The code with conditional jumps is bigger but less dependent on the 
OS environment or of any user code trying to install its own signal 
handlers (especially for SIGSEGV which multiplexes so many exception
causes). The performance impact is probably small since the jumps
will normally be correctly predicted as not taken.
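A hedged sketch of that code shape in C, using the checked-arithmetic builtins that later GCC versions provide (on x86 the test typically compiles to an add followed by a single, normally-not-taken jo):

```c
#include <stdlib.h>

/* trapping addition: the overflow test is a flag check, and the
   out-of-line abort() stands in for raising SIGFPE or an exception */
static int add_trapv(int a, int b)
{
    int r;
    if (__builtin_add_overflow(a, b, &r))
        abort();
    return r;
}
```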

Gabriel


Re: powercp-linux cross GCC 4.2 vs GCC 4.0.0: -Os code size regression?

2008-01-17 Thread Gabriel Paubert
On Wed, Jan 16, 2008 at 04:55:19PM +0300, Sergei Poselenov wrote:
 Hello,
 
 I've just noted an error in my calculations: not 40%, but 10%
 regression (used gdb to do the calculations and forgot to convert
 inputs to float). Sorry.
 
 But the problem still persists for me - I'm building an embedded
 firmware (U-Boot) and it doesn't fit into the reserved space
 anymore.
 
[snipped]

 As for the CSiBE results - the average regression is
 3%, including top 3 winners:
 100% (32768 vs 16384 for linux-2.4.23-pre3-testplatform - 
 arch/testplatform/kernel/init_task)

A change from an exact power of 2 to the next one looks very
suspicious: I seriously doubt that it is a code generation
or instruction choice issue. While there might be a relatively
small increase in size inherent to the compiler, it looks like 
the size is then rounded up to the next power of 2.

Do you set the right options for your particular processor
(-Os might not override some scheduling decisions and the
default target processor might have changed between GCC
releases)?

Regards,
Gabriel


Re: powercp-linux cross GCC 4.2 vs GCC 4.0.0: -Os code size regression?

2008-01-17 Thread Gabriel Paubert

Hello Sergei,

On Thu, Jan 17, 2008 at 03:13:59PM +0300, Sergei Poselenov wrote:
 I don't know now, actually, this is what I'm asking. As for the
 target processor - as I stated in the initial message:
 
 ...
 Currently, it builds as following:
 ppc-linux-gcc -g -Os -fPIC -ffixed-r14 -meabi -fno-strict-aliasing 
 -D__KERNEL__ -DTEXT_BASE=0xfffc -I/work/psl/tmp/u-boot/include 
 -fno-builtin -ffreestanding -nostdinc -isystem 
 /opt/eldk-4.2-01-08/usr/bin/../lib/gcc/powerpc-linux/4.2.2/include 
 -pipe -DCONFIG_PPC -D__powerpc__ -DCONFIG_4xx -ffixed-r2 -ffixed-r29 
 -mstring -msoft-float -Wa,-m440 -mcpu=440 -DCONFIG_440=1 -Wall 
 -Wstrict-prototypes -c -o interrupts.o interrupts.c
 ...
 
 Note the -mcpu=440 switch.

Doh, I missed this, sorry.

 
 I removed all -ffixed option (just for test - we surely need them)
  - it doesn't change the size of the resultant gcc-4.2.2 code.

I'm not sure that having -ffixed-r29 is a wise choice when you are
looking for small code size. It might prevent the use of load/store 
multiple in prologue and epilogue code.

Regards,
Gabriel


Re: powercp-linux cross GCC 4.2 vs GCC 4.0.0: -Os code size regression? [Emcraft #11717]

2008-01-17 Thread Gabriel Paubert
On Thu, Jan 17, 2008 at 05:48:10PM +0300, Sergei Poselenov wrote:
 Hello Andrew,
 
 Andrew Haley wrote:
 Sergei Poselenov writes:
   Hello Andrew,
   
Now, I sympathize that in your particular case you have a code size
regression.  This happens: when we do optimization in gcc, some code
bases will lose out.  All that we can promise is that we try not to
make it worse for most users.

What we can do is compare your code that has got much worse, and try
to figure out why.

   
   Would the generated asm listings be enough? Or should I send
   the preprocessed sources as well?
 
 Both.
 
 Rather than sending stuff, best to stick it on a web site if you can.
 
 
 Here it is:
 Preprocessed and assembler code generated by the GCC 4.2.2 ppc-linux
 cross-compiler:
 http://www.emcraft.com/codesize/gcc-4.2.2/interrupts.i
 http://www.emcraft.com/codesize/gcc-4.2.2/interrupts.s
 
 The same code built with gcc-4.0.0 cross-compiler:
 http://www.emcraft.com/codesize/gcc-4.0.0/interrupts.i
 http://www.emcraft.com/codesize/gcc-4.0.0/interrupts.s
 

The functions do not appear in the same order in both files, it's a
bit surprising! Anyway look for example at irq_install_handler:

- gcc-4.0 saves all registers using stmw r24,xx(r1) and restores them
with lmw r24,xx(r1) however this means that r29 is overwritten in 
the epilogue.

- gcc-4.2.2 saves and restores registers individually which
means that it takes 12 more instructions. There go 48 bytes.

This is especially visible in the epilogue (in the prologue
the saves are interspersed with other instructions).

In this case -ffixed-r29 hurts, but gcc4.2.2 looks more correct.

Regards,
Gabriel


Re: Git and GCC

2007-12-10 Thread Gabriel Paubert
On Fri, Dec 07, 2007 at 04:47:19PM -0800, Harvey Harrison wrote:
 Some interesting stats from the highly packed gcc repo.  The long chain
 lengths very quickly tail off.  Over 60% of the objects have a chain
 length of 20 or less.  If anyone wants the full list let me know.  I
 also have included a few other interesting points, the git default
 depth of 50, my initial guess of 100 and every 10% in the cumulative
 distribution from 60-100%.
 
 This shows the git default of 50 really isn't that bad, and after
 about 100 it really starts to get sparse.  

Do you have a way to know which files have the longest chains?

I have a suspicion that the ChangeLog* files are among them,
not only because they are, almost without exception, only modified
by prepending text to the previous version (and a fairly small amount
compared to the size of the file), and therefore the diff is simple
(a single hunk) so that the limit on chain depth is probably what
causes a new copy to be created. 

Besides that these files grow quite large and become some of the 
largest files in the tree, and at least one of them is changed 
for every commit. This leads again to many versions of fairly 
large files.

If this guess is right, this implies that most of the size gains
from longer chains comes from having fewer copies of the ChangeLog*
files. From a performance point of view, it is rather favourable
since the differences are simple. This would also explain why
the window parameter has little effect.

Regards,
Gabriel


[OT] Re: Git repository with full GCC history

2007-06-01 Thread Gabriel Paubert
On Fri, Jun 01, 2007 at 02:52:43AM -0400, Bernardo Innocenti wrote:
 Harvey Harrison wrote:
 
 Was this repo made with svnimport or git-svn? svnimport is faster but
 chooses bad delta bases as a result.  git repack -a -d -f would allow
 git to choose better deltas rather than reusing the deltas that
 svnimport created.
 
 I used:
 
 git-svn fetch
 git-fetch . remotes/git-svn
 
 
 Yes, I did a git-repack -a -d -f too.  And I even did
 one with --window=20, but nothing changed.
 
 
 (I think, I'm not a git expert).
 
 Neither am I, but after all, who is?  (Linus, you don't count)
 
 
 What version of git did you use? 1.5.0.6 here.
 
 1.5.2
 
 I shall try it...  That's probably it.
 

This may be the pack depth which was increased to 50 according
to 1.5.2 release notes:

  - The default pack depth has been increased to 50, as the
 recent addition of delta_base_cache makes deeper delta chains
 much less expensive to access.  Depending on the project, it was
 reported that this reduces the resulting pack file by 10%
 or so.

I'm almost certain that the savings will be much larger than 10% for 
some files, for example the ChangeLogs.

BTW, there is a strange line in the current ChangeLog, between May 30th
and May 31st entries:  .r125234. Is it just me, a subversion
glitch or something else?

Gabriel


Re: Git repository with full GCC history

2007-06-01 Thread Gabriel Paubert
On Fri, Jun 01, 2007 at 04:47:11AM -0400, Bernardo Innocenti wrote:
 Jan-Benedict Glaw wrote:
 On Thu, 2007-05-31 21:34:33 -0400, Bernardo Innocenti [EMAIL PROTECTED] 
 wrote:
 I've set up a Git mirror of the entire GCC history on
 server space kindly provided by David Woodhouse.
 
 You can clone it with:
 
 git-clone git://git.infradead.org/gcc.git
 
 How often will it be synced with upstream SVN?
 
 I've setup a cron job every hour, but I can increase the
 frequency if needed.  git-svn is not a cpu/bandwidth hog.
 
 While you're at it,
 would David mind to also place a binutils, glibc and glibc-ports GIT
 repo next to it?  That way, there would be a nice single point of GIT
 repos for the whole toolchain.
 
 For this, I'd prefer waiting for David's answer.  David,
 my guess is that all of these combined should be smaller
 than GCC alone.  There should be fewer users, too.
 
 
 Thanks for the work, I'll just clone it right now :)
 
 Be our guest, and let me know if you find a way to
 repack the repo to a smaller size.

I just upgraded my git to 1.5.2 and repacked the git repository
with git-gc --aggressive. It is quite impressive: the size of 
the pack file was almost cut in half, from ~23MB to ~12MB!

Gabriel


[OT] Re: Git repository with full GCC history

2007-06-01 Thread Gabriel Paubert
On Fri, Jun 01, 2007 at 11:00:29AM -0400, Bernardo Innocenti wrote:
 Gabriel Paubert wrote:
 
 I just upgraded my git to 1.5.2 and repacked the git repository
 with git-gc --aggressive. It is quite impressive: the size of 
 the pack file was almost cut in half, from ~23MB to ~12MB!
 
 The --aggressive option is undocumented in 1.5.2.  What
 is it supposed to do?
 

It is documented in my freshly compiled and installed git:

   --aggressive
  Usually git-gc runs very quickly while providing good disk space
  utilization and performance. This option will cause git-gc to more
  aggressively optimize the repository at the expense of taking much
  more time. The effects of this optimization are persistent, so this
  option only needs to be used sporadically; every few hundred changesets
  or so.

Regards,
Gabriel


Re: Signed into overflow behavior in the security context

2007-01-30 Thread Gabriel Paubert
On Tue, Jan 30, 2007 at 10:49:02AM -0500, Robert Dewar wrote:
 Paul Schlie wrote:
 
 - as trap representation within the context of C is a value
 representation which is not defined to be a member of a type, where if
 accessed or produced evokes undefined behavior; so admit as to the best of
 my knowledge all potentially enclosable values for IEEE floats and doubles
 are defined, it would seem trap representations don't exist in typical fp
 implementations, as such an implementation would require more bits of
 encoding than the type itself requires.
 
 You don't think a signalling NaN is a trap value? 

I do.

 The existence of
 such values (which do indeed cause a trap if loaded), 

If loaded? This is a very approximate description. 

For all architectures I use, it is rather if used as an operand 
of an arithmetic instruction, but the values can be copied 
around without ever generating a trap. Even negating or taking 
the absolute value never traps since those are not considered 
arithmetic instructions.

And even then you have to explicitly enable the trap for invalid
operation (on systems using IEEE754), otherwise it is simply 
propagated as a QNaN.

I remember that the VAX had separate instructions for moving floats 
and ints (of the same size) and the only difference between them was
that move floating point instruction would trap on the reserved
operand value (zero exponent and sign bit set). However compilers 
did not actually use the floating point move instructions (probably 
for performance reasons).

Gabriel


Re: [OT] char should be signed by default

2007-01-25 Thread Gabriel Paubert
On Thu, Jan 25, 2007 at 10:29:29AM +0100, Paolo Bonzini wrote:
 
 A given program is written in one or the other of these two dialects.
 The program stands a chance to work on most any machine if it is
 compiled with the proper dialect. It is unlikely to work at all if
 compiled with the wrong dialect.
 
 It depends on the program, and whether or not chars in the user's
 character set is sign extended (ie, in the USA, you likely won't notice
 a difference between the two if chars just hold character values).
 
 You might notice if a -1 (EOF) becomes a 255 and you get an infinite 
 loop in return (it did bite me).  Of course, this is a bug in that 
 outside the US a 255 character might become an EOF.

That's a common bug with getchar() and similar functions because people
put the result into a char before testing it, like:

char c;
while ((c=getchar())!=EOF) {
...
}

while the specification of getchar is that it returns an unsigned char 
cast to an int or EOF, and therefore this code is incorrect independently 
of whether char is signed or not:
- infinite loop when char is unsigned
- incomplete processing of a file because of early detection of EOF 
  when char is signed and you hit a 0xFF char.

I've been bitten by both (although the second one is less frequent now
since 0xff is invalid in UTF-8).
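The correct pattern keeps the result in an int, so that EOF (-1) stays distinguishable from the byte 0xFF; a sketch using fgetc on an arbitrary stream:

```c
#include <stdio.h>

/* count the bytes of a stream; c must be an int, not a char,
   otherwise a 0xFF byte is mistaken for EOF (signed char) or the
   loop never terminates (unsigned char) */
static long count_bytes(FILE *f)
{
    int c;
    long n = 0;
    while ((c = fgetc(f)) != EOF)
        n++;
    return n;
}
```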

BTW, I'm of the very strong opinion that char should have been unsigned
by default because the name itself implies that it is used as a 
enumeration of symbols, specialized to represent text. When you step
from one enum value to the following one (staying within the range of
valid values), you don't expect the new value to become lower than the 
preceding one.

Things would be very different if it had been called "byte" or 
"short short int" instead.

Gabriel


Re: Miscompilation of remainder expressions

2007-01-17 Thread Gabriel Paubert
On Wed, Jan 17, 2007 at 12:43:40AM +0100, Vincent Lefevre wrote:
 On 2007-01-16 21:27:42 +, Andrew Haley wrote:
  Ian Lance Taylor writes:
I suspect that the best fix, in the sense of generating the best
code, would be to do this at the tree level.  That will give loop
and VRP optimizations the best chance to eliminate the test for -1.
Doing it during gimplification would be easy, if perhaps rather
ugly.  If there are indeed several processors with this oddity,
then it would even make a certain degree of sense as a
target-independent option.
  
  x86, x86-64, S/390, as far as I'm aware.
 
 and PowerPC G4 and G5, where I don't get a crash, but an incorrect
 result (as said on PR#30484).
 

On PPC, the solution is to use divo. [1] followed by an unlikely 
conditional branch to out of line code to handle the corner cases.

The question is: what do we do in the case of a divide by zero on PPC?
Are there other architectures that do not trap?

Gabriel

[1] sadly gcc does not know about the overflow flag and (unless it has
improved greatly since the last time I checked on a small ADA program)
generates bloated and slow code when checking for overflow. This is not 
specific to the rs6000 backend.


Re: Miscompilation of remainder expressions

2007-01-17 Thread Gabriel Paubert
On Wed, Jan 17, 2007 at 11:17:36AM -0800, Ian Lance Taylor wrote:
 Joe Buck [EMAIL PROTECTED] writes:
 
  On Wed, Jan 17, 2007 at 05:48:34PM +, Andrew Haley wrote:
   From a performance/convenience angle, the best place to handle this is
   either libc or the kernel.  Either of these can quite easily fix up
   the operands when a trap happens, with zero performance degradation of
   existing code.  I don't think there's any need for gcc to be altered
   to handle this.
  
  How will the kernel know whether the overflow in the divide instruction
  is because the user's source code has a '%' and not a '/'?  We generate
  the exact same instruction for i / minus_one(), after all, and in that
  case the trap really should be there.
 
 We don't need to generate a trap for INT_MIN / -1.  That is undefined
 signed overflow.  We can legitimately set the quotient register to
 INT_MIN while setting the remainder register to zero.  (Hmmm, what
 should we do if -ftrapv is set?  Probably generate a different code
 sequence in the compiler.)
 
 We do want to generate a trap for x / 0, of course.
 

Then you have to fix the code generation for PPC, which never traps.
All (?) 3-register arithmetic instructions have the option to
set an overflow flag that you can check later.
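A C sketch of the semantics being proposed (my names; assumes two's complement wrap-around, obtained via unsigned arithmetic):

```c
#include <limits.h>

/* division that never traps on INT_MIN / -1: that quotient is defined
   as the wrapped value INT_MIN (with remainder 0), while division by
   zero is left alone so it can still trap where the hardware traps */
static int div_wrap(int a, int b)
{
    if (b == -1)
        return (int)(0u - (unsigned)a);   /* -a, wrapping at INT_MIN */
    return a / b;
}
```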

Gabriel


Re: Miscompilation of remainder expressions

2007-01-17 Thread Gabriel Paubert
On Wed, Jan 17, 2007 at 04:15:08PM -0800, Ian Lance Taylor wrote:
 Robert Dewar [EMAIL PROTECTED] writes:
 
  Ian Lance Taylor wrote:
  
   We do want to generate a trap for x / 0, of course.
  
  Really? Is this really defined to generate a trap in C?
  I would be surprised if so ...
 
 As far as I know, but I think it would be a surprising change for x /
 0 to silently continue executing.
 

That's exactly what happens on PPC.

 But perhaps not a very important one.

Indeed.

Gabriel


Re: Miscompilation of remainder expressions

2007-01-15 Thread Gabriel Paubert
On Mon, Jan 15, 2007 at 10:34:23PM +0200, Michael Veksler wrote:
 Roberto Bagnara wrote:
 
 Reading the thread Autoconf manual's coverage of signed integer
 overflow  portability I was horrified to discover about GCC's
 miscompilation of the remainder expression that causes INT_MIN % -1
 to cause a SIGFPE on CPUs of the i386 family.  Are there plans to
 fix this bug (which, to me, looks quite serious)?
 All the best,
 
 This problem is quite rare in practice (otherwise there would be
 much more complaining). As such it may be too expensive,
 performance-wise, to fix in GCC. It seems as one of those
 classical things that can be worked-around in the kernel.
 
 Once the kernel sees the FP trap (whatever its i368 name is),
 it decodes the machine code and finds:
 idivl  (%ecx).
 As far as I remember, this will put the result in two registers
 one for div_res and one for mod_res.
 
 Since MIN_INT/-1 is undefined, the kernel may put MIN_INT
 in div_res, and mod_res=1. Then return to the following instruction.
 
 Should I open a request for the kernel?

No, because the instruction has actually two result values:

- the remainder, which you could safely set to zero (not 1!)

- the quotient, which is affected by the overflow and there may be
  compiler and languages that rely on the exception being generated.

The kernel cannot know whether you are going to use
the quotient or not by simply decoding the instruction.

Actually I believe that a%b and a%(-b) always return the same value
if we follow the C99 specification. So if you are only interested
in the remainder, you can simply use a%abs(b) instead of a%b. The
overhead of abs() is really small.  
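In C99's truncated division the remainder takes the sign of the dividend, so the trick can be sketched as below (leaving aside b == INT_MIN, where abs() itself overflows):

```c
#include <limits.h>
#include <stdlib.h>

/* remainder with the divisor forced positive: a % b == a % -b under
   C99, and the trapping INT_MIN % -1 case becomes the harmless
   INT_MIN % 1, which is 0 */
static int rem_abs(int a, int b)
{
    return a % abs(b);
}
```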

Gabriel


Re: optimizing calling conventions for function returns

2006-05-23 Thread Gabriel Paubert
On Tue, May 23, 2006 at 11:21:46AM -0400, Jon Smirl wrote:
 Has work been done to evaluate a calling convention that takes error
 checks like this into account? Are there size/performance wins? Or am
 I just reinventing a variation on exception handling?

It's fairly close to Fortran alternate return labels, which 
were standard in Fortran 77 but have been declared obsolescent
in later revisions of the standard.

Regards,
Gabriel


Re: Qemu and GCC-3.4 on powerpc

2006-04-10 Thread Gabriel Paubert
On Sun, Apr 09, 2006 at 02:45:04PM +0200, Dieter Schuster wrote:
 Hi there!
 
 On Fri, 31 March 2006, Alan Modra wrote:
  On Tue, Mar 28, 2006 at 12:00:47PM +0200, Gabriel Paubert wrote:
   On Tue, Mar 28, 2006 at 12:56:13AM +0200, Dieter Schuster wrote:
If I try to compile qemu with GCC 3.4 without the patch I get the 
following error:

qemu-0.8.0/linux-user/elfload.c: In function `load_elf_binary':
qemu-0.8.0/cpu-all.h:253: error: inconsistent operand constraints in an 
`asm'
qemu-0.8.0/cpu-all.h:253: error: inconsistent operand constraints in an 
`asm'
   
   Weird. CC'ed to gcc list despite the fact that the 3.4 branch
   is definitely closed. I've not found anything remotely similar
   from bugzilla.
   

 But if I copy the function stl_le_p to a separate file, the function
will compile with GCC 3.4. 
  
  Check preprocessor output.  My guess is that you have some unexpected
  substitution.
  
 
 I had now more time, to investigate the error. It seems to be a
 optimization problem. With -O2 -fno-gcse the error disappeared. I have
 made a bug report to gcc. 

Yes, but you sent it to gnats-gcc, which is completely obsolete. You
should use bugzilla. Now the testcase looks really minimal:

http://lists.debian.org/debian-gcc/2006/04/msg00135.html

even if the code looks strange (redundant cast, if without 
braces which confused me at first). This may be the result
of normal macro expansion.

However, I believe that the most serious problem is that you 
use uninitialized local variables (which makes the code invalid). 
I have changed the testcase to the attached file by declaring 
these variables extern (the code is also reformatted to be more 
readable).

Testing here with recent Debian versions of gcc-3.4, 4.0 and 4.1
show that I can' trigger any problem with 4.0 and 4.1, and that 
for gcc-3.4:
- -O1 always works (regardless of -fgcse), which is surprising 
  since I have exactly the same compiler as you.
- -O2 -fno-gcse works
- -O2 (implies -fgcse) fails

In the failure case, the generated code (attached) does not make 
sense: one stwbrx disappears and is mysteriously replaced by an 
stw, the source of which is a register which has never been 
initialized (r8), while I don't see any uninitialized variable
in the source after the small changes I've made.

Regards,
Gabriel
static inline void stl(void *ptr, int v) {
__asm__ __volatile__ ("stwbrx %1,0,%2" : "=m" (*(unsigned long *)ptr) : "r" (v), "r" (ptr));
}

int main () {
  extern unsigned long *sp, *u_platform;
  extern char *k_platform;

  stl(sp, (unsigned long)(0)); 
  stl(sp+1, (unsigned long)(0));

  if (k_platform)
stl(sp, (unsigned long)(15)); 

  stl(sp+1, (unsigned long)((unsigned long) u_platform));
  return 0;
}
.file   "bug.i"
.section .text
.align 2
.globl main
.type   main, @function
main:
lis 11,[EMAIL PROTECTED]
lis 9,[EMAIL PROTECTED]
lwz 0,[EMAIL PROTECTED](11)
lwz 10,[EMAIL PROTECTED](9)
li 9,15
cmpwi 7,0,0
beq- 7,.L7
#APP
stwbrx 9,0,10
#NO_APP
lis 9,[EMAIL PROTECTED]
addi 11,10,4
lwz 0,[EMAIL PROTECTED](9)
#APP
stwbrx 0,0,11
#NO_APP
li 3,0
blr
.L7:
lis 9,[EMAIL PROTECTED]
stw 8,0(10)
addi 11,10,4
lwz 0,[EMAIL PROTECTED](9)
#APP
stwbrx 0,0,11
#NO_APP
li 3,0
blr
.size   main,.-main
.section .note.GNU-stack,"",@progbits
.ident  "GCC: (GNU) 3.4.6 (Debian 3.4.6-1)"


Re: Qemu and GCC-3.4 on powerpc

2006-03-28 Thread Gabriel Paubert
On Tue, Mar 28, 2006 at 12:56:13AM +0200, Dieter Schuster wrote:
 Hello,
 
 the version 0.8.0 of qemu in the Debian-pool will not compile on
 PowerPC with GCC 3.4. The following patch will fix it:

And it sucks performance-wise, with exploding code size. Not to
mention potential atomicity issues (although all architectures 
using the byte-per-byte generic code may hit the problem).

 
 --- cpu-all.h~2005-12-19 23:51:53.0 +0100
 +++ cpu-all.h 2006-03-27 22:47:54.291613000 +0200
 @@ -249,15 +249,11 @@
  
  static inline void stl_le_p(void *ptr, int v)
  {
 -#ifdef __powerpc__
 -__asm__ __volatile__ ("stwbrx %1,0,%2" : "=m" (*(uint32_t *)ptr) : "r" (v), "r" (ptr));
 -#else
  uint8_t *p = ptr;
  p[0] = v;
   p[1] = v >> 8;
   p[2] = v >> 16;
   p[3] = v >> 24;
 -#endif
  }
  
  static inline void stq_le_p(void *ptr, uint64_t v)
 
 
 If I use GCC 3.3, then qemu compiles with the assembler instruction in
 the patch above, but qemu does not work correctly (tested with Knoppix V5.0).

Interesting, could it be an aliasing problem?

Try to compile with -fno-strict-aliasing, although 
I doubt it will change anything. 

 
 If I try to compile qemu with GCC 3.4 without the patch I get the following 
 error:
 
 qemu-0.8.0/linux-user/elfload.c: In function `load_elf_binary':
 qemu-0.8.0/cpu-all.h:253: error: inconsistent operand constraints in an `asm'
 qemu-0.8.0/cpu-all.h:253: error: inconsistent operand constraints in an `asm'

Weird. CC'ed to gcc list despite the fact that the 3.4 branch
is definitely closed. I've not found anything remotely similar
from bugzilla.

 
 But if I copy the function stl_le_p to a separate file, the function
 will compile with GCC 3.4. 

Which gcc-3.4 (gcc -v)? 

 
 Is this a bug in qemu, or is it a bug in GCC 3.4?

It looks like a compiler bug. But you should
provide an environment independent test case,
i.e., the preprocessed source. Please try
also to provide a minimal failing test case 
(it may be hard given the symptoms).

Regards,
Gabriel


Re: Request for 48 hours of just regression/bug fixes

2006-01-22 Thread Gabriel Paubert

On Sat, Jan 21, 2006 at 07:03:27PM -0800, Mark Mitchell wrote:
 Andrew Pinski wrote:
  I noticed today that there were three projects which were merged into
  the mainline within a 24 hour period yesterday.
  
  Date: Thu, 19 Jan 2006 01:42:49 -  IAB  - Daniel Berlin
  Date: Thu, 19 Jan 2006 10:24:04 -  Vect - Dorit
  Date: Thu, 19 Jan 2006 16:55:54 -  GOMP - Diego Novillo
  
  So I am requesting that we go through a 48 hour period starting Monday
  (as the weekends are usually quiet for patch committing) for a stage 3
  type regression only/bug fixes.
 
 I'm inclined to agree.  Any objections?

Monday in Hawaii is basically Tuesday in New Zealand
(23 hours difference right now), so you should make
clear in which timezone you define the 48 hour period.

Apart from this I have nothing to say about the convenience
of such a freeze, I'm mostly a lurker who regularly bootstraps 
GCC (once a week or so) but have not had any problem in a long 
time compiling my own rather simple code.

I'd just like to take the opportunity to thank all the 
people who make this great compiler available.

Regards,
Gabriel


Re: 4.2 Project: @file support

2005-08-25 Thread Gabriel Paubert
On Thu, Aug 25, 2005 at 06:09:25PM +0200, Florian Weimer wrote:
 * Andi Kleen:
 
  Linux has a similar limit which comes from the OS (normally around 32k) 
  So it would be useful there for extreme cases too.
 
 IIRC, FreeBSD has a rather low limit, too.  And there were discussions
 about command line length problems in the GCC build process on VMS.

For the record, the @file technique is a pretty standard way of
providing long command lines under VMS since the late seventies
or so. Many of our link scripts use it.

However, I suspect that the @file parsing only happens for
commands known to DCL, not for foreign commands.
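(For illustration, a minimal sketch of what @file expansion does: tokens
read from the response file are appended to the argument vector.  The
function name and limits below are invented; a real implementation, such
as libiberty's expandargv used by GCC, also handles quoting and nested
@files, which this sketch does not.)

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Append whitespace-separated tokens from a response file to an
 * argument array.  Returns the new argument count, or -1 if the
 * file cannot be opened. */
static int append_args_from_file(const char *path, char **args,
                                 int count, int max)
{
    FILE *f = fopen(path, "r");
    char token[1024];

    if (f == NULL)
        return -1;
    while (count < max && fscanf(f, "%1023s", token) == 1) {
        char *copy = malloc(strlen(token) + 1);  /* caller owns the copies */
        strcpy(copy, token);
        args[count++] = copy;
    }
    fclose(f);
    return count;
}
```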

Gabriel


Re: signed is undefined and has been since 1992 (in GCC)

2005-06-28 Thread Gabriel Paubert
On Tue, Jun 28, 2005 at 02:32:04PM +0200, Gabriel Dos Reis wrote:
 Robert Dewar [EMAIL PROTECTED] writes:
 
 | Gabriel Dos Reis wrote:
 | 
 |  The issue here is whether if the hardware consistently display a
 |  semantics, GCC should not allow access to that consistent semantics
 |  under the name that the standard says it is undefined behaviour.
 |  Consider the case of converting a void* to a F*, where F is a function
 |  type.
 | 
 | Well the hardware consistently displaying a semantics is not so
 | cut and dried as you think (consider the loop instruction and other
 | arithmetic on the x86 for instance in the context of generating code
 | for loops).
 
 Please do remember that this is hardware dependent.  If you have
 problems with x86, it does not mean you have the same with a PPC or a
 Sparc. 

For the matter, PPC also has undefined behaviour for integer divides
of 0x80000000 by -1 (according to the architecture specification).
I just checked on a 400MHz PPC750, and the result register ends
up containing -1. 

A side effect is that (INT_MIN % -1) is INT_MAX, which is really 
surprising. I believe that it is reasonable to expect that the 
absolute value of x%y is less than the absolute value of y; it 
might even be required by some language standard.

On x86, the same operation results in a divide by zero exception
(vector 0) and a signal under most (all?) operating systems
(SIGFPE under Linux).

Now in practice, what would be the cost of checking that the divisor
is -1 and taking an alternate path that computes the correct
results (in modulo arithmetic) for this case?

I can see a moderate code size impact, something like 4 or 5 machine
instructions per integer division, not really a performance impact
since on one branch you would have a divide instruction which takes
many clock cycles.

Regards,
Gabriel


Re: Bug related to floating-point traps?

2005-06-15 Thread Gabriel Paubert
On Wed, Jun 15, 2005 at 03:14:59PM +0200, Vincent Lefevre wrote:
 I don't know if this is a bug in gcc or the glibc... Consider the
 following program traps1:
 
 #define _GNU_SOURCE
 #include <stdio.h>
 #include <stdlib.h>
 #include <float.h>
 #include <fenv.h>
 
 int main (int argc, char *argv[])
 {
   volatile long double x, y = 0.0;
 
   if (argc != 3)
 {
   fprintf (stderr, "Usage: exception double flag\n");
   exit (1);
 }
 
   if (fesetround (FE_DOWNWARD))
 {
   fprintf (stderr, "Can't set rounding mode to FE_DOWNWARD\n");
   exit (1);
 }
 
   x = atof (argv[1]);
   x *= LDBL_MAX;
   printf ("x = %Lg\n", x);
   feenableexcept (FE_OVERFLOW);
   if (atoi (argv[2]))
 y += 0.0;
   return 0;
 }
 
 
 I get the following results on X86 and PowerPC processors with gcc 4.0
 (Debian):
   * x86, traps1 2 0  -> x = 1.18973e+4932
   * x86, traps1 2 1  -> ditto with floating point exception signal
   * ppc, both cases  -> x = 1.79769e+308 with FPE signal
 
 I don't think one should get floating-point exception signals, and in
 any case, the results between both processors seem to be inconsistent.

The exception flags are sticky. So you get the exception because
operations before feenableexcept set the overflow flag.

The difference is that the PPC triggers the exception as soon
as you unmask it, while the x86 only checks for
pending exceptions at the start of the next arithmetic FP instruction.

That's the dreadful x87 delayed exception mechanism, which 
would require compilers to add fwait to handle exceptions 
in the context in which they occur and not some time later.
For example, if the last FP instruction before the return
from a function produced an exception, the exception will be
taken when the caller tries to use the result.

On x86, SSE exception handling is much saner.

Gabriel