[Bug c++/109963] ABI: unexpected layout ordering of `this` pointer in lambda capture

2023-09-26 Thread justin.lebar+bug at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109963

Justin Lebar  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |justin.lebar+bug at gmail dot com

--- Comment #2 from Justin Lebar  ---
I rediscovered this in https://github.com/openai/triton/issues/2398 (along with
Mehdi, coincidentally).

I'm really not sure how we fix this, though.  If GCC changes its ABI, then
new GCC binaries will not be compatible with old ones.  OTOH if neither compiler
changes its ABI, then we can't pass lambdas between binaries created by the
different compilers.

[Bug target/43052] Inline memcmp is *much* slower than glibc's

2011-07-04 Thread justin.lebar+bug at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

--- Comment #13 from Justin Lebar justin.lebar+bug at gmail dot com 
2011-07-04 14:40:40 UTC ---
(In reply to comment #12)
> Created attachment 24670 [details]
> memcpy/memset testing script
>
> HJ,
> can you please run the attached script with new glibc as
> sh test_stringop 64 64000 gcc -march=native | tee out

Do you think you could spin a script which also tests memcmp?


[Bug target/43052] Inline memcmp is *much* slower than glibc's

2011-07-04 Thread justin.lebar+bug at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

--- Comment #14 from Justin Lebar justin.lebar+bug at gmail dot com 
2011-07-04 15:00:36 UTC ---
Created attachment 24676
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24676
Test results from my Core i7


[Bug target/43052] Inline memcmp is *much* slower than glibc's

2011-06-13 Thread justin.lebar+bug at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

Justin Lebar justin.lebar+bug at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |justin.lebar+bug at gmail dot com

--- Comment #8 from Justin Lebar justin.lebar+bug at gmail dot com 2011-06-13 
18:09:07 UTC ---
I just did some tests, and on my machine, glibc's memcmp is faster even when
the size of the thing we're comparing is 4 bytes.  I can't point to a case that
this optimization speeds up on my machine.
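
A minimal sketch of how such a 4-byte comparison could be timed, not the test that was actually run. The helper name `time_memcmp4`, its parameters, and the volatile function pointer `library_memcmp` are assumptions made for illustration. The pointer forces a real library call, while the direct call leaves the compiler free to expand memcmp inline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Calling through a volatile function pointer stops the compiler from
   expanding the call inline, so this path always reaches the C library. */
static int (*volatile library_memcmp)(const void *, const void *, size_t) = memcmp;

/* Hypothetical helper: run `iters` 4-byte comparisons and return the CPU
   time used, in seconds.  `a` and `b` must each hold at least 20 bytes.
   The sum of all comparison results is stored in *sum so the comparisons
   are not discarded as dead code. */
double time_memcmp4(const unsigned char *a, const unsigned char *b,
                    long iters, int use_library, long *sum)
{
    long total = 0;
    clock_t start, end;

    assert(a != NULL && b != NULL && sum != NULL);

    start = clock();
    for (long i = 0; i < iters; i++) {
        size_t off = (size_t)i & 15;  /* vary the address between iterations */
        if (use_library)
            total += library_memcmp(a + off, b + off, 4);
        else
            total += memcmp(a + off, b + off, 4);  /* may be expanded inline */
    }
    end = clock();

    *sum = total;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling the helper once with `use_library` set to 0 and once with 1, on the same buffers and with a large iteration count, gives a rough comparison. `clock()` is coarse, so short runs are not meaningful.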


[Bug target/43052] Inline memcmp is *much* slower than glibc's

2011-06-13 Thread justin.lebar+bug at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

--- Comment #10 from Justin Lebar justin.lebar+bug at gmail dot com 
2011-06-13 18:18:13 UTC ---
Can I force gcc not to use its inlined version?


[Bug target/46357] Unnecessary movzx instruction

2010-11-08 Thread justin.lebar+bug at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46357

--- Comment #2 from Justin Lebar justin.lebar+bug at gmail dot com 2010-11-08 
17:08:36 UTC ---
(In reply to comment #1)
> We always use zero/sign-extending moves to avoid partial register stalls.

Sure, but the whole instruction on line 10 is unnecessary.


[Bug c/46357] New: Unnecessary movzx instruction

2010-11-07 Thread justin.lebar+bug at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46357

   Summary: Unnecessary movzx instruction
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: justin.lebar+...@gmail.com


Originally reported to the gcc-help list.

Tested with gcc Ubuntu/Linaro 4.5.1-7ubuntu2, but I get the same code with gcc
4.4.

The following C code generates assembly code with what appears to be an
unnecessary movzx instruction:

  char skip[] = { /* ... */ };

  int foo(const unsigned char *str, int len)
  {
    int result = 0;
    int i = 7;

    while (i < len) {
      if (str[i] == '_' && str[i-1] == 'D') {
        result |= 2;
      }
      i += skip[str[i]];
    }

    return result;
  }

 foo:
   0:   31 c0                   xor    eax,eax
   2:   83 fe 07                cmp    esi,0x7
   5:   ba 07 00 00 00          mov    edx,0x7
   a:   7f 14                   jg     20 <foo+0x20>
   c:   eb 32                   jmp    40 <foo+0x40>
   e:   66 90                   xchg   ax,ax

// Beginning of loop

  10:   0f b6 c9                movzx  ecx,cl
  13:   0f be 89 00 00 00 00    movsx  ecx,BYTE PTR [rcx+0x0]
  1a:   01 ca                   add    edx,ecx
  1c:   39 d6                   cmp    esi,edx
  1e:   7e 20                   jle    40 <foo+0x40>
  20:   4c 63 c2                movsxd r8,edx
  23:   42 0f b6 0c 07          movzx  ecx,BYTE PTR [rdi+r8*1]
  28:   80 f9 5f                cmp    cl,0x5f
  2b:   75 e3                   jne    10 <foo+0x10>

// Likely end of loop (i.e. branch above is likely taken)

  2d:   41 89 c1                mov    r9d,eax
  30:   41 83 c9 02             or     r9d,0x2
  34:   41 80 7c 38 ff 44       cmp    BYTE PTR [r8+rdi*1-0x1],0x44
  3a:   41 0f 44 c1             cmove  eax,r9d
  3e:   eb d0                   jmp    10 <foo+0x10>
  40:   f3 c3                   repz ret


The movzx on line 10 sets everything except the least-significant byte of ecx to
zero.  This is unnecessary since line 23 dominates line 10, so we're guaranteed
that ecx contains zeros everywhere except in its least-significant byte by the
time we get to line 10.

If I change |str| in the C code to a signed char, then line 10 becomes movsx
(now a necessary instruction). Perhaps this gives a hint as to where the errant
instruction is coming from.
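
For anyone trying to reproduce this, here is a self-contained sketch of the test case. The contents of the skip table are elided in the report, so the 256-entry table and the `init_skip` helper below are hypothetical stand-ins (every byte advances the index by 1). The generated code may differ from the listing above because of that. The optimization level used for the listing is not stated; something like `gcc -O2 -S -masm=intel` is an assumption.

```c
/* Hypothetical stand-in for the elided table in the report: every byte
   value advances the index by 1.  The real table's contents are unknown. */
char skip[256];

/* Fill the stand-in table.  Must be called before foo(), otherwise the
   loop in foo() never advances. */
void init_skip(void)
{
    for (int c = 0; c < 256; c++)
        skip[c] = 1;
}

/* Same function as in the report. */
int foo(const unsigned char *str, int len)
{
    int result = 0;
    int i = 7;

    while (i < len) {
        if (str[i] == '_' && str[i-1] == 'D') {
            result |= 2;
        }
        i += skip[str[i]];
    }

    return result;
}
```

Changing the parameter to `const signed char *` would require casting the index in `skip[str[i]]` to keep it in range, so this sketch only covers the unsigned variant.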


[Bug web/46031] New: Atomic Builtins page should indicate that 16-byte compare-and-swap is available with -mcex16

2010-10-14 Thread justin.lebar+bug at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46031

   Summary: Atomic Builtins page should indicate that 16-byte
compare-and-swap is available with -mcex16
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: web
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: justin.lebar+...@gmail.com


The atomic builtins page doesn't indicate that gcc supports 16-byte
compare-and-swap instructions.  But gcc does in fact support this instruction,
so the page should indicate that the functionality can be enabled with -mcx16
(or, presumably, with the appropriate -march flag).

http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html


[Bug web/46031] Atomic Builtins page should indicate that 16-byte compare-and-swap is available with -mcex16

2010-10-14 Thread justin.lebar+bug at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46031

--- Comment #2 from Justin Lebar justin.lebar+bug at gmail dot com 2010-10-15 
05:30:20 UTC ---
"Not all operations are supported by all target processors" isn't the same as
"not all operations supported by target processors are listed here."

If you want to be vague and say "some target processors support other
operations or other operand sizes, not listed here," I guess that would be an
improvement.  But from a user's perspective, I'd like the manual to tell me
which builtins GCC supports, and under which circumstances those operations are
available.  In this case, the manual's omission of 16-byte compare-and-swap
suggested to me that gcc didn't support it at all.


[Bug web/46031] Atomic Builtins page should indicate that 16-byte compare-and-swap is available with -mcex16

2010-10-14 Thread justin.lebar+bug at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46031

--- Comment #4 from Justin Lebar justin.lebar+bug at gmail dot com 2010-10-15 
05:54:28 UTC ---
> Actually it says the target processors might not include all of the builtins.

Maybe I'm not making sense.  My point is that there's a builtin (*) that is
supported by some target processors but is not listed in the manual.  So the
statement that some target processors might not include all the builtins isn't
helpful.

* The builtin is compare_and_swap with 16-byte operands.  The manual explicitly
says "GCC will allow any integral scalar or pointer type that is 1, 2, 4 or 8
bytes in length", but it should indicate that 16-byte operands are supported
for compare_and_swap, at least under some circumstances.
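
A minimal sketch of the 16-byte case, assuming an x86-64 target built with -mcx16. The names `cas16`, `u128` and `HAVE_CAS16` are made up for this example. The guard relies on GCC's predefined macro `__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16`, so the file still compiles (without the function) when 16-byte compare-and-swap is not enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16) && defined(__SIZEOF_INT128__)
#define HAVE_CAS16 1
typedef unsigned __int128 u128;

/* Atomically replace *p with `desired` if *p equals `expected`.
   Returns true when the swap happened. */
bool cas16(u128 *p, u128 expected, u128 desired)
{
    /* cmpxchg16b needs a 16-byte-aligned memory operand. */
    assert(((uintptr_t)p & 15u) == 0);
    return __sync_bool_compare_and_swap(p, expected, desired);
}
#else
#define HAVE_CAS16 0  /* built without 16-byte compare-and-swap support */
#endif
```

Callers can test `HAVE_CAS16` at compile time and fall back to a lock when it is 0.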