[Bug c++/109963] ABI: unexpected layout ordering of `this` pointer in lambda capture
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109963

Justin Lebar changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |justin.lebar+bug at gmail dot com

--- Comment #2 from Justin Lebar ---
I rediscovered this in https://github.com/openai/triton/issues/2398 (along with Mehdi, coincidentally).

I'm really not sure how we fix this, though. If gcc changes their ABI, then new GCC binaries will not be compatible with old ones. OTOH if nobody changes their ABI, then we can't pass lambdas between binaries created by the different compilers.
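For readers unfamiliar with the problem, here is a minimal sketch of the kind of closure under discussion. The type Widget and the functions make_scaler and demo are hypothetical names made up for illustration; they do not come from the bug report or from Triton. The lambda captures both `this` and a local by copy, so its closure object holds two data members. The C++ standard does not fix the order of those members, which is why two compilers can lay the same closure out differently and still both be conforming.

```cpp
// Hypothetical example type; not taken from the bug report.
struct Widget {
    int scale = 3;

    // The closure stores the `this` pointer and a copy of `offset`.
    // The order of those two members inside the closure object is not
    // specified by the C++ standard, so it is an ABI decision.
    auto make_scaler(long offset) const {
        return [this, offset](long v) { return v * scale + offset; };
    }
};

// Builds the closure and calls it within one translation unit, which is
// always safe because a single compiler chose the layout.
long demo() {
    Widget w;
    auto f = w.make_scaler(4);
    return f(10);  // 10 * 3 + 4
}
```

Within one binary this is harmless. The incompatibility only matters when a closure object created by code from one compiler is read by code from another, for example across a header-only API boundary.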
[Bug target/43052] Inline memcmp is *much* slower than glibc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

--- Comment #13 from Justin Lebar justin.lebar+bug at gmail dot com 2011-07-04 14:40:40 UTC ---
(In reply to comment #12)
> Created attachment 24670
> memcpy/memset testing script
>
> HJ, can you please run the attached script with new glibc as
>   sh test_stringop 64 64000 gcc -march=native | tee out

Do you think you could spin a script which also tests memcmp?
[Bug target/43052] Inline memcmp is *much* slower than glibc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052 --- Comment #14 from Justin Lebar justin.lebar+bug at gmail dot com 2011-07-04 15:00:36 UTC --- Created attachment 24676 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=24676 Test results from my core i7
[Bug target/43052] Inline memcmp is *much* slower than glibc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

Justin Lebar justin.lebar+bug at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |justin.lebar+bug at gmail dot com

--- Comment #8 from Justin Lebar justin.lebar+bug at gmail dot com 2011-06-13 18:09:07 UTC ---
I just did some tests, and on my machine, glibc's memcmp is faster even when the size of the thing we're comparing is 4 bytes. I can't point to a case that this optimization speeds up on my machine.
[Bug target/43052] Inline memcmp is *much* slower than glibc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052 --- Comment #10 from Justin Lebar justin.lebar+bug at gmail dot com 2011-06-13 18:18:13 UTC --- Can I force gcc not to use its inlined version?
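One commonly used way to do this, sketched here as an assumption rather than as the answer given in the bug, is to stop the compiler from recognizing the call at all. GCC documents the option form -fno-builtin-function, so compiling with -fno-builtin-memcmp should have that effect for a whole translation unit. A source-level alternative is to call memcmp through a volatile function pointer, as below. The names memcmp_fn, library_memcmp and equal4 are hypothetical.

```cpp
#include <cstddef>
#include <cstring>

// Pointer type matching the standard memcmp signature.
using memcmp_fn = int (*)(const void*, const void*, std::size_t);

// Because the pointer is volatile, the compiler has to load it at run time
// and cannot assume it still points at memcmp, so it cannot substitute
// its own inline expansion for the call.
static memcmp_fn volatile library_memcmp = std::memcmp;

// Compares two 4-byte buffers using the C library's memcmp.
bool equal4(const unsigned char* a, const unsigned char* b) {
    return library_memcmp(a, b, 4) == 0;
}
```

The indirect call adds a little overhead of its own, so for benchmarking the inline expansion against glibc the command-line option is probably the cleaner comparison.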
[Bug target/46357] Unnecessary movzx instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46357

--- Comment #2 from Justin Lebar justin.lebar+bug at gmail dot com 2010-11-08 17:08:36 UTC ---
(In reply to comment #1)
> We always use zero/sign-extending moves to avoid partial register stalls.

Sure, but the whole instruction on line 10 is unnecessary.
[Bug c/46357] New: Unnecessary movzx instruction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46357

           Summary: Unnecessary movzx instruction
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: justin.lebar+...@gmail.com

Originally reported to the gcc-help list. Tested with gcc Ubuntu/Linaro 4.5.1-7ubuntu2, but I get the same code with gcc 4.4.

The following C code generates assembly code with what appears to be an unnecessary movzx instruction:

    char skip[] = { /* ... */ };

    int foo(const unsigned char *str, int len)
    {
      int result = 0;
      int i = 7;
      while (i < len) {
        if (str[i] == '_' && str[i-1] == 'D') {
          result |= 2;
        }
        i += skip[str[i]];
      }
      return result;
    }

    foo:
       0:  31 c0                  xor    eax,eax
       2:  83 fe 07               cmp    esi,0x7
       5:  ba 07 00 00 00         mov    edx,0x7
       a:  7f 14                  jg     20 <foo+0x20>
       c:  eb 32                  jmp    40 <foo+0x40>
       e:  66 90                  xchg   ax,ax
      // Beginning of loop
      10:  0f b6 c9               movzx  ecx,cl
      13:  0f be 89 00 00 00 00   movsx  ecx,BYTE PTR [rcx+0x0]
      1a:  01 ca                  add    edx,ecx
      1c:  39 d6                  cmp    esi,edx
      1e:  7e 20                  jle    40 <foo+0x40>
      20:  4c 63 c2               movsxd r8,edx
      23:  42 0f b6 0c 07         movzx  ecx,BYTE PTR [rdi+r8*1]
      28:  80 f9 5f               cmp    cl,0x5f
      2b:  75 e3                  jne    10 <foo+0x10>
      // Likely end of loop (i.e. branch above is likely taken)
      2d:  41 89 c1               mov    r9d,eax
      30:  41 83 c9 02            or     r9d,0x2
      34:  41 80 7c 38 ff 44      cmp    BYTE PTR [r8+rdi*1-0x1],0x44
      3a:  41 0f 44 c1            cmove  eax,r9d
      3e:  eb d0                  jmp    10 <foo+0x10>
      40:  f3 c3                  repz ret

(The "line" numbers below are the hexadecimal offsets in the left-hand column.)

The movzx on line 10 sets everything except the least-significant byte of ecx to zero. This is unnecessary since line 23 dominates line 10, so we're guaranteed that ecx contains zeros everywhere except in its least-significant byte by the time we get to line 10.

If I change |str| in the C code to a signed char, then line 10 becomes movsx (now a necessary instruction). Perhaps this gives a hint as to where the errant instruction is coming from.
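To reproduce the codegen, the function needs a concrete skip table, which the report elides. The sketch below is a self-contained variant that reads the loop condition as i < len and the test as a logical AND, which is what the disassembly implies. The table contents (every entry set to 1) and the helper init_skip are assumptions made purely so the code runs; the real table is not given in the report.

```cpp
#include <cstring>

// ASSUMPTION: the original table is elided in the report. A table of all
// ones makes the loop advance one byte at a time.
char skip[256];

// Hypothetical helper that fills the placeholder table.
void init_skip() { std::memset(skip, 1, sizeof skip); }

// Same control flow as the function in the report: start at index 7 and
// set bit 1 of the result whenever "D_" ends at the current index.
int foo(const unsigned char* str, int len) {
    int result = 0;
    int i = 7;
    while (i < len) {
        if (str[i] == '_' && str[i - 1] == 'D') {
            result |= 2;
        }
        i += skip[str[i]];  // str[i] is zero-extended to index the table
    }
    return result;
}
```

Compiling this with optimization and disassembling foo should show whether a given compiler version still emits the extra zero-extension before the table lookup; the exact output will depend on the compiler and flags.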
[Bug web/46031] New: Atomic Builtins page should indicate that 16-byte compare-and-swap is available with -mcx16
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46031

           Summary: Atomic Builtins page should indicate that 16-byte compare-and-swap is available with -mcx16
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: web
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: justin.lebar+...@gmail.com

The atomic builtins page doesn't indicate that gcc supports 16-byte compare-and-swap. But gcc does in fact support this instruction, so the page should indicate that the functionality can be enabled with -mcx16 (or, presumably, with the appropriate -march flag).

http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html
[Bug web/46031] Atomic Builtins page should indicate that 16-byte compare-and-swap is available with -mcx16
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46031

--- Comment #2 from Justin Lebar justin.lebar+bug at gmail dot com 2010-10-15 05:30:20 UTC ---
"Not all operations are supported by all target processors" isn't the same as "not all operations supported by target processors are listed here."

If you want to be vague and say "some target processors support other operations or other operand sizes, not listed here," I guess that would be an improvement. But from a user's perspective, I'd like the manual to tell me which builtins GCC supports, and under which circumstances those operations are available.

In this case, the manual's omission of 16-byte compare-and-swap suggested to me that gcc didn't support it at all.
[Bug web/46031] Atomic Builtins page should indicate that 16-byte compare-and-swap is available with -mcx16
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46031

--- Comment #4 from Justin Lebar justin.lebar+bug at gmail dot com 2010-10-15 05:54:28 UTC ---
> Actually it says the target processors might not include all of the builtins.

Maybe I'm not making sense. My point is that there's a builtin (*) that is supported by some target processors but is not listed in the manual. So the statement that some target processors might not include all the builtins isn't helpful.

(*) The builtin is compare_and_swap with 16-byte operands. The manual explicitly says "GCC will allow any integral scalar or pointer type that is 1, 2, 4 or 8 bytes in length" but should indicate that 16-byte operands are supported for compare_and_swap, at least under some circumstances.
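As an illustration of the functionality being requested, here is a minimal sketch of a 16-byte compare-and-swap using the legacy __sync builtin on GCC's __int128 extension type. GCC's x86 option for this is spelled -mcx16, and GCC predefines the macro __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16 when the 16-byte operation is available, so the sketch uses that macro to compile cleanly either way. The function name try_cas16 and its return convention are hypothetical.

```cpp
// Returns 1 if the 16-byte compare-and-swap succeeded, 0 if it failed,
// and -1 if the compiler reports that the operation is unavailable
// (for example, x86-64 compiled without -mcx16).
int try_cas16() {
#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16)
    // __int128 is a GCC extension and is naturally 16-byte aligned.
    unsigned __int128 value = 1;
    unsigned __int128 expected = 1;
    // A value that does not fit in 64 bits, so all 16 bytes matter.
    unsigned __int128 desired = (static_cast<unsigned __int128>(1) << 100) | 7u;
    bool swapped = __sync_bool_compare_and_swap(&value, expected, desired);
    return (swapped && value == desired) ? 1 : 0;
#else
    return -1;
#endif
}
```

On x86-64, building with something like g++ -mcx16 should make the first branch active. Without the option the macro is normally undefined on that target, which is the situation the manual, as described in the report, gives no hint about.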