[Bug target/110104] gcc produces sub-optimal code for _addcarry_u64 chain

2023-07-07 slash.tmp at free dot fr via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110104
--- Comment #5 from Mason ---
FWIW, trunk (gcc14) translates testcase3 to the same code as the other testcases, while remaining portable across all architectures:
$ gcc-trunk -O3 -march=bdver3 testcase3.c
typedef unsigned long long u64;
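
A portable carry chain of the kind this comment refers to might look like the sketch below. It is not the testcase3.c from the bug report, which is cut off in this listing. The names add_carry and add256 and the four-limb layout are assumptions made for illustration, and the code relies on the GCC/Clang builtin __builtin_add_overflow.

```c
typedef unsigned long long u64;

/* Hypothetical helper: *sum = a + b + carry_in, returns the carry out (0 or 1). */
static u64 add_carry(u64 a, u64 b, u64 carry_in, u64 *sum)
{
    u64 t;
    u64 c1 = __builtin_add_overflow(a, b, &t);
    u64 c2 = __builtin_add_overflow(t, carry_in, sum);
    /* If a + b wrapped, t <= 2^64 - 2, so adding a carry of at most 1
       cannot wrap again: c1 and c2 are never both set. */
    return c1 | c2;
}

/* Hypothetical 256-bit addition: dst = a + b, limbs stored least
   significant first. Returns the carry out of the top limb. */
u64 add256(u64 dst[4], const u64 a[4], const u64 b[4])
{
    u64 c = 0;
    for (int i = 0; i < 4; i++)
        c = add_carry(a[i], b[i], c, &dst[i]);
    return c;
}
```

Whether a given compiler turns this loop into a single add followed by adc instructions depends on the version and flags; the comment above only reports that behaviour for trunk at the time.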

[Bug target/110104] gcc produces sub-optimal code for _addcarry_u64 chain

2023-06-16 slash.tmp at free dot fr via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110104
--- Comment #4 from Mason ---
I confirm that trunk now emits the same code for testcase1 and testcase2. Thanks Jakub and Roger, great work!

[Bug target/110104] gcc produces sub-optimal code for _addcarry_u64 chain

2023-06-14 slash.tmp at free dot fr via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110104
--- Comment #2 from Mason ---
You meant PR79173 ;)
Latest update: https://gcc.gnu.org/pipermail/gcc-patches/2023-June/621554.html
I didn't see my testcase specifically in Jakub's patch, but I'll test trunk on godbolt when/if the patch lands.

[Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases

2023-06-13 slash.tmp at free dot fr via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #20 from Mason ---
Doh! You're right. I come from a background where overlapping/aliasing inputs are heresy, thus got blindsided :(
This would be the optimal code, right?
add4i: # rdi = dst, rsi = a, rdx = b
movq

[Bug target/102974] GCC optimization is very poor for add carry and multiplication combos

2023-06-06 slash.tmp at free dot fr via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102974
--- Comment #16 from Mason ---
For the record, the example I provided was intended to show that, with some help, GCC can generate good code for bigint multiplication. In this situation, "help" means a short asm template.

[Bug target/102974] GCC optimization is very poor for add carry and multiplication combos

2023-06-06 slash.tmp at free dot fr via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102974
--- Comment #12 from Mason ---
Actually, in this case, we don't need to propagate the carry over 3 limbs.
typedef unsigned int u32;
typedef unsigned long long u64;
/* u32 acc[2], a[1], b[1] */
static void mul_add_32x32(u32 *acc, const u32 *a,

[Bug target/102974] GCC optimization is very poor for add carry and multiplication combos

2023-06-03 slash.tmp at free dot fr via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102974
--- Comment #11 from Mason ---
Here's umul_least_64() rewritten as mul_64x64_128() in C
typedef unsigned int u32;
typedef unsigned long long u64;
/* u32 acc[3], a[1], b[1] */
static void mul_add_32x32(u32 *acc, const u32 *a, const u32 *b) {
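
The function body is cut off in this listing. As a rough illustration of the idea only, a schoolbook 64x64->128 multiplication built from a three-limb multiply-accumulate helper could be sketched as follows. The helper body, the signature of mul_64x64_128, and the five-limb scratch buffer are all assumptions and are not taken from the bug report.

```c
typedef unsigned int u32;
typedef unsigned long long u64;

/* u32 acc[3], a[1], b[1]: acc += a[0] * b[0], limbs least significant first.
   Assumed body; the original is truncated in the listing. */
static void mul_add_32x32(u32 *acc, const u32 *a, const u32 *b)
{
    u64 p  = (u64)a[0] * b[0];
    u64 lo = (u64)acc[0] + (u32)p;                  /* low limb plus low half   */
    u64 hi = (u64)acc[1] + (p >> 32) + (lo >> 32);  /* middle limb plus carry   */
    acc[0] = (u32)lo;
    acc[1] = (u32)hi;
    acc[2] += (u32)(hi >> 32);                      /* carry into the top limb  */
}

/* Hypothetical signature: res[0] = low 64 bits, res[1] = high 64 bits of x * y. */
void mul_64x64_128(u64 res[2], u64 x, u64 y)
{
    u32 a[2] = { (u32)x, (u32)(x >> 32) };
    u32 b[2] = { (u32)y, (u32)(y >> 32) };
    /* Five limbs so the last helper call has a third limb to write to;
       r[4] stays 0 because the full product fits in 128 bits. */
    u32 r[5] = { 0 };

    mul_add_32x32(r + 0, a + 0, b + 0);
    mul_add_32x32(r + 1, a + 0, b + 1);
    mul_add_32x32(r + 1, a + 1, b + 0);
    mul_add_32x32(r + 2, a + 1, b + 1);

    res[0] = ((u64)r[1] << 32) | r[0];
    res[1] = ((u64)r[3] << 32) | r[2];
}
```

For example, multiplying 0xFFFFFFFFFFFFFFFF by itself with this sketch gives a low half of 1 and a high half of 0xFFFFFFFFFFFFFFFE.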

[Bug target/110104] New: gcc produces sub-optimal code for _addcarry_u64 chain

2023-06-03 slash.tmp at free dot fr via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110104
Bug ID: 110104
Summary: gcc produces sub-optimal code for _addcarry_u64 chain
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
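
For context, an _addcarry_u64 chain is a sequence of calls where each call consumes the carry produced by the previous one. The sketch below is illustrative only and is not the testcase attached to the bug. The function name mul_acc_192 is hypothetical, and the code assumes an x86-64 target with GCC or Clang (for <immintrin.h> and unsigned __int128).

```c
#include <immintrin.h>

typedef unsigned long long u64;
typedef unsigned __int128 u128;

/* Hypothetical example: acc (three 64-bit limbs, least significant first)
   += a * b, with the carry threaded through an _addcarry_u64 chain. */
void mul_acc_192(u64 acc[3], u64 a, u64 b)
{
    u128 prod = (u128)a * b;
    u64 lo = (u64)prod;
    u64 hi = (u64)(prod >> 64);
    unsigned char cf = 0;

    cf = _addcarry_u64(cf, lo, acc[0], &acc[0]);  /* ideally an add */
    cf = _addcarry_u64(cf, hi, acc[1], &acc[1]);  /* ideally an adc */
    cf = _addcarry_u64(cf, 0,  acc[2], &acc[2]);  /* ideally an adc */
    (void)cf;  /* the carry out of the top limb is dropped in this sketch */
}
```

The hoped-for output is one mul followed by add/adc/adc on the accumulator; whether a particular GCC release achieves that is what the bug report is about.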

[Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases

2023-06-01 slash.tmp at free dot fr via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #18 from Mason ---
Hello Michael_S,
As far as I can see, massaging the source helps GCC generate optimal code (in terms of instruction count, not convinced about scheduling).
#include
typedef unsigned long long u64;
void