[Bug rtl-optimization/96031] suboptimal codegen for store low 16-bits value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031 --- Comment #4 from zhongyunde at tom dot com ---
> As for ivopt, I can see a minor improvement by replacing the != exit condition
> with <=, thus saving the add instruction computing _22, which happens to
> "disable" the wrong PRE transformation.
I took a look at the function may_eliminate_iv; currently iv_elimination_compare only returns EQ_EXPR or NE_EXPR. Do you mean we should extend it for this case?

5411      *bound = fold_convert (TREE_TYPE (cand->iv->base),
5412                             aff_combination_to_tree (&bnd));
5413      *comp = iv_elimination_compare (data, use);
5414
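The equivalence behind the suggested rewrite can be checked with a small sketch (mine, not from the bug): when the IV steps by a constant and the bound is a multiple of the step, a `!=` exit test and a `<` exit test give the same trip count, and the `<` form no longer needs the separately computed final value.

```c
#include <assert.h>

/* Count iterations of a loop exiting on "i != n" (step 2, n even). */
int trips_ne(int n) {
    int c = 0;
    for (int i = 0; i != n; i += 2)
        c++;  /* exact-match exit needs the precomputed bound n */
    return c;
}

/* Same loop with the exit test rewritten as "<": same trip count,
   but any bound >= n - 1 would do, so no extra add is required. */
int trips_lt(int n) {
    int c = 0;
    for (int i = 0; i < n; i += 2)
        c++;
    return c;
}
```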
[Bug c/96427] Missing align attribute for anchor section from local variables
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96427 --- Comment #6 from zhongyunde at tom dot com ---
Created attachment 49087
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49087&action=edit
adjust the alignment according to the attribute

If the user doesn't specify an alignment, we can apply this optimization; otherwise we should honor the user-specified alignment first, similar to the attached patch?
[Bug c/96586] New: suboptimal code generated for condition expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96586

            Bug ID: 96586
           Summary: suboptimal code generated for condition expression
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: zhongyunde at tom dot com
  Target Milestone: ---

For the following case, it is easy to see that the while loop executes exactly once, but the newest gcc 10.2 still generates suboptimal code containing the condition test.

void Proc_7 (int Int_Par_Ref);
void Proc_2 (int *Int_Par_Ref);

int main ()
{
  int Int_1_Loc;
  int Int_2_Loc;
  int Int_3_Loc;

  /* Initializations */
  Int_1_Loc = 2;
  Int_2_Loc = 3;
  while (Int_1_Loc < Int_2_Loc)
  {
    Proc_7 (0);
    Int_1_Loc += 1;
  } /* while */

  Int_1_Loc = 1;
  Proc_2 (&Int_1_Loc);
  return 0;
}

== the key assembly of the while loop ===
.L2:
        .loc 1 18 7 view .LVU10
        .loc 1 20 7 view .LVU11
        .loc 1 20 14 is_stmt 0 view .LVU12
        mov     edi, 5
        call    Proc_7(int)
.LVL1:
        .loc 1 22 7 is_stmt 1 view .LVU13
        .loc 1 22 17 is_stmt 0 view .LVU14
        mov     eax, DWORD PTR [rsp+12]
        add     eax, 1
        mov     DWORD PTR [rsp+12], eax
        .loc 1 16 5 is_stmt 1 view .LVU15
        .loc 1 16 22 view .LVU16
        cmp     eax, 2
        jle     .L2
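The claim that the loop runs exactly once can be checked directly; a minimal sketch of my own that mirrors the loop and counts its iterations:

```c
#include <assert.h>

/* Mirror of the while loop in main: Int_1_Loc starts at 2 with bound 3,
   so the body runs exactly once and the compare/branch back to .L2
   could be folded away at compile time. */
int loop_trip_count(void) {
    int Int_1_Loc = 2;
    int Int_2_Loc = 3;
    int trips = 0;
    while (Int_1_Loc < Int_2_Loc) {
        trips++;           /* stands in for the Proc_7 (0) call */
        Int_1_Loc += 1;
    }
    return trips;
}
```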
[Bug tree-optimization/93102] [optimization] is it legal to avoid accessing const local array from stack ?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93102 --- Comment #4 from zhongyunde at tom dot com ---
The case from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96427 generates *.LC0 but does not emit an aggregate copy a_1 = *.LC0, i.e. the transformation is legal even for a non-const local array.

typedef int v4si __attribute__((vector_size(64)));
int bar (v4si v);

int foo (int i)
{
  int a_1[131] = {38580, 691093, 378582, 691095, 938904, 251417, ... };
  v4si * ptr = (v4si *)a_1;
  v4si v = ptr[0];
  return bar (v);
}
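Why the stack copy is avoidable: the only use of a_1 is the 64-byte vector load of its first 16 ints, so the rest of the initializer is dead if the array never escapes. A scalar sketch of the same access pattern (my example, with a shortened initializer):

```c
/* Only a_1[0..15] (64 bytes, one v4si) is ever read; materializing
   the full 131-element array on the stack would be wasted work. */
int first_elems_sum(void) {
    int a_1[131] = {38580, 691093, 378582, 691095};  /* rest zero-filled */
    int sum = 0;
    for (int i = 0; i < 16; i++)   /* 16 ints == 64 bytes */
        sum += a_1[i];
    return sum;
}
```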
[Bug c/96427] Missing align attribute for anchor section from local variables
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96427 --- Comment #2 from zhongyunde at tom dot com ---
Should the data alignment honor the user-specified value? Currently the compiler _does_ align the initializer according to the aligned load: even if the local array doesn't specify __attribute__((aligned(64))), it is still aligned to 64 bytes.
[Bug rtl-optimization/95696] regrename creates overlapping register allocations for vliw
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95696 --- Comment #6 from zhongyunde at tom dot com ---
Thanks for your notes; I think this issue can be closed now. Non-SMS cases don't need to be handled, since they will be rescheduled in general, which is good for performance in my tests.
[Bug c/96427] New: Missing align attribute for anchor section from local variables
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96427

            Bug ID: 96427
           Summary: Missing align attribute for anchor section from local
                    variables
           Product: gcc
           Version: 9.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: zhongyunde at tom dot com
  Target Milestone: ---

For the following code, we know the local array a_1 is aligned to 64 bytes, but gcc currently aligns the related anchor data only to the default 32 bytes.

== test case
int bar (long long v);

int foo (int i)
{
  long long v;
  int a_1[131] __attribute__((aligned(64))) = {38580, 691093, 378582, 691095,
    938904, 251417, 38906, 251419, 2938908, 251421, 938910, 4863, 92352,
    104865, 792354, 4867, 2792356, 251429, 938918, 251431, 938920, 251433,
    938922, 104875, 22792364, 104877, 2792366, 104879, 2792368, 104881,
    6180210, 8492723, 6180212, 8492725, 6180214, 8492727, 33656, 346169,
    33658, 346171, 33660, 346173, 33662, 8492735, 6180224, 8492737, 6180226,
    8492739, 6180228, 346181, 33670, 346183, 33672, 346185, 33674, 7906507,
    593996, 7906509, 593998, 7906511, 594000, 7906513, 447442, 7759955,
    447444, 7759957, 447446, 7759959, 594008, 7906521, 594010, 7906523,
    594012, 7906525, 594014, 7759967, 447456, 7759969, 447458, 7759971,
    447460, 8492773, 6180262, 8492775, 6180264, 8492777, 6180266, 346219,
    33708, 346221, 33710, 346223, 33712, 346225, 6180274, 8492787, 6180276,
    8492789, 6180278, 8492791, 33720, 346233, 33722, 346235, 33724, 346237,
    33726, 7906559, 594048, 7906561, 594050, 7906563, 594052, 7760005,
    447494, 7760007, 447496, 7760009, 447498, 7906571, 594060, 7906573,
    594062, 7906575, 94064, 7906577, 447506, 760019, 447508, 760021, 447510};
  const long long * ptr = (const long long *)a_1;
  v = ptr[0];
  return bar (v);
}

= test based on the X86 gcc 9.3 on https://gcc.godbolt.org =
        .text
.Ltext0:
        .section .rodata
        .align 32      # here, the default 32-byte alignment of section .rodata is used
.LC0:
        .long 38580
        .long 691093
        .long 378582
        ...
foo(int):
        mov     rdi, QWORD PTR .LC0[rip]
        jmp     bar(long long)
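The 64-byte requirement on the stack object itself is easy to confirm at runtime; a minimal sketch of mine (the anchor-data alignment in .rodata, which this bug is about, can only be seen in the assembly):

```c
#include <stdint.h>

/* A local array with the same attribute as a_1 above; the address the
   compiler hands out must be a multiple of 64, so the backing
   initializer data should reasonably be 64-byte aligned as well. */
int local_array_alignment(void) {
    int a_1[16] __attribute__((aligned(64))) = {1};
    return (int)((uintptr_t)a_1 % 64);   /* 0 when properly aligned */
}
```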
[Bug rtl-optimization/96031] suboptimal codegen for store low 16-bits value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031 --- Comment #3 from zhongyunde at tom dot com ---
I found some differences between the two cases during ivopts. For the 2nd case, a UINT32-type iv sum is chosen:

  [local count: 955630224]:
  # sum_15 = PHI <0(5), sum_9(6)>
  # ivtmp.10_17 = PHI
  _2 = (short unsigned int) sum_15;
  _1 = _2;
  _11 = (void *) ivtmp.10_17;
  MEM[base: _11, offset: 0B] = _1;
  sum_9 = step_8(D) + sum_15;
  ivtmp.10_4 = ivtmp.10_17 + 2;
  if (ivtmp.10_4 != _22)
    goto ; [89.00%]

For the 1st case, a 'short unsigned int' ivtmp.8 is chosen, as your dump showed, and there is no UINT32-type candidate with step Step.

typedef unsigned int UINT32;
typedef unsigned short UINT16;
UINT16 array[12];

void foo (UINT32 len, UINT32 step)
{
  UINT32 index = 0;
  UINT32 sum = 0;
  for (index = 0; index < len; index++)
  {
    sum = index * step;
    array[index] = sum;
  }
}

I tried adding a UINT32 temporary sum as in the above case (the 3rd case), then modified gcc to add a UINT32-type candidate variable and adjusted the cost so this candidate is chosen (doing similar things as the 2nd case in ivopts); then we can also optimize away the 'and w2, w2, 65535' insn. But this method does not conform to the implementation approach of ivopts; maybe we need to extend a UINT32 candidate variable based on the 'short unsigned int' IV struct?

= the change of gcc to add a UINT32-type candidate variable ==
@@ -3389,7 +3389,7 @@ add_iv_candidate_for_bivs (struct ivopts_data *data)
   EXECUTE_IF_SET_IN_BITMAP (data->relevant, 0, i, bi)
     {
       iv = ver_info (data, i)->iv;
-      if (iv && iv->biv_p && !integer_zerop (iv->step))
+      if (iv && !integer_zerop (iv->step))
        add_iv_candidate_for_biv (data, iv);
     }
 }
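That the multiply form and the accumulator form store identical values (so the rewrite is safe) follows from unsigned wraparound; a sketch of mine checking both variants side by side:

```c
typedef unsigned int UINT32;
typedef unsigned short UINT16;

/* 1st case: array[index] = index * step, truncated by the UINT16 store. */
void foo_mul(UINT16 *a, UINT32 len, UINT32 step) {
    for (UINT32 index = 0; index < len; index++)
        a[index] = (UINT16)(index * step);
}

/* 3rd case: the same values from a running UINT32 accumulator, the
   form whose IV candidate lets ivopts drop the 16-bit mask. */
void foo_acc(UINT16 *a, UINT32 len, UINT32 step) {
    UINT32 sum = 0;
    for (UINT32 index = 0; index < len; index++) {
        a[index] = (UINT16) sum;
        sum += step;   /* sum == index * step (mod 2^32) each iteration */
    }
}

int variants_agree(void) {
    UINT16 a[12], b[12];
    foo_mul(a, 12, 70001);
    foo_acc(b, 12, 70001);
    for (int i = 0; i < 12; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}
```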
[Bug rtl-optimization/95696] regrename creates overlapping register allocations for vliw
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95696 --- Comment #3 from zhongyunde at tom dot com ---
(In reply to Richard Biener from comment #2)
> Please send patches to gcc-patc...@gcc.gnu.org
I have sent the patch by email according to your suggestion; please give me some advice, thanks!
[Bug rtl-optimization/96031] suboptimal codegen for store low 16-bits value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031 --- Comment #1 from zhongyunde at tom dot com ---
This may be enhanced by ivopts. If the case is adjusted as follows, the 'and w2, w2, 65535' disappears.

typedef unsigned int UINT32;
typedef unsigned short UINT16;
UINT16 array[12];

void foo (UINT32 len, UINT32 step)
{
  UINT32 index = 0;
  UINT32 sum = 0;
  for (index = 0; index < len; index++)
  {
    array[index] = sum;
    sum += step;
  }
}

// the assembly of the kernel loop body --
.L9:
        add     x2, x2, 2       // ivtmp.6, ivtmp.6,
.L3:
        strh    w3, [x4]        // sum, MEM[base: _12, offset: 0B]
        cmp     x2, x0          // ivtmp.6, _22
        add     w3, w3, w1      // sum, sum, step
        mov     x4, x2          // ivtmp.6, ivtmp.6
        bne     .L9             //,
[Bug rtl-optimization/96031] New: suboptimal codegen for store low 16-bits value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031

            Bug ID: 96031
           Summary: suboptimal codegen for store low 16-bits value
           Product: gcc
           Version: 8.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: zhongyunde at tom dot com
  Target Milestone: ---

For the following code, since the strh instruction only stores the low 16 bits of the value, the 'and w2, w2, 65535' is redundant. Tested with ARM64 gcc 8.2 on https://gcc.godbolt.org/, which produces the complicated assembly below.

typedef unsigned int UINT32;
typedef unsigned short UINT16;
UINT16 array[12];

void foo (UINT32 len, UINT32 step)
{
  UINT32 index = 1;
  for (index = 1; index < len; index++)
  {
    array[index] = index * step;
  }
}

// the assembly of the kernel loop body --
        b       .L4             //
.L6:
        add     x3, x3, 2       // ivtmp.6, ivtmp.6,
.L4:
        strh    w2, [x4, 2]     // ivtmp.4, MEM[base: _2, offset: 2B]
        add     w2, w1, w2      // tmp105, _12, ivtmp.4
        and     w2, w2, 65535   // ivtmp.4, tmp105
        cmp     x3, x0          // ivtmp.6, _23
        mov     x4, x3          // ivtmp.6, ivtmp.6
        bne     .L6             //,
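The redundancy claim itself is easy to verify: a 16-bit store already truncates, so masking with 65535 first can never change the stored value. A small sketch of mine:

```c
/* Models what strh does: the implicit conversion to unsigned short
   keeps exactly the low 16 bits, so an explicit "& 65535" beforehand
   is always a no-op. */
int mask_is_redundant(unsigned int v) {
    unsigned short stored = (unsigned short) v;           /* the strh */
    unsigned short masked = (unsigned short)(v & 65535u); /* the and + strh */
    return stored == masked;   /* 1 for every input v */
}
```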
[Bug rtl-optimization/95696] regrename creates overlapping register allocations for vliw
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95696

zhongyunde at tom dot com changed:

           What    |Removed     |Added
           CC      |            |zhongyunde at tom dot com

--- Comment #1 from zhongyunde at tom dot com ---
Created attachment 48739
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48739&action=edit
Step 7: Close chains for registers that were never really used delayed at the end of vliw

I made a patch, please help to review, thanks.
[Bug rtl-optimization/95696] New: regrename creates overlapping register allocations for vliw
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95696

            Bug ID: 95696
           Summary: regrename creates overlapping register allocations for
                    vliw
           Product: gcc
           Version: 7.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: zhongyunde at tom dot com
  Target Milestone: ---

On some targets, issuing two insns that change the same register in parallel is restricted. (Insn 73 starts with insn:TI, so it is issued together with the following insns until a new insn starting with insn:TI appears, such as insn 71.) Regrename knows that mode V2VF in insn 73 needs two successive registers, i.e. v2 and v3. Here is a dump snippet before regrename:

(insn:TI 73 76 71 4 (set (reg/v:V2VF 37 v2 [orig:180 _62 ] [180])
        (unspec:V2VF [
                (reg/v:VHF 43 v8 [orig:210 Dest_value ] [210])
                (reg/v:VHF 43 v8 [orig:210 Dest_value ] [210])
            ] UNSPEC_HFSQMAG_32X32)) "../test_modify.c":57 710 {hfsqmag_v2vf}
     (expr_list:REG_DEAD (reg/v:VHF 43 v8 [orig:210 Dest_value ] [210])
        (expr_list:REG_UNUSED (reg:VHF 38 v3)
            (expr_list:REG_STAGE (const_int 2 [0x2])
                (expr_list:REG_CYCLE (const_int 2 [0x2])
                    (expr_list:REG_UNITS (const_int 256 [0x100])
                        (nil)))

(insn 71 73 243 4 (set (reg:VHF 43 v8 [orig:265 MEM[(const vfloat32x16 *)Src_base_134] ] [265])
        (mem:VHF (reg/v/f:DI 13 a13 [orig:207 Src_base ] [207]) [1 MEM[(const vfloat32x16 *)Src_base_134]+0 S64 A512])) "../test_modify.c":56 450 {movvhf_internal}
     (expr_list:REG_STAGE (const_int 1 [0x1])
        (expr_list:REG_CYCLE (const_int 2 [0x2])
            (nil

Then, during regrename, insn 71 is transformed into the following code using register v3, so there is a conflict between insn 73 and insn 71, as both of them set the v3 register.

Register v2 (2): 73 [SVEC_REGS]
Register v8 (1): 71 [VEC_ALL_REGS]

(insn 71 73 243 4 (set (reg:VHF 38 v3 [orig:265 MEM[(const vfloat32x16 *)Src_base_134] ] [265])
        (mem:VHF (reg/v/f:DI 13 a13 [orig:207 Src_base ] [207]) [1 MEM[(const vfloat32x16 *)Src_base_134]+0 S64 A512])) "../test_modify.c":56 450 {movvhf_internal}
     (expr_list:REG_STAGE (const_int 1 [0x1])
        (expr_list:REG_CYCLE (const_int 2 [0x2])
[Bug rtl-optimization/95267] [ICE][gcse]: in process_insert_insn at gcse.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95267 zhongyunde at tom dot com changed: What|Removed |Added CC||zhongyunde at tom dot com --- Comment #6 from zhongyunde at tom dot com --- *** Bug 95210 has been marked as a duplicate of this bug. ***
[Bug rtl-optimization/95210] internal compiler error: in prepare_copy_insn, at gcse.c:1988
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95210 zhongyunde at tom dot com changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |DUPLICATE --- Comment #3 from zhongyunde at tom dot com --- https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95267 *** This bug has been marked as a duplicate of bug 95267 ***
[Bug c/95210] internal compiler error: in prepare_copy_insn, at gcse.c:1988
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95210 --- Comment #1 from zhongyunde at tom dot com ---
A patch for this issue:

diff --git a/gcc/gcse.c b/gcc/gcse.c
index 8b9518e..65982ec 100644
--- a/gcc/gcse.c
+++ b/gcc/gcse.c
@@ -853,7 +853,7 @@ can_assign_to_reg_without_clobbers_p (rtx x, machine_mode mode)
     {
       test_insn = make_insn_raw (gen_rtx_SET (gen_rtx_REG (word_mode,
-                                 FIRST_PSEUDO_REGISTER * 2),
+                                 max_regno + 1),
                                  const0_rtx));
[Bug c/95210] New: internal compiler error: in prepare_copy_insn, at gcse.c:1988
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95210

            Bug ID: 95210
           Summary: internal compiler error: in prepare_copy_insn, at
                    gcse.c:1988
           Product: gcc
           Version: 9.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: zhongyunde at tom dot com
  Target Milestone: ---

rtx_insn *
prepare_copy_insn (rtx reg, rtx exp)
{
  ...
  else
    {
      rtx_insn *insn = emit_insn (gen_rtx_SET (reg, exp));
      if (insn_invalid_p (insn, false))
        gcc_unreachable ();   // here is the ICE
      ...
    }
  pat = get_insns ();
  end_sequence ();
  return pat;
}

In the function can_assign_to_reg_without_clobbers_p, we check a temporary insn with regno 'FIRST_PSEUDO_REGISTER * 2'. In some corner cases, such as a pattern with an inout operand, the regno 'FIRST_PSEUDO_REGISTER * 2' happens to equal the regno in the REG_EQUAL note (FIRST_PSEUDO_REGISTER = 117), so the temporary insn is considered valid, but it later fails when allocating another regno for it; that is this issue.

(set (reg/v:V8HF16 236 )
     (unspec: V8HF18 [
             (reg: V8HF18 150)
             (reg: V8HF18 236)] UNSPEC_MOVTVFM))
   (expr_list:REG_EQUAL (unspec: V8HF18 [
             (reg: V8HF18 150)
             (reg: V8HF18 234)] UNSPEC_MOVTVFM ))

bool
can_assign_to_reg_without_clobbers_p (rtx x, machine_mode mode)
{
  /* Otherwise, check if we can make a valid insn from it.  First initialize
     our test insn if we haven't already.  */
  if (test_insn == 0)
    {
      test_insn
        = make_insn_raw (gen_rtx_SET (gen_rtx_REG (word_mode,
                                                   FIRST_PSEUDO_REGISTER * 2),
                                      const0_rtx));
      SET_NEXT_INSN (test_insn) = SET_PREV_INSN (test_insn) = 0;
      INSN_LOCATION (test_insn) = UNKNOWN_LOCATION;
    }

  /* Now make an insn like the one we would make when GCSE'ing and see if
     valid.  */
  PUT_MODE (SET_DEST (PATTERN (test_insn)), mode);
  SET_SRC (PATTERN (test_insn)) = x;
  icode = recog (PATTERN (test_insn), test_insn, &num_clobbers);
[Bug tree-optimization/95019] Optimizer produces suboptimal code related to -ftree-ivopts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95019 --- Comment #2 from zhongyunde at tom dot com ---
It is a generic issue for all targets; x86, for example, also doesn't expand IVOPTs here, as index is not used for Dest and Src directly. We may need to expand IVOPTs so that different targets can select different candidates according to their cost model. It currently looks OK for x86 because its load/store insns fold the lshift operand, so no separate lshift operand is needed in the loop body.

== based on ARM gcc 9.2.1 on https://gcc.godbolt.org, you get a separate lshift (lsl) in the loop kernel, while ARM64 gcc 8.2 uses 'ldr x3, [x1, x4, lsl 3]' to avoid the separate lshift operand; so we can see no target selects an IV with step 8.

C0ADA(unsigned long long, long long*, long long*):
        push    {r4, r5, r6, r7, lr}    @
        mov     r4, r0                  @ len, tmp135
        mov     r5, r1                  @ len, tmp136
        orrs    r1, r4, r5              @ tmp137, len
        beq     .L1                     @,
        mov     r1, #0                  @ C05A1,
.L3:
        lsl     r0, r1, #3              @ _2, C05A1,
        add     ip, r2, r1, lsl #3      @ tmp120, Src, C05A1,
        ldr     lr, [r2, r0]            @ _4, *_3
        ldr     ip, [ip, #4]            @ _4, *_3
        umull   r6, r7, lr, lr          @ tmp125, _4, _4
        mul     ip, lr, ip              @ tmp122, _4, tmp122
        adds    r1, r1, r4              @ C05A1, C05A1, len
        subs    r4, r4, #1              @ len, len,
        sbc     r5, r5, #0              @ len, len,
        add     r0, r3, r0              @ tmp121, Dest, _2
        add     r7, r7, ip, lsl #1      @,, tmp122,
        orrs    lr, r4, r5              @ tmp138, len
        stm     r0, {r6-r7}             @ *_5, tmp125
        bne     .L3                     @,
.L1:
        pop     {r4, r5, r6, r7, lr}    @
        bx      lr                      @

Thanks for your notice.
[Bug tree-optimization/95019] New: Optimizer produces suboptimal code related to -ftree-ivopts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95019

            Bug ID: 95019
           Summary: Optimizer produces suboptimal code related to
                    -ftree-ivopts
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: zhongyunde at tom dot com
  Target Milestone: ---

For the following code, we know the variable C05A1 is only used as the offset into the arrays Dest and Src, and the unit size of the arrays is 8 bytes, so an iv variable with step 8 would be good for targets whose load/store insns don't fold the lshift operand.

typedef unsigned long long UINT64;
void C0ADA(UINT64 len, long long *__restrict Src, long long *__restrict Dest)
{
  UINT64 C0ADD, index, C0068, offset, C0ADF;
  UINT64 C05A1 = 0;
  for (index = 0; index < len; index++)
  {
    Dest[C05A1] = Src[C05A1] * Src[C05A1];
    C05A1 += len - index;
  }
}

Tested with MIPS64 gcc 5.4 on https://gcc.godbolt.org; since MIPS64 has no load/store that folds the lshift operand (such as 'ldr x3, [x1, x4, lsl 3]' on ARM64), using an ivtmp with step 8 can eliminate the dsll insn from the kernel loop:

@@ -2,16 +2,17 @@
 C0ADA(unsigned long long, long long*, long long*):
        beq     $4,$0,.L10      #, len,,
        move    $7,$0           # C05A1,
+       dsll    $8,$4,3         # tmp, len << 3
+
 .L4:
-       dsll    $2,$7,3         # D.2019, C05A1,
-       daddu   $3,$5,$2        # tmp204, Src, D.2019
+       daddu   $3,$5,$7        # tmp204, Src, D.2019
        ld      $3,0($3)        # D.2021, *_10
-       daddu   $2,$6,$2        # tmp205, Dest, D.2019
+       daddu   $2,$6,$7        # tmp205, Dest, D.2019
        dmult   $3,$3           # D.2021, D.2021
        daddu   $7,$7,$4        # C05A1, C05A1, ivtmp.6
-       daddiu  $4,$4,-1        # ivtmp.6, ivtmp.6,
+       daddiu  $4,$4,-8        # ivtmp.6, ivtmp.6,
        mflo    $3              # D.2021
-       bne     $4,$0,.L4       #, ivtmp.6,,
+       bne     $8,$0,.L4       #, ivtmp.6,,
        sd      $3,0($2)        # D.2021, *_8
 .L10:
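The transformation being asked for amounts to keeping C05A1 pre-scaled as a byte offset so every access avoids the shift; a sketch of mine checking that the byte-offset form computes the same result as the original indexing form:

```c
typedef unsigned long long UINT64;

/* Reference version: the index is scaled by 8 on every access. */
void squares_idx(UINT64 len, const long long *Src, long long *Dest) {
    UINT64 C05A1 = 0;
    for (UINT64 index = 0; index < len; index++) {
        Dest[C05A1] = Src[C05A1] * Src[C05A1];
        C05A1 += len - index;
    }
}

/* Byte-offset version: the IV steps by multiples of 8, so no per-
   iteration shift (the dsll above) is needed for addressing. */
void squares_byteoff(UINT64 len, const long long *Src, long long *Dest) {
    UINT64 off = 0;   /* C05A1 kept pre-scaled as a byte offset */
    for (UINT64 index = 0; index < len; index++) {
        const long long *s = (const long long *)((const char *)Src + off);
        long long *d = (long long *)((char *)Dest + off);
        *d = *s * *s;
        off += (len - index) * 8;
    }
}

int byteoff_matches(void) {
    long long src[10] = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
    long long d1[10] = {0}, d2[10] = {0};
    squares_idx(4, src, d1);      /* touches indices 0, 4, 7, 9 */
    squares_byteoff(4, src, d2);
    for (int i = 0; i < 10; i++)
        if (d1[i] != d2[i]) return 0;
    return 1;
}
```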
[Bug c/94573] New: Optimizer produces suboptimal code related to -fstore-merging
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94573

            Bug ID: 94573
           Summary: Optimizer produces suboptimal code related to
                    -fstore-merging
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: zhongyunde at tom dot com
  Target Milestone: ---

For the following code, we know the initialization of the array C16DD is always consecutive, so we can use a bigger mode size. Tested with x86-64 gcc 9.2 on https://gcc.godbolt.org/: it is still handled DWORD by DWORD, and we expect it to be optimized with QWORD or an even bigger size.

extern signed int C16DD[43][12];
void C1F93(int index)
{
  C16DD[index][0] = 0;
  C16DD[index][1] = 0;
  C16DD[index][2] = 0;
  C16DD[index][3] = 0;
  C16DD[index][4] = 0;
  C16DD[index][5] = 0;
  C16DD[index][6] = 0;
  C16DD[index][7] = 0;
  return;
}

= related assembly =
C1F93(int):
        movsx   rdi, edi
        lea     rax, [rdi+rdi*2]
        sal     rax, 4
        mov     DWORD PTR C16DD[rax], 0
        mov     DWORD PTR C16DD[rax+4], 0
        mov     DWORD PTR C16DD[rax+8], 0
        mov     DWORD PTR C16DD[rax+12], 0
        mov     DWORD PTR C16DD[rax+16], 0
        mov     DWORD PTR C16DD[rax+20], 0
        mov     DWORD PTR C16DD[rax+24], 0
        mov     DWORD PTR C16DD[rax+28], 0
        ret
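Why merging is legal here: the eight 4-byte stores zero one contiguous 32-byte block within a row, so any wider store covering exactly those bytes is equivalent. A sketch of mine using memset to model the merged form and checking it touches only the intended bytes:

```c
#include <string.h>

signed int C16DD[43][12];

/* The eight DWORD stores cover one contiguous 32-byte region, so they
   can legally be merged into wider stores; memset stands in for the
   merged (QWORD/vector) form here. */
void C1F93_merged(int index) {
    memset(&C16DD[index][0], 0, 8 * sizeof(signed int));
}

int row_zeroed_correctly(void) {
    for (int j = 0; j < 12; j++)
        C16DD[5][j] = -1;                /* poison the whole row */
    C1F93_merged(5);
    for (int j = 0; j < 8; j++)          /* first 8 entries zeroed */
        if (C16DD[5][j] != 0) return 0;
    for (int j = 8; j < 12; j++)         /* the rest untouched */
        if (C16DD[5][j] != -1) return 0;
    return 1;
}
```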