[Bug rtl-optimization/96031] suboptimal codegen for store low 16-bits value

2020-08-25 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031

--- Comment #4 from zhongyunde at tom dot com  ---

> As for ivopt, I can see a minor improvement by replacing != exit condition
> with <=, thus saving add 2 instruction computing _22, which happens to
> "disable" the wrong PRE transformation.
> 
  I took a look at the function may_eliminate_iv. Currently
iv_elimination_compare only returns EQ_EXPR or NE_EXPR, so do you mean we
should extend it for this case?

  *bound = fold_convert (TREE_TYPE (cand->iv->base),
                         aff_combination_to_tree (&bnd));
  *comp = iv_elimination_compare (data, use);

[Bug c/96427] Missing align attribute for anchor section from local variables

2020-08-20 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96427

--- Comment #6 from zhongyunde at tom dot com  ---
Created attachment 49087
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49087&action=edit
adjust the alignment according to the attribute

If the user doesn't specify the alignment, we are free to optimize it.
Otherwise, should we honor the user's value first, similar to the attached
patch?

[Bug c/96586] New: suboptimal code generated for condition expression

2020-08-12 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96586

Bug ID: 96586
   Summary: suboptimal code generated for condition expression
   Product: gcc
   Version: 10.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: zhongyunde at tom dot com
  Target Milestone: ---

In the following case it is easy to see that the while loop executes exactly
once, but the newest gcc 10.2 still generates suboptimal code that keeps the
loop condition check.

void Proc_7 (int Int_Par_Ref);
void Proc_2 (int *Int_Par_Ref);

int main ()
{
  int   Int_1_Loc;
  int   Int_2_Loc;
  int   Int_3_Loc;

  /* Initializations */
  Int_1_Loc = 2;
  Int_2_Loc = 3;

  while (Int_1_Loc < Int_2_Loc)
    {
      Proc_7 (0);

      Int_1_Loc += 1;
    } /* while */

  Int_1_Loc = 1;
  Proc_2 (&Int_1_Loc);

  return 0;
}

== the key assembly of the while loop ==
.L2:
.loc 1 18 7 view .LVU10
.loc 1 20 7 view .LVU11
.loc 1 20 14 is_stmt 0 view .LVU12
mov edi, 5
call Proc_7(int)
.LVL1:
.loc 1 22 7 is_stmt 1 view .LVU13
.loc 1 22 17 is_stmt 0 view .LVU14
mov eax, DWORD PTR [rsp+12]
add eax, 1
mov DWORD PTR [rsp+12], eax
.loc 1 16 5 is_stmt 1 view .LVU15
.loc 1 16 22 view .LVU16
cmp eax, 2
jle .L2
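
Since Int_1_Loc starts at 2 and Int_2_Loc at 3, the loop body can only run
once. A minimal sketch of that reasoning follows. The counting stub for Proc_7
and the names run_loop, run_simplified and check_trip_count are hypothetical;
the real Proc_7 is external and is not shown in the report.

```c
#include <assert.h>

/* Hypothetical stub: counts calls instead of doing real work. */
static int proc_7_calls;
static void Proc_7 (int Int_Par_Ref) { (void) Int_Par_Ref; proc_7_calls++; }

/* The loop as written in the report. */
static int run_loop (void)
{
  int Int_1_Loc = 2;
  int Int_2_Loc = 3;

  proc_7_calls = 0;
  while (Int_1_Loc < Int_2_Loc)
    {
      Proc_7 (0);
      Int_1_Loc += 1;
    }
  return proc_7_calls;
}

/* What a fully simplified version would look like: one straight-line call. */
static int run_simplified (void)
{
  proc_7_calls = 0;
  Proc_7 (0);
  return proc_7_calls;
}

/* Self-check: both forms call Proc_7 the same number of times. */
static void check_trip_count (void)
{
  assert (run_loop () == run_simplified ());
}
```

Calling check_trip_count () from a test driver exercises both forms; it only
illustrates the expected trip count, not how gcc should derive it.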

[Bug tree-optimization/93102] [optimization] is it legal to avoid accessing const local array from stack ?

2020-08-04 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93102

--- Comment #4 from zhongyunde at tom dot com  ---
The case from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96427 generates
*.LC0 but doesn't emit an aggregate copy a_1 = *.LC0, i.e. this is legal even
for a non-const local array.

typedef int v4si __attribute__((vector_size(64)));
int bar (v4si v);
int foo (int i)
{
  int a_1[131] = {38580, 691093, 378582, 691095, 938904, 251417, ... };
  v4si * ptr = (v4si *)a_1;
  v4si v = ptr[0];
  return bar (v);
}

[Bug c/96427] Missing align attribute for anchor section from local variables

2020-08-03 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96427

--- Comment #2 from zhongyunde at tom dot com  ---
Should the data alignment honor the user-specified value?

Currently the compiler _does_ seem to align the initializer according to the
alignment of the load, so even if the local array doesn't specify
__attribute__((aligned(64))), it is still aligned to 64 bytes.

[Bug rtl-optimization/95696] regrename creates overlapping register allocations for vliw

2020-08-03 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95696

--- Comment #6 from zhongyunde at tom dot com  ---
Thanks for your notes; I think this issue can be closed now.

There is no need to handle the non-SMS cases, as they generally get
rescheduled, which is good for performance in my tests.

[Bug c/96427] New: Missing align attribute for anchor section from local variables

2020-08-03 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96427

Bug ID: 96427
   Summary: Missing align attribute for anchor section from local
variables
   Product: gcc
   Version: 9.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: zhongyunde at tom dot com
  Target Milestone: ---

For the following code, we know that the local array a_1 is aligned to 64
bytes, but gcc currently aligns the related anchor data only to the default 32
bytes.

== test case ==
int bar (long long v);
int foo (int i)
{
  long long v;
  int a_1[131] __attribute__((aligned(64))) = {38580, 691093, 378582, 691095,
938904, 251417, 38906, 251419, 2938908, 251421, 938910, 4863, 92352, 104865,
792354, 4867, 2792356,251429, 938918,251431, 938920,251433, 938922, 104875,
22792364, 104877, 2792366, 104879, 2792368, 104881, 6180210,8492723,
6180212,8492725,
6180214,8492727,33656,346169,33658,346171,33660,346173,33662,8492735,
6180224,8492737, 6180226,8492739,
6180228,346181,33670,346183,33672,346185,33674,7906507, 593996,7906509,
593998,7906511, 594000,7906513,447442,7759955,447444,7759957,447446,7759959,
594008,7906521, 594010,7906523, 594012,7906525,
594014,7759967,447456,7759969,447458,7759971,447460,8492773, 6180262,8492775,
6180264,8492777, 6180266,346219,33708,346221,33710,346223,33712,346225,
6180274,8492787, 6180276,8492789,
6180278,8492791,33720,346233,33722,346235,33724,346237,33726,7906559,
594048,7906561, 594050,7906563,
594052,7760005,447494,7760007,447496,7760009,447498,7906571, 594060,7906573,
594062,7906575, 94064, 7906577, 447506, 760019, 447508, 760021, 447510};
  const long long * ptr = (const long long *)a_1;
  v = ptr[0];

  return bar (v);
} 

== tested with x86 gcc 9.3 on https://gcc.godbolt.org ==
.text
.Ltext0:
.section   .rodata
.align 32  # here the default 32-byte alignment of section .rodata is used
.LC0:
.long   38580
.long   691093
.long   378582
...

foo(int):
mov rdi, QWORD PTR .LC0[rip]
jmp bar(long long)
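
For comparison, a minimal sketch of how a requested alignment can be checked
at run time. It puts the attribute on a static const object rather than on a
local array, which is an assumption made here to keep the check simple and is
not the configuration from the report. The names table, is_aligned and
check_table_alignment are hypothetical, and __attribute__((aligned)) is a GCC
extension.

```c
#include <assert.h>
#include <stdint.h>

/* Shortened, hypothetical stand-in for a_1; the attribute sits on the
   object that actually holds the data.  */
static const int table[4] __attribute__((aligned(64))) =
  {38580, 691093, 378582, 691095};

/* Nonzero if P is aligned to ALIGN bytes; ALIGN must be a power of two.  */
static int is_aligned (const void *p, uintptr_t align)
{
  return ((uintptr_t) p & (align - 1)) == 0;
}

/* Self-check: the object carries the alignment that was asked for.  */
static void check_table_alignment (void)
{
  assert (is_aligned (table, 64));
}
```

The same is_aligned helper could be pointed at the address a wide load reads
from, to see which alignment the emitted data really has.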

[Bug rtl-optimization/96031] suboptimal codegen for store low 16-bits value

2020-07-20 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031

--- Comment #3 from zhongyunde at tom dot com  ---
I found some differences between the two cases in ivopts.

For the 2nd case, a UINT32 type iv sum is chosen:
  [local count: 955630224]:
  # sum_15 = PHI <0(5), sum_9(6)>
  # ivtmp.10_17 = PHI 
  _2 = (short unsigned int) sum_15;
  _1 = _2;
  _11 = (void *) ivtmp.10_17;
  MEM[base: _11, offset: 0B] = _1;
  sum_9 = step_8(D) + sum_15;
  ivtmp.10_4 = ivtmp.10_17 + 2;
  if (ivtmp.10_4 != _22)
goto ; [89.00%]

For the 1st case, a 'short unsigned int' type ivtmp.8 is chosen, as your dump
showed, and there is no UINT32 type candidate with step 'step'.

typedef unsigned int UINT32;
typedef unsigned short UINT16;

UINT16 array[12];

void foo (UINT32 len, UINT32 step)  
{
UINT32 index = 0;
UINT32 sum = 0;
for (index = 0; index < len; index++ )
{  
sum = index * step;
array[index] = sum;
}
}

I tried adding a UINT32 type temporary sum as in the case above (the 3rd
case), then modified gcc to add a UINT32 type candidate variable and adjusted
the cost so that this candidate is chosen (similar to what ivopts does for the
2nd case). With that, the 'and w2, w2, 65535' insn is optimized away as well.
But this approach does not conform to the way ivopts is implemented. Maybe we
need to derive a UINT32 candidate variable from the 'short unsigned int' IV
struct?

== the change to gcc to add a UINT32 type candidate variable ==
@@ -3389,7 +3389,7 @@ add_iv_candidate_for_bivs (struct ivopts_data *data)
   EXECUTE_IF_SET_IN_BITMAP (data->relevant, 0, i, bi)
 {
   iv = ver_info (data, i)->iv;
-  if (iv && iv->biv_p && !integer_zerop (iv->step))
+  if (iv && !integer_zerop (iv->step))
add_iv_candidate_for_biv (data, iv);
 }
 }

[Bug rtl-optimization/95696] regrename creates overlapping register allocations for vliw

2020-07-19 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95696

--- Comment #3 from zhongyunde at tom dot com  ---
(In reply to Richard Biener from comment #2)
> Please send patches to gcc-patc...@gcc.gnu.org

I have sent this patch by email as you suggested. Please give me some advice,
thanks!

[Bug rtl-optimization/96031] suboptimal codegen for store low 16-bits value

2020-07-06 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031

--- Comment #1 from zhongyunde at tom dot com  ---
This could perhaps be improved in ivopts.
If the case is adjusted as follows, the 'and w2, w2, 65535' disappears.


typedef unsigned int UINT32;
typedef unsigned short UINT16;


UINT16 array[12];

void foo (UINT32 len, UINT32 step)  
{
UINT32 index = 0;
UINT32 sum = 0;
for (index = 0; index < len; index++ )
{  
array[index] = sum;
sum += step;
}
}

// the assembly of the kernel loop body --
.L9:
        add     x2, x2, 2       // ivtmp.6, ivtmp.6,
.L3:
        strh    w3, [x4]        // sum, MEM[base: _12, offset: 0B]
        cmp     x2, x0          // ivtmp.6, _22
        add     w3, w3, w1      // sum, sum, step
        mov     x4, x2          // ivtmp.6, ivtmp.6
        bne     .L9             //,
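
The adjusted loop stores the same values as the multiply form, because
unsigned arithmetic wraps and only the low 16 bits are stored. A minimal
sketch of that equivalence, assuming both loops start at index 0; the names
fill_mul, fill_acc and check_same_result are hypothetical.

```c
#include <assert.h>
#include <string.h>

typedef unsigned int UINT32;
typedef unsigned short UINT16;

enum { N = 12 };

/* Multiply form: compute index * step, truncate on store.  */
static void fill_mul (UINT16 *out, UINT32 len, UINT32 step)
{
  for (UINT32 index = 0; index < len; index++)
    out[index] = (UINT16) (index * step);
}

/* Accumulate form: keep a 32-bit running sum, truncate on store.  */
static void fill_acc (UINT16 *out, UINT32 len, UINT32 step)
{
  UINT32 sum = 0;
  for (UINT32 index = 0; index < len; index++)
    {
      out[index] = (UINT16) sum;
      sum += step;
    }
}

/* Self-check: both forms fill the array identically for a given step.  */
static void check_same_result (UINT32 step)
{
  UINT16 a[N], b[N];
  fill_mul (a, N, step);
  fill_acc (b, N, step);
  assert (memcmp (a, b, sizeof a) == 0);
}
```

This only shows that the two source forms agree; which induction variable
ivopts picks for each form is a separate question.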

[Bug rtl-optimization/96031] New: suboptimal codegen for store low 16-bits value

2020-07-02 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96031

Bug ID: 96031
   Summary: suboptimal codegen for store low 16-bits value
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: zhongyunde at tom dot com
  Target Milestone: ---

In the following code, the strh instruction stores only the low 16 bits of
the value, so the 'and w2, w2, 65535' is redundant.
Tested with ARM64 gcc 8.2 on https://gcc.godbolt.org/, which produces the more
complicated assembly below.

typedef unsigned int UINT32;
typedef unsigned short UINT16;


UINT16 array[12];

void foo (UINT32 len, UINT32 step)  
{
UINT32 index = 1;

for (index = 1 ; index < len; index++ )
{
array[index] = index * step;
}
}

// the assembly of the kernel loop body --
        b       .L4             //
.L6:
        add     x3, x3, 2       // ivtmp.6, ivtmp.6,
.L4:
        strh    w2, [x4, 2]     // ivtmp.4, MEM[base: _2, offset: 2B]
        add     w2, w1, w2      // tmp105, _12, ivtmp.4
        and     w2, w2, 65535   // ivtmp.4, tmp105
        cmp     x3, x0          // ivtmp.6, _23
        mov     x4, x3          // ivtmp.6, ivtmp.6
        bne     .L6             //,
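
A minimal sketch of why the mask is redundant: a 16-bit store already drops
the high bits, and because carries in an addition only travel upward, the low
16 bits of the next sum do not depend on the bits the mask clears. The names
store_masked, store_plain and check_mask_is_redundant are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Store with an explicit mask first, as the generated 'and' does.  */
static uint16_t store_masked (uint32_t x)
{
  uint16_t slot = (uint16_t) (x & 65535u);
  return slot;
}

/* Store directly; the conversion to uint16_t keeps only the low 16 bits.  */
static uint16_t store_plain (uint32_t x)
{
  uint16_t slot = (uint16_t) x;
  return slot;
}

/* Self-check: the mask changes neither this store nor the next one.  */
static void check_mask_is_redundant (uint32_t x, uint32_t step)
{
  assert (store_masked (x) == store_plain (x));
  assert (store_plain ((x & 65535u) + step) == store_plain (x + step));
}
```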

[Bug rtl-optimization/95696] regrename creates overlapping register allocations for vliw

2020-06-16 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95696

zhongyunde at tom dot com  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |zhongyunde at tom dot com

--- Comment #1 from zhongyunde at tom dot com  ---
Created attachment 48739
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48739&action=edit
Step 7: Close chains for registers that were never really used delayed at the
end of vliw

I made a patch; please help review it, thanks.

[Bug rtl-optimization/95696] New: regrename creates overlapping register allocations for vliw

2020-06-16 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95696

Bug ID: 95696
   Summary: regrename creates overlapping register allocations for
vliw
   Product: gcc
   Version: 7.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: zhongyunde at tom dot com
  Target Milestone: ---

On some targets, two insns that modify the same register cannot be issued
together. (Insn 73 starts with insn:TI, so it is issued together with the
insns that follow it, such as insn 71, until a new insn marked insn:TI starts.)
Regrename can tell that mode V2VF in insn 73 needs two consecutive registers,
i.e. v2 and v3. Here is a dump snippet from before regrename.

(insn:TI 73 76 71 4 (set (reg/v:V2VF 37 v2 [orig:180 _62 ] [180])
(unspec:V2VF [
(reg/v:VHF 43 v8 [orig:210 Dest_value ] [210])
(reg/v:VHF 43 v8 [orig:210 Dest_value ] [210])
] UNSPEC_HFSQMAG_32X32)) "../test_modify.c":57 710 {hfsqmag_v2vf}
 (expr_list:REG_DEAD (reg/v:VHF 43 v8 [orig:210 Dest_value ] [210])
(expr_list:REG_UNUSED (reg:VHF 38 v3)
(expr_list:REG_STAGE (const_int 2 [0x2])
(expr_list:REG_CYCLE (const_int 2 [0x2])
(expr_list:REG_UNITS (const_int 256 [0x100])
(nil)))

(insn 71 73 243 4 (set (reg:VHF 43 v8 [orig:265 MEM[(const vfloat32x16
*)Src_base_134] ] [265])
(mem:VHF (reg/v/f:DI 13 a13 [orig:207 Src_base ] [207]) [1 MEM[(const
vfloat32x16 *)Src_base_134]+0 S64 A512])) "../test_modify.c":56 450
{movvhf_internal}
 (expr_list:REG_STAGE (const_int 1 [0x1])
(expr_list:REG_CYCLE (const_int 2 [0x2])
(nil

Then, in regrename, insn 71 is transformed into the following code using
register v3, so there is a conflict between insn 73 and insn 71, as both of
them set the v3 register.

Register v2 (2): 73 [SVEC_REGS]
Register v8 (1): 71 [VEC_ALL_REGS]



(insn 71 73 243 4 (set (reg:VHF 38 v3 [orig:265 MEM[(const vfloat32x16
*)Src_base_134] ] [265])
(mem:VHF (reg/v/f:DI 13 a13 [orig:207 Src_base ] [207]) [1 MEM[(const
vfloat32x16 *)Src_base_134]+0 S64 A512])) "../test_modify.c":56 450
{movvhf_internal}
 (expr_list:REG_STAGE (const_int 1 [0x1])
(expr_list:REG_CYCLE (const_int 2 [0x2])

[Bug rtl-optimization/95267] [ICE][gcse]: in process_insert_insn at gcse.c

2020-05-21 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95267

zhongyunde at tom dot com  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |zhongyunde at tom dot com

--- Comment #6 from zhongyunde at tom dot com  ---
*** Bug 95210 has been marked as a duplicate of this bug. ***

[Bug rtl-optimization/95210] internal compiler error: in prepare_copy_insn, at gcse.c:1988

2020-05-21 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95210

zhongyunde at tom dot com  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |DUPLICATE

--- Comment #3 from zhongyunde at tom dot com  ---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95267

*** This bug has been marked as a duplicate of bug 95267 ***

[Bug c/95210] internal compiler error: in prepare_copy_insn, at gcse.c:1988

2020-05-19 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95210

--- Comment #1 from zhongyunde at tom dot com  ---
patch for this issue.

@ linux-9z2e in ~/software/gcc/gcc on git:master o [23:02:26] 
$ git diff
diff --git a/gcc/gcse.c b/gcc/gcse.c
index 8b9518e..65982ec 100644
--- a/gcc/gcse.c
+++ b/gcc/gcse.c
@@ -853,7 +853,7 @@ can_assign_to_reg_without_clobbers_p (rtx x, machine_mode mode)
 {
   test_insn
= make_insn_raw (gen_rtx_SET (gen_rtx_REG (word_mode,
-  FIRST_PSEUDO_REGISTER * 2),
+  max_regno + 1),
  const0_rtx));

[Bug c/95210] New: internal compiler error: in prepare_copy_insn, at gcse.c:1988

2020-05-19 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95210

Bug ID: 95210
   Summary: internal compiler error: in prepare_copy_insn, at
gcse.c:1988
   Product: gcc
   Version: 9.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: zhongyunde at tom dot com
  Target Milestone: ---

rtx_insn *
prepare_copy_insn (rtx reg, rtx exp)
{
  ... 
  else
{
  rtx_insn *insn = emit_insn (gen_rtx_SET (reg, exp));

  if (insn_invalid_p (insn, false))
gcc_unreachable ();  // here is the ICE ...
}

  pat = get_insns ();
  end_sequence ();  

  return pat;
}

As shown in the function can_assign_to_reg_without_clobbers_p below, we
validate a temporary insn whose destination has regno
'FIRST_PSEUDO_REGISTER * 2'. In some corner cases, such as a pattern with an
inout operand, that regno happens to equal the regno in the REG_EQUAL note
(here FIRST_PSEUDO_REGISTER = 117, so the regno is 234). The temporary insn is
then valid, but validation fails once another regno is allocated for it; that
is this issue.

(set (reg/v:V8HF16 236 )
  (unspec: V8HF18 [ (reg: V8HF18 150)
(reg: V8HF18 236)] UNSPEC_MOVTVFM)) 
   (expr_list:REG_EQUAL (unspec: V8HF18 [ (reg: V8HF18 150)
  (reg: V8HF18 234)] UNSPEC_MOVTVFM ))

bool
can_assign_to_reg_without_clobbers_p (rtx x, machine_mode mode)
{
   

  /* Otherwise, check if we can make a valid insn from it.  First initialize
 our test insn if we haven't already.  */
  if (test_insn == 0)
{
  test_insn
= make_insn_raw (gen_rtx_SET (gen_rtx_REG (word_mode,
   FIRST_PSEUDO_REGISTER * 2),
  const0_rtx));
  SET_NEXT_INSN (test_insn) = SET_PREV_INSN (test_insn) = 0;
  INSN_LOCATION (test_insn) = UNKNOWN_LOCATION;
}

  /* Now make an insn like the one we would make when GCSE'ing and see if
 valid.  */
  PUT_MODE (SET_DEST (PATTERN (test_insn)), mode);
  SET_SRC (PATTERN (test_insn)) = x;

  icode = recog (PATTERN (test_insn), test_insn, &num_clobbers);
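
A toy model of the collision, using the register numbers from the RTL above
(150, 234, 236) and FIRST_PSEUDO_REGISTER = 117. It uses plain arrays rather
than GCC's data structures; regno_in_use, fresh_regno and check_scratch_regno
are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if REGNO appears among the N register numbers in USED.  */
static int regno_in_use (unsigned regno, const unsigned *used, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (used[i] == regno)
      return 1;
  return 0;
}

/* One more than the largest number in USED (0 for an empty list).  */
static unsigned fresh_regno (const unsigned *used, size_t n)
{
  unsigned max = 0;
  for (size_t i = 0; i < n; i++)
    if (used[i] > max)
      max = used[i];
  return n ? max + 1 : 0;
}

/* Self-check: the fixed scratch number collides, a fresh one cannot.  */
static void check_scratch_regno (void)
{
  const unsigned used[] = { 150, 234, 236 };
  const unsigned first_pseudo_register = 117;   /* value given in the report */

  assert (regno_in_use (first_pseudo_register * 2, used, 3));
  assert (!regno_in_use (fresh_regno (used, 3), used, 3));
}
```

A fixed scratch number can coincide with a register the pattern already
mentions, while a number above every register in use cannot.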

[Bug tree-optimization/95019] Optimizer produces suboptimal code related to -ftree-ivopts

2020-05-12 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95019

--- Comment #2 from zhongyunde at tom dot com  ---
This is a generic issue for all targets. On x86, for example, IVOPTs doesn't
expand its candidates either, as index is not used for Dest and Src directly.
We may need to extend IVOPTs so that different targets can select different
candidates according to their cost model.
For now it seems OK on x86, as its load/store insns fold the left-shift
operand, so no separate left shift is needed in the loop body.

== Based on ARM gcc 9.2.1 on https://gcc.godbolt.org, you'll get a separate
left-shift insn (lsl) in the loop kernel, while ARM64 gcc 8.2 uses ldr
x3, [x1, x4, lsl 3] to avoid the separate shift. So we can see that no target
selects an IV with step 8.
C0ADA(unsigned long long, long long*, long long*):
        push    {r4, r5, r6, r7, lr}    @
        mov     r4, r0  @ len, tmp135
        mov     r5, r1  @ len, tmp136
        orrs    r1, r4, r5      @ tmp137, len
        beq     .L1     @,
        mov     r1, #0  @ C05A1,
.L3:
        lsl     r0, r1, #3      @ _2, C05A1,
        add     ip, r2, r1, lsl #3      @ tmp120, Src, C05A1,
        ldr     lr, [r2, r0]    @ _4, *_3
        ldr     ip, [ip, #4]    @ _4, *_3
        umull   r6, r7, lr, lr  @ tmp125, _4, _4
        mul     ip, lr, ip      @ tmp122, _4, tmp122
        adds    r1, r1, r4      @ C05A1, C05A1, len
        subs    r4, r4, #1      @ len, len,
        sbc     r5, r5, #0      @ len, len,
        add     r0, r3, r0      @ tmp121, Dest, _2
        add     r7, r7, ip, lsl #1      @,, tmp122,
        orrs    lr, r4, r5      @ tmp138, len
        stm     r0, {r6-r7}     @ *_5, tmp125
        bne     .L3     @,
.L1:
        pop     {r4, r5, r6, r7, lr}    @
        bx      lr      @

Thanks for your notice.

[Bug tree-optimization/95019] New: Optimizer produces suboptimal code related to -ftree-ivopts

2020-05-09 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95019

Bug ID: 95019
   Summary: Optimizer produces suboptimal code related to
-ftree-ivopts
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: zhongyunde at tom dot com
  Target Milestone: ---

In the following code, the variable C05A1 is only used as the offset into the
arrays Dest and Src, and the element size of the arrays is 8 bytes, so an iv
variable with step 8 would be good for targets whose load/store insns don't
fold the left-shift operand.

typedef unsigned long long UINT64;

void C0ADA(UINT64 len, long long *__restrict Src, long long *__restrict Dest)
{
    UINT64 C0ADD, index, C0068, offset, C0ADF;
    UINT64 C05A1 = 0;

    for (index = 0; index < len; index++) {
        Dest[C05A1] =  Src[C05A1] * Src[C05A1];
        C05A1 += len - index;
    }
}

Tested with MIPS64 gcc 5.4 on https://gcc.godbolt.org. As the MIPS64 target
has no load/store that folds the left-shift operand (such as 'ldr x3,
[x1, x4, lsl 3]' on ARM64 targets), using an ivtmp with step 8 can eliminate
the dsll insn in the kernel loop.

@@ -2,16 +2,17 @@ C0ADA(unsigned long long, long long*, long long*):
         beq     $4,$0,.L10      #, len,,
         move    $7,$0   # C05A1,

+        dsll    $8,$4,3 # tmp, len << 3
+
 .L4:
-        dsll    $2,$7,3 # D.2019, C05A1,
-        daddu   $3,$5,$2        # tmp204, Src, D.2019
+        daddu   $3,$5,$7        # tmp204, Src, D.2019
         ld      $3,0($3)        # D.2021, *_10
-        daddu   $2,$6,$2        # tmp205, Dest, D.2019
+        daddu   $2,$6,$7        # tmp205, Dest, D.2019
         dmult   $3,$3   # D.2021, D.2021
         daddu   $7,$7,$4        # C05A1, C05A1, ivtmp.6
-        daddiu  $4,$4,-1        # ivtmp.6, ivtmp.6,
+        daddiu  $4,$4,-8        # ivtmp.6, ivtmp.6,
         mflo    $3      # D.2021
-        bne     $4,$0,.L4       #, ivtmp.6,,
+        bne     $8,$0,.L4       #, ivtmp.6,,
         sd      $3,0($2)        # D.2021, *_8

 .L10:
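
The same idea can be written at source level: keep the offset and its
increment pre-scaled by the element size, so no shift is needed per access. A
minimal hand-written sketch, not compiler output; square_indexed,
square_byte_offset and check_same_squares are hypothetical names, and the
__restrict qualifiers are dropped for brevity.

```c
#include <assert.h>
#include <string.h>

typedef unsigned long long UINT64;

/* Original form: element index C05A1 is scaled by 8 at each access.  */
static void square_indexed (UINT64 len, const long long *Src, long long *Dest)
{
  UINT64 C05A1 = 0;
  for (UINT64 index = 0; index < len; index++)
    {
      Dest[C05A1] = Src[C05A1] * Src[C05A1];
      C05A1 += len - index;
    }
}

/* Pre-scaled form: OFF is C05A1 * 8 and STEP is (len - index) * 8.  */
static void square_byte_offset (UINT64 len, const long long *Src,
                                long long *Dest)
{
  UINT64 off = 0;
  UINT64 step = len * sizeof (long long);
  for (UINT64 index = 0; index < len; index++)
    {
      const long long *s = (const long long *) ((const char *) Src + off);
      long long *d = (long long *) ((char *) Dest + off);
      *d = *s * *s;
      off += step;
      step -= sizeof (long long);
    }
}

/* Self-check for len = 4, which touches elements 0, 4, 7 and 9.  */
static void check_same_squares (void)
{
  long long src[10], a[10] = {0}, b[10] = {0};
  for (int i = 0; i < 10; i++)
    src[i] = i + 1;
  square_indexed (4, src, a);
  square_byte_offset (4, src, b);
  assert (memcmp (a, b, sizeof a) == 0);
}
```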

[Bug c/94573] New: Optimizer produces suboptimal code related to -fstore-merging

2020-04-12 Thread zhongyunde at tom dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94573

Bug ID: 94573
   Summary: Optimizer produces suboptimal code related to
-fstore-merging
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: zhongyunde at tom dot com
  Target Milestone: ---

In the following code, the initialization of the array C16DD always touches
consecutive elements, so a wider store mode could be used.
Tested with x86-64 gcc 9.2 on https://gcc.godbolt.org/: the stores are still
emitted DWORD by DWORD, while we expect them to be merged into QWORD or wider
stores.

extern signed int C16DD[43][12]; 

void C1F93(int index)
{
C16DD[index][0] = 0;
C16DD[index][1] = 0;
C16DD[index][2] = 0;
C16DD[index][3] = 0;
C16DD[index][4] = 0;
C16DD[index][5] = 0;
C16DD[index][6] = 0;
C16DD[index][7] = 0;

return;
}

== related assembly ==
C1F93(int):
movsx   rdi, edi
lea rax, [rdi+rdi*2]
sal rax, 4
mov DWORD PTR C16DD[rax], 0
mov DWORD PTR C16DD[rax+4], 0
mov DWORD PTR C16DD[rax+8], 0
mov DWORD PTR C16DD[rax+12], 0
mov DWORD PTR C16DD[rax+16], 0
mov DWORD PTR C16DD[rax+20], 0
mov DWORD PTR C16DD[rax+24], 0
mov DWORD PTR C16DD[rax+28], 0
ret
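
For reference, a hand-merged source-level equivalent: the eight stores cover
32 consecutive bytes of one row, so a single clear of that range has the same
effect. This is a sketch of the expected result, not of how the store-merging
pass works. C16DD is defined here (the report only declares it extern) so the
example links, the eight assignments are written as a loop for brevity, and
C1F93_merged and check_merged are hypothetical names.

```c
#include <assert.h>
#include <string.h>

int C16DD[43][12];

/* The eight separate DWORD stores from the report.  */
void C1F93 (int index)
{
  for (int i = 0; i < 8; i++)
    C16DD[index][i] = 0;
}

/* Hand-merged equivalent: one 32-byte clear of the same elements.  */
void C1F93_merged (int index)
{
  memset (&C16DD[index][0], 0, 8 * sizeof (int));
}

/* Self-check: both versions leave row INDEX in the same state.  */
static void check_merged (int index)
{
  int expect[12], got[12];

  memset (C16DD[index], 0x55, sizeof C16DD[index]);
  C1F93 (index);
  memcpy (expect, C16DD[index], sizeof expect);

  memset (C16DD[index], 0x55, sizeof C16DD[index]);
  C1F93_merged (index);
  memcpy (got, C16DD[index], sizeof got);

  assert (memcmp (expect, got, sizeof got) == 0);
}
```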