Re: [RFC] load/store widening question

2015-02-19 Thread Ramana Radhakrishnan
On Thu, Feb 19, 2015 at 9:17 AM, Marat Zakirov m.zaki...@samsung.com wrote:
 Hi all!

 During my investigation I found that GCC does not perform load/store
 widening (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65088). Could you
 please confirm whether this is so? And are there any plans to implement it?
 I would also like to know whether there is any need to do load/store
 widening exclusively in the ASan phase, just to reduce the number of
 ASAN_CHECKs.

 Example from the bug:

 $ cat t2.c

 int a[2];
 int b[2];

 int main ()
 {
   b[0] = a[0];
   b[1] = a[1];
   return 0;
 }


The answer is: it depends. GCC's SLP vectorizer can spot this in a generic
form across ports, as in the examples below.


AArch64 :

main:
adrp    x0, a               // 5   *movdi_aarch64/11          [length = 4]
add     x0, x0, :lo12:a     // 6   add_losym_di               [length = 4]
adrp    x1, b               // 8   *movdi_aarch64/11          [length = 4]
add     x1, x1, :lo12:b     // 9   add_losym_di               [length = 4]
ldr     d0, [x0]            // 7   *aarch64_simd_movv2si/1    [length = 4]
mov     w0, 0               // 15  *movsi_aarch64/4           [length = 4]
str     d0, [x1]            // 10  *aarch64_simd_movv2si/2    [length = 4]
ret                         // 40  simple_return              [length = 4]


Or, on AArch32 without NEON, the standard ldm / ldrd peepholes spot this.

main:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
movw    r2, #:lower16:a
movw    r3, #:lower16:b
movt    r2, #:upper16:a
movt    r3, #:upper16:b
ldmia   r2, {r1, r2}
mov     r0, #0
stmia   r3, {r1, r2}
bx      lr


It will be interesting to see whether the number of checks can be reduced,
but I suspect you'll hit quite a few phase-ordering issues, and you'll
have to handle quite a few differences between ports to make this work
sensibly.



regards
Ramana


 $ gcc t2.c -O3 -S

 $ cat t2.s

 ...

 main:
 .LFB0:
 .cfi_startproc
 movl    a(%rip), %eax
 movl    %eax, b(%rip)
 movl    a+4(%rip), %eax
 movl    %eax, b+4(%rip)
 xorl    %eax, %eax
 ret
 .cfi_endproc



 I would very much appreciate your answers and thoughts.

 --Marat



[RFC] load/store widening question

2015-02-19 Thread Marat Zakirov

Hi all!

During my investigation I found that GCC does not perform load/store
widening (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65088). Could you
please confirm whether this is so? And are there any plans to implement it?
I would also like to know whether there is any need to do load/store
widening exclusively in the ASan phase, just to reduce the number of
ASAN_CHECKs.


Example from the bug:

$ cat t2.c

int a[2];
int b[2];

int main ()
{
  b[0] = a[0];
  b[1] = a[1];
  return 0;
}

$ gcc t2.c -O3 -S

$ cat t2.s

...

main:
.LFB0:
.cfi_startproc
movl    a(%rip), %eax
movl    %eax, b(%rip)
movl    a+4(%rip), %eax
movl    %eax, b+4(%rip)
xorl    %eax, %eax
ret
.cfi_endproc



I would very much appreciate your answers and thoughts.

--Marat



Re: [RFC] load/store widening question

2015-02-19 Thread Marat Zakirov


On 02/19/2015 12:25 PM, Ramana Radhakrishnan wrote:


Thank you very much, Ramana.
I would also like the x86 maintainers to explain why GCC on x86 didn't
handle the given example.


--Marat