Re: [RFC] load/store widening question
On Thu, Feb 19, 2015 at 9:17 AM, Marat Zakirov <m.zaki...@samsung.com> wrote:
> Hi all!
>
> During my investigation I found that GCC does not perform load/store
> widening (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65088). Could you
> please confirm whether this is so? And are there any plans to implement it?
> I would also like to know whether it would make sense to implement
> load/store widening exclusively in the ASan pass, just to reduce the
> number of ASAN_CHECKs.
>
> Example from the bug:
>
> $ cat t2.c
> int a[2];
> int b[2];
> int main ()
> {
>   b[0] = a[0];
>   b[1] = a[1];
>   return 0;
> }

The answer is: it depends. GCC can have SLP spot this in a generic form across ports, as in the example below.

AArch64:

main:
    adrp  x0, a            // 5   *movdi_aarch64/11        [length = 4]
    add   x0, x0, :lo12:a  // 6   add_losym_di             [length = 4]
    adrp  x1, b            // 8   *movdi_aarch64/11        [length = 4]
    add   x1, x1, :lo12:b  // 9   add_losym_di             [length = 4]
    ldr   d0, [x0]         // 7   *aarch64_simd_movv2si/1  [length = 4]
    mov   w0, 0            // 15  *movsi_aarch64/4         [length = 4]
    str   d0, [x1]         // 10  *aarch64_simd_movv2si/2  [length = 4]
    ret                    // 40  simple_return            [length = 4]

Or, on AArch32 without Neon, the standard ldm and ldrd peepholes spot this:

main:
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    movw  r2, #:lower16:a
    movw  r3, #:lower16:b
    movt  r2, #:upper16:a
    movt  r3, #:upper16:b
    ldmia r2, {r1, r2}
    mov   r0, #0
    stmia r3, {r1, r2}
    bx    lr

It will be interesting to see whether the number of checks can be reduced, but I suspect you'll hit quite a few phase-ordering issues, and you'll have to deal with quite a lot of variance between ports to make this work sensibly.

regards
Ramana

> $ gcc t2.c -O3 -S
> $ cat t2.s
> ...
> main:
> .LFB0:
>     .cfi_startproc
>     movl  a(%rip), %eax
>     movl  %eax, b(%rip)
>     movl  a+4(%rip), %eax
>     movl  %eax, b+4(%rip)
>     xorl  %eax, %eax
>     ret
>     .cfi_endproc
>
> I would very much appreciate your answers and thoughts.
>
> --Marat
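Both the SLP grouping on AArch64 and the ldm/ldrd peepholes on AArch32 depend on noticing that two accesses of the same size touch adjacent addresses off the same base. A minimal sketch of that adjacency test plus a greedy pairing step follows. It works under loud simplifying assumptions: `struct access`, `can_pair` and `pair_accesses` are hypothetical names for illustration, not GCC internals, and the sketch ignores alignment, volatile accesses, intervening aliasing stores and the target-specific legality checks that real passes must perform.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified descriptor of one memory access:
   the symbol it is based on, byte offset from that symbol, size in bytes.  */
struct access
{
  const char *base;
  long offset;
  int size;
};

/* Two accesses are pairable when they use the same base, have the same
   size, the second starts exactly where the first ends, and the combined
   access does not exceed MAX_SIZE bytes.  Alignment, volatility, aliasing
   and target legality are deliberately left out of this sketch.  */
bool
can_pair (const struct access *x, const struct access *y, int max_size)
{
  return strcmp (x->base, y->base) == 0
         && x->size == y->size
         && 2 * x->size <= max_size
         && y->offset == x->offset + x->size;
}

/* Greedily merge adjacent pairable accesses in place.
   Returns the number of accesses left in ACC.  */
size_t
pair_accesses (struct access *acc, size_t n, int max_size)
{
  size_t out = 0;
  for (size_t i = 0; i < n; i++)
    {
      if (i + 1 < n && can_pair (&acc[i], &acc[i + 1], max_size))
        {
          acc[out] = acc[i];
          acc[out].size *= 2;   /* one double-width access */
          out++;
          i++;                  /* skip the access that was absorbed */
        }
      else
        acc[out++] = acc[i];
    }
  return out;
}
```

With `max_size` set to 8, the two 4-byte loads of `a[0]` and `a[1]` from t2.c collapse into a single 8-byte access at offset 0, which is the situation in which any instrumentation running afterwards would see one access to check instead of two. Accesses off different bases, or pairs that would exceed `max_size`, are left alone.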
[RFC] load/store widening question
Hi all!

During my investigation I found that GCC does not perform load/store widening (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65088). Could you please confirm whether this is so? And are there any plans to implement it? I would also like to know whether it would make sense to implement load/store widening exclusively in the ASan pass, just to reduce the number of ASAN_CHECKs.

Example from the bug:

$ cat t2.c
int a[2];
int b[2];
int main ()
{
  b[0] = a[0];
  b[1] = a[1];
  return 0;
}
$ gcc t2.c -O3 -S
$ cat t2.s
...
main:
.LFB0:
    .cfi_startproc
    movl  a(%rip), %eax
    movl  %eax, b(%rip)
    movl  a+4(%rip), %eax
    movl  %eax, b+4(%rip)
    xorl  %eax, %eax
    ret
    .cfi_endproc

I would very much appreciate your answers and thoughts.

--Marat
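To make the question concrete: widening here means replacing the two 4-byte copies with one 8-byte load and one 8-byte store. A minimal hand-written sketch of the widened form follows, assuming a 4-byte `int` (checked at compile time) and that `uint64_t` is an acceptable wide type on the target; `copy_narrow` and `copy_wide` are illustrative names, not anything from the bug report. `memcpy` is used instead of a pointer cast because `a` and `b` are only guaranteed `int` alignment, and a cast would also violate strict aliasing.

```c
#include <stdint.h>
#include <string.h>

int a[2];
int b[2];

/* The pattern from the bug: two 4-byte loads and two 4-byte stores.  */
void
copy_narrow (void)
{
  b[0] = a[0];
  b[1] = a[1];
}

/* Hand-widened equivalent: both ints travel as one 8-byte value,
   i.e. one 8-byte load from a and one 8-byte store to b.
   Assumes sizeof (int) == 4, so that a and b each span 8 bytes.  */
void
copy_wide (void)
{
  uint64_t tmp;
  _Static_assert (sizeof tmp == sizeof a, "a must span exactly 8 bytes");
  memcpy (&tmp, a, sizeof tmp);
  memcpy (b, &tmp, sizeof tmp);
}
```

At -O2 and above GCC normally expands a constant-size 8-byte `memcpy` inline rather than calling the library, so `copy_wide` would typically become one load and one store. Under ASan that is roughly two checked accesses instead of the four in `copy_narrow`, which is the saving the question is asking about.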
Re: [RFC] load/store widening question
On 02/19/2015 12:25 PM, Ramana Radhakrishnan wrote:
> On Thu, Feb 19, 2015 at 9:17 AM, Marat Zakirov <m.zaki...@samsung.com> wrote:
>> [...]
> The answer is: it depends. GCC can have SLP spot this in a generic form
> across ports. [...]

Thank you very much, Ramana.

I would also like the x86 maintainers to explain why x86 GCC doesn't handle the given example.

--Marat