Re: [Patch]Bug 89057 - [8/9/10 Regression] AArch64 ld3 st4 less optimized

2019-04-29 Thread Jaydeep Chauhan
Hello Ramana,

Description for bug-89057:
--
This issue is occurs specifically with vcombine* method with optimization.
Assembly generated by vcombine* is less optimize. But individual
vcombine* method
can combining instruction properly and giving optimized output as per
below case.


Testcase 01:
---
int16x8_t
foo ()
{
  return vcombine_s16 (vdup_n_s16 (0), vdup_n_s16 (8));
}

Output with trunk ("-O2"):(Same for with patch)
-
foo:
.LFB3987:
.cfi_startproc
adrpx0, .LC0
ldr q0, [x0, #:lo12:.LC0]
ret
.cfi_endproc
.LFE3987:
.size   foo, .-foo
.section.rodata.cst16,"aM",@progbits,16
.align  4
.LC0:
.hword  0
.hword  0
.hword  0
.hword  0
.hword  8
.hword  8
.hword  8
.hword  8


But if vcombine call throught intrinsic function vst4_u8 combining is
not perform properly.
testcase is given below.

Testcase 02:
---

#include 
#include 

void qt_convert_rgb888_to_rgb32_neon(unsigned *dst, const unsigned
char *src, int len)
{
const unsigned *const end = dst + len;

const unsigned *const simdEnd = end - 7;
// non-inline asm version (uses more moves)
uint8x8x4_t dstVector;
dstVector.val[3] = vdup_n_u8(0xff);
do {
uint8x8x3_t srcVector = vld3_u8(src);
src += 3 * 8;
dstVector.val[0] = srcVector.val[2];
dstVector.val[1] = srcVector.val[1];
dstVector.val[2] = srcVector.val[0];
///Internally vst4_u8 is calling vcombine
vst4_u8((uint8_t*)dst, dstVector);
dst += 8;
} while (dst < simdEnd);

}

Output with trunk ("-O2"):
-
.L2:
ld3 {v4.8b - v6.8b}, [x1]
adrpx3, .LC1
add x1, x1, 24
ldr q3, [x3, #:lo12:.LC1]
mov v16.8b, v6.8b
mov v7.8b, v5.8b
mov v4.8b, v4.8b
ins v16.d[1], v17.d[0]
ins v7.d[1], v17.d[0]
ins v4.d[1], v17.d[0]
mov v0.16b, v16.16b
mov v1.16b, v7.16b
mov v2.16b, v4.16b
st4 {v0.8b - v3.8b}, [x0]
add x0, x0, 32
cmp x2, x0
bhi .L2
ret
.cfi_endproc

Output with trunk ("-O2") with patch:
-
.L2:
ld3 {v4.8b - v6.8b}, [x1]
adrpx3, .LC1
add x1, x1, 24
ldr q3, [x3, #:lo12:.LC1]
mov v0.8b, v6.8b
mov v1.8b, v5.8b
mov v2.8b, v4.8b
st4 {v0.8b - v3.8b}, [x0]
add x0, x0, 32
cmp x2, x0
bhi .L2
ret

For testcase 2 combine instruction is not done properly with
aarch64_split_simd_combine
method. Internally it is combine by complex way so that it is
generating more assembly
instruction in output. So to do combining instruction with optimal way
i have added
"aarch64_combine_internal"  created for this.

We can optimize output of both testcases with only
"aarch64_combine_internal" ,But when we
run without optimization it is create problem in split instruction. So
that i have
handle both cases with optimization ("aarch64_combine_internal")and
without optimizations (aarch64_split_simd_combine) separately.

Testing:
---
I have tested both testcase by executing below command in build directory
make -k check-gcc RUNTESTFLAGS="aarch64.exp=pr* -v" &>test.txt
and verified output of test.txt and also gcc.log.
In gcc.log status is for both case is PASS.

Thanks,
Jaydeep.

On Mon, Apr 29, 2019 at 4:19 PM Umesh Kalappa  wrote:
>
> >>Before getting started with reviewing the patch , the first question
> is whether you have a copyright assignment on file or does your
> employer have one on record with the FSF ?
>
> Ramana, We asked for copyright assignment form ,with details asked by
> copyright-cl...@fsf.org(craig)
> @ ass...@gnu.org.
>
> Waiting for the reply .
> ~Umesh
>
> On Mon, Apr 29, 2019 at 2:55 PM Ramana Radhakrishnan
>  wrote:
> >
> > On Mon, Apr 29, 2019 at 8:44 AM Jaydeep Chauhan
> >  wrote:
> > >
> > > Hi All,
> > >
> > > The attached patch (89057.patch) fixes the subjected issue.
> > > Please let me know your thoughts on the patch.
> >
> > Thanks for your patch.
> >
> > Before getting started with reviewing the patch , the first question
> > is whether you have a copyright assignment on file or does your
> > employer have one on record with the FSF ?
> >
> > Further could you elaborate more on what you have done to fix this by
> > 

[Patch]Bug 89057 - [8/9/10 Regression] AArch64 ld3 st4 less optimized

2019-04-29 Thread Jaydeep Chauhan
Hi All,

The attached patch (89057.patch) fixes the subjected issue.
Please let me know your thoughts on the patch.

Thanks,
Jaydeep.


89057.patch
Description: Binary data