On 12 June 2012 11:46, Richard Guenther <[email protected]> wrote:
> On Tue, Jun 12, 2012 at 11:22 AM, Julian Brown <[email protected]>
> wrote:
>> On Mon, 11 Jun 2012 16:46:27 +0100
>> Ramana Radhakrishnan <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I don't like the ML bits of the patch as it stands today and before
>>> committing I would like to clean up the ML bits quite a bit further
>>> especially in areas where I've put FIXMEs [...]
>>
>> I had a go at this, see attached. Untested. Note there are some
>> semantic differences in output:
>>
>> vzipq_p8 (poly8x16_t __a, poly8x16_t __b)
>> {
>> poly8x16x2_t __rv;
>> - uint8x16_t __mask1 = {0, 2};
>> - uint8x16_t __mask2 = {1, 3};
>> - __rv.val[0] = (poly8x16_t)__builtin_shuffle (__a, __b, __mask1);
>> - __rv.val[1] = (poly8x16_t)__builtin_shuffle (__a, __b, __mask2);
>> + uint8x16_t __mask1 = { 0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6,
>> 22, 7, 23 };
>> + uint8x16_t __mask2 = { 8, 24, 9, 25, 10, 26, 11, 27, 12, 28, 13, 29,
>> 14, 30, 15, 31 };
>> + __rv.val[0] = (poly8x16_t) __builtin_shuffle (__a, __b, __mask1);
>> + __rv.val[1] = (poly8x16_t) __builtin_shuffle (__a, __b, __mask2);
>
> You should get better code at -O0 when not using a temporary __mask1/__mask2
> but directly pasting the constant in the builtin call.

I tried that yesterday, but it didn't seem to help. From a quick peek
at the dumps, it looks like we could do with some limited constant
propagation just for the vec_perm expand cases:

D.14032 = { 0, 8, 1, 9, 2, 10, 3, 11 };
D.14044 = VEC_PERM_EXPR <__a, __b, D.14032>;

That's what I see in the dumps. From a quick skim of the sources, my
suspicion is that lower_vec_perm in tree-vect-generic.c is where we
could try doing a limited constant propagation in this case. Is that
where one should attempt to fix this?

Consider the following testcase, rewritten with just __builtin_shuffle
so that you can also see the effect on other platforms that have
vec_perm_const defined for interleave-style operations, and compare
what you get at -O0 with what you get at -O1. On arm-linux-gnueabi
with -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a9, at -O0 you'd see
it use the generic permute operations, and at -O1 you'd see a vzip.32
instruction:

typedef int v4si __attribute__ ((vector_size (16)));

v4si vs (v4si a, v4si b)
{
  return __builtin_shuffle (a, b, (v4si) {0, 4, 1, 5});
}
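
For reference, here is a rough scalar model of how I read the
two-operand __builtin_shuffle selector, plus a helper that builds the
interleave masks. It is only a sketch in plain C, not what GCC does
internally; the names shuffle2_model and zip_masks are made up for
illustration, and it assumes the documented behaviour that selector
elements index the concatenation of the two inputs modulo 2*N.

```c
#include <assert.h>

/* Scalar model (hypothetical helper, not GCC code) of
   __builtin_shuffle (a, b, mask) on n-element vectors: element i of
   the result comes from the concatenation of a and b, at index
   mask[i] taken modulo 2*n.  */
static void
shuffle2_model (const int *a, const int *b, const unsigned *mask,
                int *out, unsigned n)
{
  for (unsigned i = 0; i < n; i++)
    {
      unsigned idx = mask[i] % (2 * n);
      out[i] = idx < n ? a[idx] : b[idx - n];
    }
}

/* Build the two interleave ("zip") selectors for n-element vectors
   (hypothetical helper):
     lo = { 0, n, 1, n+1, ... }          interleaves the low halves
     hi = { n/2, n+n/2, n/2+1, ... }     interleaves the high halves  */
static void
zip_masks (unsigned n, unsigned *lo, unsigned *hi)
{
  assert (n % 2 == 0);
  for (unsigned i = 0; i < n / 2; i++)
    {
      lo[2 * i] = i;
      lo[2 * i + 1] = n + i;
      hi[2 * i] = n / 2 + i;
      hi[2 * i + 1] = n + n / 2 + i;
    }
}
```

With n = 4 the low selector comes out as {0, 4, 1, 5}, the one used in
the testcase above, and with n = 16 the two selectors match the
16-element masks in Julian's version of vzipq_p8.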
regards
Ramana
>
>> return __rv;
>> }
>>
>> I wasn't quite sure which version was correct -- but your version
>> doesn't seem to have enough elements for these cases?
>>
>> HTH,
>>
>> Julian