arm: Implement SVE Permute - Predicates Group

Richard Henderson Fri, 23 Feb 2018 11:59:48 -0800

On 02/23/2018 07:15 AM, Peter Maydell wrote:
>> +static const uint64_t expand_bit_data[5][2] = {
>> +    { 0x1111111111111111ull, 0x2222222222222222ull },
>> +    { 0x0303030303030303ull, 0x0c0c0c0c0c0c0c0cull },
>> +    { 0x000f000f000f000full, 0x00f000f000f000f0ull },
>> +    { 0x000000ff000000ffull, 0x0000ff000000ff00ull },
>> +    { 0x000000000000ffffull, 0x00000000ffff0000ull }
>> +};
>> +
>> +/* Expand units of 2**N bits to units of 2**(N+1) bits,
>> +   with the higher bits zero.  */
> 
> In bitops.h we call this operation "half shuffle" (where
> it is specifically working on units of 1 bit size), and
> the inverse "half unshuffle". Worth mentioning that (or
> using similar terminology) ?


I hadn't noticed this helper.  I'll at least mention it.

FWIW, the half_un/shuffle operation is what you get with N=0, which corresponds
to a byte predicate interleave.  We need the intermediate steps for half,
single, and double predicate interleaves.

>> +static uint64_t expand_bits(uint64_t x, int n)
>> +{
>> +    int i, sh;
> 
> Worth asserting that n is within the range we expect it to be ?
> (what range is that? 0 to 4?)

N goes from 0-3; I goes from 0-4.  N will have been controlled by decode, so
I'm not sure it's worth an assert.  Even if I did add one, I wouldn't want it
here, at the center of a loop kernel.
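
A minimal sketch of how the table quoted above can drive the expansion. The
loop body is an assumption reconstructed from the table and its comment, not
the elided patch text. expand_words is a hypothetical caller, included only to
show a range check sitting outside the kernel; the 0-3 bound is the one stated
above.

```c
#include <assert.h>
#include <stdint.h>

/* Table copied from the quoted patch.  Row i: [0] keeps the low 2**i bits
   of every 2**(i+2)-bit slot, [1] keeps the 2**i bits just above them.  */
static const uint64_t expand_bit_data[5][2] = {
    { 0x1111111111111111ull, 0x2222222222222222ull },
    { 0x0303030303030303ull, 0x0c0c0c0c0c0c0c0cull },
    { 0x000f000f000f000full, 0x00f000f000f000f0ull },
    { 0x000000ff000000ffull, 0x0000ff000000ff00ull },
    { 0x000000000000ffffull, 0x00000000ffff0000ull }
};

/* Assumed loop: i runs from 4 down to n, halving the shift each step, so
   the low 32 bits of x end up as units of 2**n bits, each zero-extended
   into a slot of 2**(n+1) bits.  */
static uint64_t expand_bits(uint64_t x, int n)
{
    int i, sh;

    for (i = 4, sh = 16; i >= n; i--, sh >>= 1) {
        x = ((x & expand_bit_data[i][1]) << sh) | (x & expand_bit_data[i][0]);
    }
    return x;
}

/* Hypothetical caller: check n once, outside the per-word kernel.  */
static void expand_words(uint64_t *d, const uint32_t *s, int words, int n)
{
    int w;

    assert(n >= 0 && n <= 3);
    for (w = 0; w < words; w++) {
        d[w] = expand_bits(s[w], n);
    }
}
```

With n=0 every source bit lands in an even bit position, which is the half
shuffle case; with n=3 each byte is widened to 16 bits.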

>> +        d[0] = nn + (mm << (1 << esz));
> 
> Is this actually doing an addition, or is it just an odd
> way of writing a bitwise OR when neither of the two
> inputs have 1 in the same bit position?

It could be an OR.  Here I'm hoping that the compiler will use a shift-add
instruction, which requires it to know that the two operands share no set bits.
It wouldn't necessarily be able to prove that by itself if I wrote it with an OR.
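
A small sketch of the equivalence being relied on, assuming both inputs have
already been expanded so that they occupy only the even-numbered units.
interleave_units is a hypothetical name, and the assert is there only to make
the precondition visible; it would not belong in the real kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: nn and mm hold data only in even units of
   2**esz bits; mm is moved into the odd units and merged with nn.  */
static uint64_t interleave_units(uint64_t nn, uint64_t mm, int esz)
{
    int sh = 1 << esz;
    uint64_t hi = mm << sh;

    /* No bit is set in both operands, so the add cannot carry and
       nn + hi has the same value as nn | hi.  */
    assert((nn & hi) == 0);
    return nn + hi;
}
```

Whether the addition actually turns into a single shift-add instruction depends
on the target and the shift count, so treat that part as a hope rather than a
guarantee.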

>> +        d[0] = extract64(l + (h << (4 * oprsz)), 0, 8 * oprsz);
> 
> This looks like it's using addition for logical OR again ?

Yes.  Although this time I admit it'll never produce an LEA.
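
For illustration, a sketch of the same pattern for the small-predicate case.
extract64_sketch is a local stand-in modelled on extract64() from bitops.h, and
join_halves is a hypothetical helper. It assumes oprsz is between 1 and 8 and
that l fits in the low 4 * oprsz bits.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for extract64(): LENGTH bits of VALUE starting at bit START.  */
static uint64_t extract64_sketch(uint64_t value, int start, int length)
{
    assert(start >= 0 && length > 0 && length <= 64 - start);
    return (value >> start) & (~0ull >> (64 - length));
}

/* Hypothetical helper: place h above l and keep 8 * oprsz bits.
   Since l is assumed to fit in 4 * oprsz bits, the add cannot carry
   into h and behaves like an OR.  */
static uint64_t join_halves(uint64_t l, uint64_t h, int oprsz)
{
    return extract64_sketch(l + (h << (4 * oprsz)), 0, 8 * oprsz);
}
```

The shift count here is 4 * oprsz, which is not one of the small scale factors
an x86 LEA can encode, hence no LEA.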

>> +        /* For VL which is not a power of 2, the results from M do not
>> +           align nicely with the uint64_t for D.  Put the aligned results
>> +           from M into TMP_M and then copy it into place afterward.  */
> 
> How much risu testing did you do of funny vector lengths ?

As much as I can with the unlicensed Foundation Platform: all lengths from 1-4.

Unfortunately, that does leave a few multi-word predicate paths untested, but
many of the routines loop identically within this length and beyond.


>> +static const uint64_t even_bit_esz_masks[4] = {
>> +    0x5555555555555555ull,
>> +    0x3333333333333333ull,
>> +    0x0f0f0f0f0f0f0f0full,
>> +    0x00ff00ff00ff00ffull
>> +};
> 
> Comment describing the purpose of these numbers would be useful.

Ack.
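
As a sketch of what such a comment could say: entry esz appears to select the
even-numbered units of 2**esz bits, which would be the even-numbered predicate
elements if a predicate carries one bit per vector byte. even_units_mask below
is a hypothetical helper that rebuilds each entry from that description.

```c
#include <stdint.h>

/* Table copied from the quoted patch.  */
static const uint64_t even_bit_esz_masks[4] = {
    0x5555555555555555ull,
    0x3333333333333333ull,
    0x0f0f0f0f0f0f0f0full,
    0x00ff00ff00ff00ffull
};

/* Hypothetical helper: build the mask of even-numbered units, where a
   unit is 2**esz bits wide (esz in 0..3).  */
static uint64_t even_units_mask(int esz)
{
    int width = 1 << esz;                 /* 1, 2, 4 or 8 bits */
    uint64_t unit = (1ull << width) - 1;  /* one unit of all-ones */
    uint64_t mask = 0;
    int pos;

    /* Set units 0, 2, 4, ... and skip the odd ones in between.  */
    for (pos = 0; pos < 64; pos += 2 * width) {
        mask |= unit << pos;
    }
    return mask;
}
```

For each esz from 0 to 3, the computed value matches the corresponding table
entry.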


r~

Re: [Qemu-devel] [PATCH v2 28/67] target/arm: Implement SVE Permute - Predicates Group
