https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60145

Georg-Johann Lay <gjl at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|other                       |tree-optimization

--- Comment #4 from Georg-Johann Lay <gjl at gcc dot gnu.org> ---
(In reply to Matthijs Kooijman from comment #3)
> I suppose you meant
> https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=242907 instead of the
> commit you linked (which is also nice btw, I noticed that extra sbiw in some
> places as well).

Ah yes, I meant r242907.

> Looking at the generated assembly, the optimizations look like fairly simple
> (composable) translations, but I assume that the optimization needs to
> happen before/while the assembly is generated, not afterwards. And then I
> can see that the patterns would indeed become complex.

Well, 4 extensions from u8 to u32, 3 shifts and 3 ORs make 10 operations, all
of which would have to be matched as ONE single insn if the backend were to
catch this...  That needs at least one intermediate pattern, because otherwise
the insn combiner will not combine such complex expressions.
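
For illustration only (the name join4_shift is made up, not taken from the
original report), the kind of open-coded composition that produces those 10
operations looks roughly like this:

#include <stdint.h>

/* Sketch only: on RTL this expands to four zero-extends, three shifts
   and three ORs, which combine cannot merge into a single insn without
   intermediate patterns.  */
uint32_t join4_shift (uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return (uint32_t) b0
         | ((uint32_t) b1 << 8)
         | ((uint32_t) b2 << 16)
         | ((uint32_t) b3 << 24);
}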

> My goal was indeed to compose values. Using a union is endian-dependent,
> which is a downside.

Maybe you can use GCC built-in macros like __BYTE_ORDER__ to factor out the
endianness?

#define __ORDER_LITTLE_ENDIAN__ 1234
#define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__

(as displayed with -E -dM | grep END)
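
For example, a union-based join could be guarded on these macros (just a
sketch; the union layout and the name join4_union are illustrative, not from
the report):

#include <stdint.h>

/* Sketch only: select the byte positions according to the target's
   byte order as reported by the built-in macros.  GCC permits
   type-punning through unions.  */
uint32_t join4_union (uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    union { uint8_t b[4]; uint32_t u; } x;
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    x.b[0] = b0;  x.b[1] = b1;  x.b[2] = b2;  x.b[3] = b3;
#else
    x.b[0] = b3;  x.b[1] = b2;  x.b[2] = b1;  x.b[3] = b0;
#endif
    return x.u;
}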

> If I understand your vector-example correctly, vectors are always stored as
> big endian, so using this approach would be portable? I couldn't find
> anything about this in the documentation, though.

Looking at the code that the example

#include <stdint.h>

typedef uint8_t v4 __attribute__((vector_size (4)));

uint32_t join4 (uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    v4 vec;
    vec[0] = b0;
    vec[1] = b1;
    vec[2] = b2;
    vec[3] = b3;
    return (uint32_t) vec;
}

generates, it is little-endian, i.e. b0 is stored in the low byte of vec, so
that vectors behave like arrays:

join4:
        mov r23,r22 ; b1
        mov r25,r18 ; b3
        mov r22,r24 ; b0
        mov r24,r20 ; b2
        ret

At the tree level, this is represented with BIT_INSERT_EXPR (from
-fdump-tree-optimized):

join4 (uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
  v4 vec;
  uint32_t _6;

  <bb 2>:
  vec_8  = BIT_INSERT_EXPR <vec_7(D), b0_2(D), 0 (8 bits)>;
  vec_9  = BIT_INSERT_EXPR <vec_8, b1_3(D), 8 (8 bits)>;
  vec_10 = BIT_INSERT_EXPR <vec_9, b2_4(D), 16 (8 bits)>;
  vec_11 = BIT_INSERT_EXPR <vec_10, b3_5(D), 24 (8 bits)>;
  _6 = VIEW_CONVERT_EXPR<uint32_t>(vec_11);
  return _6;
}

Hence an open-coded composition of byte values into a 4-byte value can be
represented neatly at the tree level, and the resulting code is quite an
improvement over ZERO_EXTEND + ASHIFT + IOR.

I am therefore reclassifying this as a tree-level optimization issue.
