Re: [i386] Scalar DImode instructions on XMM registers

Ilya Enkovich Wed, 06 May 2015 07:18:59 -0700

2015-04-25 4:32 GMT+03:00 Jan Hubicka <hubi...@ucw.cz>:
> Hi,
> I am adding Vladimir and Richard into CC. I tried to solve a similar problem
> with FP math years ago by having -mfpmath=sse,i387. The idea was to allow
> use of i387 registers when SSE ones run out and possibly also model the fact
> that Pentium4 had faster i387 additions than SSE additions. I also had some
> plans to extend this to mixed SSE/MMX/GPR integer arithmetic, but never
> got to that.
>
> This did not really fly because the regalloc was not really able to
> understand it (I made a patch to regclass to propagate the classes and figure
> out which operations need to stay in i387 and which in SSE to avoid reloading,
> but that never got in).
>
> I believe Vladimir did some work on this with IRA (he is able to spill GPR
> regs into SSE and do a bit of other tricks).
>
> Also I believe it was kind of Richard's design decision to avoid use of
> (paradoxical) subregs for vector conversions because these have funny
> implications.
>
> The code for handling upper parts of paradoxical subregs is controlled by
> macros around SUBREG_PROMOTED_VAR_P but I do not think it will handle
> V1DI->V2DI conversions fluently without some middle-end hacking. (it will
> probably try to produce zero extensions)
>
> While we are on SSE instructions, it would be great to finally teach
> copy_by_pieces/store_by_pieces to use vector instructions (these are more
> compact and either equally fast or faster on some CPUs). I hope to get into
> this, but it would be great if someone beat me to it.
>
> Honza
>


I'm trying to implement it as a separate RTL pass which chooses a
scalar or vector mode for each 64-bit computation chain and performs
the transformation if we choose to use vectors. I also want to split DI
instructions which are going to be implemented on GPRs before RA
(currently this is done in the second split pass). Good metrics for such
a transformation are an open question, but currently I can't even make it
generate correct code when paradoxical subregs are used. It works in
simple cases, but I run into trouble when spills appear.

Trying to beat the following testcase:

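extern void counter (unsigned long long);

void
test (long long *arr)
{
  register unsigned long long tmp;
  /* & binds tighter than |; the parentheses only make that explicit.  */
  tmp = arr[0] | (arr[1] & arr[2]);
  while (tmp)
    {
      counter (tmp);
      tmp = *(arr++) & tmp;
    }
}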

RTL I generate seems OK to me (ignoring the fact that it is not optimal):

(insn 6 3 50 2 (set (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ])
        (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
                (const_int 8 [0x8])) [2 MEM[(long long int *)arr_5(D)
+ 8B]+0 S8 A64])) pr65105-1.c:22 89 {*movdi_internal}
     (nil))
(insn 50 6 7 2 (set (reg:DI 104)
        (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
                (const_int 16 [0x10])) [2 MEM[(long long int
*)arr_5(D) + 16B]+0 S8 A64])) pr65105-1.c:22 -1
     (nil))
(insn 7 50 51 2 (set (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
        (and:V2DI (subreg:V2DI (reg:DI 98 [ MEM[(long long int
*)arr_5(D) + 8B] ]) 0)
            (subreg:V2DI (reg:DI 104) 0))) pr65105-1.c:22 3487 {*andv2di3}
     (expr_list:REG_DEAD (subreg:V2DI (reg:DI 98 [ MEM[(long long int
*)arr_5(D) + 8B] ]) 0)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (expr_list:REG_EQUAL (and:DI (mem:DI (plus:SI (reg/v/f:SI
96 [ arr ])
                            (const_int 8 [0x8])) [2 MEM[(long long int
*)arr_5(D) + 8B]+0 S8 A64])
                    (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
                            (const_int 16 [0x10])) [2 MEM[(long long
int *)arr_5(D) + 16B]+0 S8 A64]))
                (nil)))))
(insn 51 7 8 2 (set (reg:DI 105)
        (mem:DI (reg/v/f:SI 96 [ arr ]) [2 *arr_5(D)+0 S8 A64]))
pr65105-1.c:22 -1
     (nil))
(insn 8 51 46 2 (set (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
        (ior:V2DI (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
            (subreg:V2DI (reg:DI 105) 0))) pr65105-1.c:22 3489 {*iorv2di3}
     (expr_list:REG_DEAD (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 46 8 47 2 (set (reg:V2DI 103)
        (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)) pr65105-1.c:22 -1
     (nil))
(insn 47 46 48 2 (set (subreg:SI (reg:DI 101) 0)
        (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
     (nil))
(insn 48 47 49 2 (set (reg:V2DI 103)
        (lshiftrt:V2DI (reg:V2DI 103)
            (const_int 32 [0x20]))) pr65105-1.c:22 -1
     (nil))
(insn 49 48 9 2 (set (subreg:SI (reg:DI 101) 4)
        (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
     (nil))
(note 9 49 10 2 NOTE_INSN_DELETED)
(insn 10 9 11 2 (parallel [
            (set (reg:CCZ 17 flags)
                (compare:CCZ (ior:SI (subreg:SI (reg:DI 101) 4)
                        (subreg:SI (reg:DI 101) 0))
                    (const_int 0 [0])))
            (clobber (scratch:SI))
        ]) pr65105-1.c:23 447 {*iorsi_3}
     (nil))
(jump_insn 11 10 37 2 (set (pc)
        (if_then_else (ne (reg:CCZ 17 flags)
                (const_int 0 [0]))
            (label_ref:SI 37)
            (pc))) pr65105-1.c:23 619 {*jcc_1}
     (expr_list:REG_DEAD (reg:CCZ 17 flags)
        (int_list:REG_BR_PROB 9100 (nil)))
 -> 37)
(code_label 37 11 36 3 11 "" [2 uses])
(note 36 37 18 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
(insn 18 36 19 3 (set (mem:DI (reg/f:SI 7 sp) [0  S8 A32])
        (reg/v:DI 87 [ tmp ])) pr65105-1.c:25 89 {*movdi_internal}
     (nil))
(call_insn 19 18 20 3 (call (mem:QI (symbol_ref:SI ("counter") [flags
0x3]  <function_decl 0x7f94046ea798 counter>) [0 counter S1 A8])
        (const_int 8 [0x8])) pr65105-1.c:25 666 {*call}
     (expr_list:REG_CALL_DECL (symbol_ref:SI ("counter") [flags 0x3]
<function_decl 0x7f94046ea798 counter>)
        (expr_list:REG_EH_REGION (const_int 0 [0])
            (nil)))
    (expr_list:DI (use (mem:DI (reg/f:SI 7 sp) [0  S8 A32]))
        (nil)))
(insn 20 19 52 3 (parallel [
            (set (reg/v/f:SI 96 [ arr ])
                (plus:SI (reg/v/f:SI 96 [ arr ])
                    (const_int 8 [0x8])))
            (clobber (reg:CC 17 flags))
        ]) pr65105-1.c:26 220 {*addsi_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))
(insn 52 20 21 3 (set (reg:DI 106)
        (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
                (const_int -8 [0xfffffffffffffff8])) [2 MEM[base:
arr_14, offset: 4294967288B]+0 S8 A64])) pr65105-1.c:26 -1
     (nil))
(insn 21 52 42 3 (set (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
        (and:V2DI (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
            (subreg:V2DI (reg:DI 106) 0))) pr65105-1.c:26 3487 {*andv2di3}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))
(insn 42 21 43 3 (set (reg:V2DI 102)
        (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)) pr65105-1.c:26 -1
     (nil))
(insn 43 42 44 3 (set (subreg:SI (reg:DI 101) 0)
        (subreg:SI (reg:V2DI 102) 0)) pr65105-1.c:26 -1
     (nil))
(insn 44 43 45 3 (set (reg:V2DI 102)
        (lshiftrt:V2DI (reg:V2DI 102)
            (const_int 32 [0x20]))) pr65105-1.c:26 -1
     (nil))
(insn 45 44 23 3 (set (subreg:SI (reg:DI 101) 4)
        (subreg:SI (reg:V2DI 102) 0)) pr65105-1.c:26 -1
     (nil))
(note 23 45 24 3 NOTE_INSN_DELETED)
(insn 24 23 25 3 (parallel [
            (set (reg:CCZ 17 flags)
                (compare:CCZ (ior:SI (subreg:SI (reg:DI 101) 4)
                        (subreg:SI (reg:DI 101) 0))
                    (const_int 0 [0])))
            (clobber (scratch:SI))
        ]) pr65105-1.c:23 447 {*iorsi_3}
     (nil))
(jump_insn 25 24 30 3 (set (pc)
        (if_then_else (ne (reg:CCZ 17 flags)
                (const_int 0 [0]))
            (label_ref:SI 37)
            (pc))) pr65105-1.c:23 619 {*jcc_1}
     (expr_list:REG_DEAD (reg:CCZ 17 flags)
        (int_list:REG_BR_PROB 9100 (nil)))
 -> 37)
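
For readers who do not want to decode the dump, here is a rough C sketch
of what insns 6-8 and 46-49 plus insn 10 compute, written with SSE2
intrinsics from <emmintrin.h>. It is only an illustration, not output of
the pass; the function names tmp_on_xmm and low_qword_nonzero are made up
for this sketch. One difference from the RTL: _mm_loadl_epi64 (movq)
zeroes the upper 64 bits of the register, while with a paradoxical subreg
those bits are simply undefined.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: arr[0] | (arr[1] & arr[2]) computed in the low
   64 bits of XMM registers, as insns 6, 50, 7, 51 and 8 do.  */
static __m128i
tmp_on_xmm (const long long *arr)
{
  __m128i a = _mm_loadl_epi64 ((const __m128i *) &arr[1]); /* movq 8(arr)  */
  __m128i b = _mm_loadl_epi64 ((const __m128i *) &arr[2]); /* movq 16(arr) */
  __m128i c = _mm_loadl_epi64 ((const __m128i *) &arr[0]); /* movq (arr)   */
  return _mm_or_si128 (_mm_and_si128 (a, b), c);           /* pand, por    */
}

/* Hypothetical helper: the "tmp != 0" test done as in insns 46-49 and 10.
   Take the low 32 bits, shift the 64-bit lane right by 32, take the low
   32 bits again, and OR the two halves in a GPR.  */
static int
low_qword_nonzero (__m128i v)
{
  uint32_t lo = (uint32_t) _mm_cvtsi128_si32 (v);                     /* movd */
  uint32_t hi = (uint32_t) _mm_cvtsi128_si32 (_mm_srli_epi64 (v, 32)); /* psrlq $32; movd */
  return (lo | hi) != 0;
}
```

The sketch assumes a target with SSE2 enabled (for 32-bit x86 that means
something like -msse2).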


r87 [tmp] has one definition before the loop (insn 8) and one
definition in the loop (insn 21). But after reload I see that the result
of insn 8 is stored to the stack and this stored value is used in the
loop, while the value produced by insn 21 is not stored to the stack.
Therefore a wrong value is used starting from the second loop iteration.
Here is the resulting assembly:

test:
.LFB10:
        .cfi_startproc
        pushl   %ebx
        .cfi_def_cfa_offset 8
        .cfi_offset 3, -8
        leal    -40(%esp), %esp
        .cfi_def_cfa_offset 48
        movl    48(%esp), %ebx
        movq    8(%ebx), %xmm1
        movq    16(%ebx), %xmm0
        pand    %xmm1, %xmm0
        movq    (%ebx), %xmm1
        movdqa  %xmm0, %xmm4
        por     %xmm1, %xmm4
        movdqa  %xmm4, %xmm0
        movd    %xmm4, %edx
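        movq    %xmm4, 16(%esp)         # <-- insn 8 result stored to the stack
        psrlq   $32, %xmm0
        movd    %xmm0, %eax
        orl     %edx, %eax
        je      .L7
        .p2align 4,,15
.L11:
        movl    16(%esp), %eax          # <-- stored value reloaded (low half)
        addl    $8, %ebx
        movl    20(%esp), %edx          # <-- stored value reloaded (high half)
        movl    %eax, (%esp)
        movl    %edx, 4(%esp)
        call    counter
        movq    -8(%ebx), %xmm0
        movdqa  16(%esp), %xmm2         # <-- stored value reloaded again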
        pand    %xmm0, %xmm2
        movdqa  %xmm2, %xmm0
        movd    %xmm2, %edx
        psrlq   $32, %xmm0
        movd    %xmm0, %eax
        orl     %edx, %eax
        jne     .L11
.L7:
        leal    40(%esp), %esp
        .cfi_def_cfa_offset 8
        popl    %ebx
        .cfi_restore 3
        .cfi_def_cfa_offset 4
        ret
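
To make the failure concrete, here is a small C model of the two
behaviours. It is a hand-written sketch, not GCC code: the variable
slot stands for the stack slot at 16(%esp), and the seen array with
its max bound replaces the call to counter so that the values can be
inspected. The names run_expected and run_miscompiled are made up for
this illustration.

```c
#include <stddef.h>

/* What the source asks for: counter() sees the current tmp.  */
static size_t
run_expected (const unsigned long long *arr,
              unsigned long long *seen, size_t max)
{
  unsigned long long tmp = arr[0] | (arr[1] & arr[2]);
  size_t n = 0;
  while (tmp && n < max)
    {
      seen[n++] = tmp;          /* counter (tmp) */
      tmp = *(arr++) & tmp;
    }
  return n;
}

/* What the assembly above does: the slot is written once before the
   loop and every later use of tmp inside the loop reads the slot.  */
static size_t
run_miscompiled (const unsigned long long *arr,
                 unsigned long long *seen, size_t max)
{
  unsigned long long tmp = arr[0] | (arr[1] & arr[2]);
  unsigned long long slot = tmp;    /* movq %xmm4, 16(%esp) */
  size_t n = 0;
  while (tmp && n < max)
    {
      seen[n++] = slot;             /* movl 16(%esp) / movl 20(%esp) */
      tmp = *(arr++) & slot;        /* movdqa 16(%esp), %xmm2; pand */
      /* Missing here: slot = tmp;  */
    }
  return n;
}
```

With the input {0x0F, 0xFF, 0xF0, 0x03, 0x00} the first value passed to
counter is the same in both versions; they diverge from the second
iteration on, which matches what I see.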

Am I misusing paradoxical subregs? Is there any other way to mix scalar
and vector code and perform vector casts?

BTW, this test works OK with another option set, where r87 is not
spilled to memory but is instead kept in GPRs across the call.

Thanks,
Ilya
