On 05/18/2015 08:13 AM, Ilya Enkovich wrote:
2015-05-06 17:18 GMT+03:00 Ilya Enkovich <enkovich....@gmail.com>:
2015-04-25 4:32 GMT+03:00 Jan Hubicka <hubi...@ucw.cz>:
Hi,
I am adding Vladimir and Richard to CC. I tried to solve a similar problem
with FP math years ago by adding -mfpmath=sse,387. The idea was to allow
use of i387 registers when the SSE ones run out and possibly also to model the
fact that Pentium4 had faster i387 additions than SSE additions. I also had some
plans to extend this to mixed SSE/MMX/GPR integer arithmetic, but never
got to that.

This did not really fly because the regalloc was not really able to
understand it (I made a patch to regclass to propagate the classes and figure
out which operations need to stay in i387 and which in SSE to avoid reloading,
but that never got in).

I believe Vladimir did some work on this with IRA (he is able to spill GPR
regs into SSE and do a few other tricks).

Also I believe it was kind of Richard's design decision to avoid the use of
(paradoxical) subregs for vector conversions because these have funny
implications.

The code for handling upper parts of paradoxical subregs is controlled by
macros around SUBREG_PROMOTED_VAR_P but I do not think it will handle
V1DI->V2DI conversions fluently without some middle-end hacking. (it will
probably try to produce zero extensions)

While we are on the subject of SSE instructions, it would be great to finally
teach copy_by_pieces/store_by_pieces to use vector instructions (these are more
compact and either equally fast or faster on some CPUs). I hope to get to
this, but it would be great if someone beat me to it.

Honza

I'm trying to implement it as a separate RTL pass which chooses a
scalar/vector mode for each 64bit computation chain and performs the
transformation if we choose to use vectors. I also want to split DI
instructions which are going to be implemented on GPRs before RA
(currently it is done in the second split). Good metrics for such a
transformation are an open question, but currently I can't even make it
generate correct code when paradoxical subregs are used. It works in
simple cases but I run into trouble when spills appear.

I am trying to get the following testcase to work:

test (long long *arr)
{
   register unsigned long long tmp;
   tmp = arr[0] | arr[1] & arr[2];
   while (tmp)
     {
       counter (tmp);
       tmp = *(arr++) & tmp;
     }
}

RTL I generate seems OK to me (ignoring the fact that it is not optimal):

(insn 6 3 50 2 (set (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ])
         (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
                 (const_int 8 [0x8])) [2 MEM[(long long int *)arr_5(D)
+ 8B]+0 S8 A64])) pr65105-1.c:22 89 {*movdi_internal}
      (nil))
(insn 50 6 7 2 (set (reg:DI 104)
         (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
                 (const_int 16 [0x10])) [2 MEM[(long long int
*)arr_5(D) + 16B]+0 S8 A64])) pr65105-1.c:22 -1
      (nil))
(insn 7 50 51 2 (set (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
         (and:V2DI (subreg:V2DI (reg:DI 98 [ MEM[(long long int
*)arr_5(D) + 8B] ]) 0)
             (subreg:V2DI (reg:DI 104) 0))) pr65105-1.c:22 3487 {*andv2di3}
      (expr_list:REG_DEAD (subreg:V2DI (reg:DI 98 [ MEM[(long long int
*)arr_5(D) + 8B] ]) 0)
         (expr_list:REG_UNUSED (reg:CC 17 flags)
             (expr_list:REG_EQUAL (and:DI (mem:DI (plus:SI (reg/v/f:SI
96 [ arr ])
                             (const_int 8 [0x8])) [2 MEM[(long long int
*)arr_5(D) + 8B]+0 S8 A64])
                     (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
                             (const_int 16 [0x10])) [2 MEM[(long long
int *)arr_5(D) + 16B]+0 S8 A64]))
                 (nil)))))
(insn 51 7 8 2 (set (reg:DI 105)
         (mem:DI (reg/v/f:SI 96 [ arr ]) [2 *arr_5(D)+0 S8 A64]))
pr65105-1.c:22 -1
      (nil))
(insn 8 51 46 2 (set (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
         (ior:V2DI (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
             (subreg:V2DI (reg:DI 105) 0))) pr65105-1.c:22 3489 {*iorv2di3}
      (expr_list:REG_DEAD (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
         (expr_list:REG_UNUSED (reg:CC 17 flags)
             (nil))))
(insn 46 8 47 2 (set (reg:V2DI 103)
         (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)) pr65105-1.c:22 -1
      (nil))
(insn 47 46 48 2 (set (subreg:SI (reg:DI 101) 0)
         (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
      (nil))
(insn 48 47 49 2 (set (reg:V2DI 103)
         (lshiftrt:V2DI (reg:V2DI 103)
             (const_int 32 [0x20]))) pr65105-1.c:22 -1
      (nil))
(insn 49 48 9 2 (set (subreg:SI (reg:DI 101) 4)
         (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
      (nil))
(note 9 49 10 2 NOTE_INSN_DELETED)
(insn 10 9 11 2 (parallel [
             (set (reg:CCZ 17 flags)
                 (compare:CCZ (ior:SI (subreg:SI (reg:DI 101) 4)
                         (subreg:SI (reg:DI 101) 0))
                     (const_int 0 [0])))
             (clobber (scratch:SI))
         ]) pr65105-1.c:23 447 {*iorsi_3}
      (nil))
(jump_insn 11 10 37 2 (set (pc)
         (if_then_else (ne (reg:CCZ 17 flags)
                 (const_int 0 [0]))
             (label_ref:SI 37)
             (pc))) pr65105-1.c:23 619 {*jcc_1}
      (expr_list:REG_DEAD (reg:CCZ 17 flags)
         (int_list:REG_BR_PROB 9100 (nil)))
  -> 37)
(code_label 37 11 36 3 11 "" [2 uses])
(note 36 37 18 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
(insn 18 36 19 3 (set (mem:DI (reg/f:SI 7 sp) [0  S8 A32])
         (reg/v:DI 87 [ tmp ])) pr65105-1.c:25 89 {*movdi_internal}
      (nil))
(call_insn 19 18 20 3 (call (mem:QI (symbol_ref:SI ("counter") [flags
0x3]  <function_decl 0x7f94046ea798 counter>) [0 counter S1 A8])
         (const_int 8 [0x8])) pr65105-1.c:25 666 {*call}
      (expr_list:REG_CALL_DECL (symbol_ref:SI ("counter") [flags 0x3]
<function_decl 0x7f94046ea798 counter>)
         (expr_list:REG_EH_REGION (const_int 0 [0])
             (nil)))
     (expr_list:DI (use (mem:DI (reg/f:SI 7 sp) [0  S8 A32]))
         (nil)))
(insn 20 19 52 3 (parallel [
             (set (reg/v/f:SI 96 [ arr ])
                 (plus:SI (reg/v/f:SI 96 [ arr ])
                     (const_int 8 [0x8])))
             (clobber (reg:CC 17 flags))
         ]) pr65105-1.c:26 220 {*addsi_1}
      (expr_list:REG_UNUSED (reg:CC 17 flags)
         (nil)))
(insn 52 20 21 3 (set (reg:DI 106)
         (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
                 (const_int -8 [0xfffffffffffffff8])) [2 MEM[base:
arr_14, offset: 4294967288B]+0 S8 A64])) pr65105-1.c:26 -1
      (nil))
(insn 21 52 42 3 (set (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
         (and:V2DI (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
             (subreg:V2DI (reg:DI 106) 0))) pr65105-1.c:26 3487 {*andv2di3}
      (expr_list:REG_UNUSED (reg:CC 17 flags)
         (nil)))
(insn 42 21 43 3 (set (reg:V2DI 102)
         (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)) pr65105-1.c:26 -1
      (nil))
(insn 43 42 44 3 (set (subreg:SI (reg:DI 101) 0)
         (subreg:SI (reg:V2DI 102) 0)) pr65105-1.c:26 -1
      (nil))
(insn 44 43 45 3 (set (reg:V2DI 102)
         (lshiftrt:V2DI (reg:V2DI 102)
             (const_int 32 [0x20]))) pr65105-1.c:26 -1
      (nil))
(insn 45 44 23 3 (set (subreg:SI (reg:DI 101) 4)
         (subreg:SI (reg:V2DI 102) 0)) pr65105-1.c:26 -1
      (nil))
(note 23 45 24 3 NOTE_INSN_DELETED)
(insn 24 23 25 3 (parallel [
             (set (reg:CCZ 17 flags)
                 (compare:CCZ (ior:SI (subreg:SI (reg:DI 101) 4)
                         (subreg:SI (reg:DI 101) 0))
                     (const_int 0 [0])))
             (clobber (scratch:SI))
         ]) pr65105-1.c:23 447 {*iorsi_3}
      (nil))
(jump_insn 25 24 30 3 (set (pc)
         (if_then_else (ne (reg:CCZ 17 flags)
                 (const_int 0 [0]))
             (label_ref:SI 37)
             (pc))) pr65105-1.c:23 619 {*jcc_1}
      (expr_list:REG_DEAD (reg:CCZ 17 flags)
         (int_list:REG_BR_PROB 9100 (nil)))
  -> 37)


r87 [tmp] has one definition before the loop (insn 8) and one
definition in the loop (insn 21). But after reload I see that the result of
insn 8 is stored to the stack and this stored value is used in the loop.
The value produced by insn 21 is not stored to the stack, and therefore
a wrong value is used starting from the second loop iteration. Here is
the resulting assembly:

test:
.LFB10:
         .cfi_startproc
         pushl   %ebx
         .cfi_def_cfa_offset 8
         .cfi_offset 3, -8
         leal    -40(%esp), %esp
         .cfi_def_cfa_offset 48
         movl    48(%esp), %ebx
         movq    8(%ebx), %xmm1
         movq    16(%ebx), %xmm0
         pand    %xmm1, %xmm0
         movq    (%ebx), %xmm1
         movdqa  %xmm0, %xmm4
         por     %xmm1, %xmm4
         movdqa  %xmm4, %xmm0
         movd    %xmm4, %edx
         **movq    %xmm4, 16(%esp)**
         psrlq   $32, %xmm0
         movd    %xmm0, %eax
         orl     %edx, %eax
         je      .L7
         .p2align 4,,15
.L11:
         **movl    16(%esp), %eax**
         addl    $8, %ebx
         **movl    20(%esp), %edx**
         movl    %eax, (%esp)
         movl    %edx, 4(%esp)
         call    counter
         movq    -8(%ebx), %xmm0
         **movdqa  16(%esp), %xmm2**
         pand    %xmm0, %xmm2
         movdqa  %xmm2, %xmm0
         movd    %xmm2, %edx
         psrlq   $32, %xmm0
         movd    %xmm0, %eax
         orl     %edx, %eax
         jne     .L11
.L7:
         leal    40(%esp), %esp
         .cfi_def_cfa_offset 8
         popl    %ebx
         .cfi_restore 3
         .cfi_def_cfa_offset 4
         ret

Am I misusing paradoxical subregs? Is there any other way to mix scalar
and vector code and perform vector casts?

BTW this test works OK with another option set, where r87 is not spilled to
memory but is preserved in GPRs across the call instead.

Thanks,
Ilya
Hi Vladimir,

Could you please comment on this?


Ilya, I think the idea is worth trying, but the results might be mixed. It is hard to say until you actually try it (as an example, Jan implemented -mfpmath=both and it looked like a pretty good idea, at least to me, but when I checked SPEC2000 the results were not so good even with IRA/LRA).

Long ago I did some experiments and found that spilling into SSE would be beneficial for Intel CPUs but not for AMD ones. As I remember, I also found that storing several scalar values into one SSE reg and extracting them when you need to do some (fp) arithmetic would be beneficial for AMD but not for Intel CPUs. In the literature, a more general approach is called a bitwise register allocator. Actually it would be a pretty big IRA/LRA project from which some targets might benefit.


As for the wrong code, it is hard for me to say anything without RA dumps. If you send me the dump (-fira-verbose=16), I might be able to say more about what is going on.

