2015-04-25 4:32 GMT+03:00 Jan Hubicka <hubi...@ucw.cz>:
> Hi,
> I am adding Vladimir and Richard into CC. I tried to solve a similar problem
> with FP math years ago by having -mfpmath=sse,i387. The idea was to allow
> use of i387 registers when SSE ones run out and possibly also model the fact
> that Pentium4 had faster i387 additions than SSE additions. I also had some
> plans to extend this to mixed SSE/MMX/GPR integer arithmetic, but never
> got to that.
>
> This did not really fly because of the regalloc not really being able to
> understand it (I made a patch to regclass to propagate the classes and figure out
> what operations need to stay in i387 and what in SSE to avoid reloading, but
> that never got in).
>
> I believe Vladimir did some work on this with IRA (he is able to spill GPR
> regs into SSE and do a bit of other tricks).
>
> Also I believe it was kind of Richard's design decision to avoid use of
> (paradoxical) subregs for vector conversions because these have funny
> implications.
>
> The code for handling upper parts of paradoxical subregs is controlled by
> macros around SUBREG_PROMOTED_VAR_P but I do not think it will handle
> V1DI->V2DI conversions fluently without some middle-end hacking. (it will
> probably try to produce zero extensions)
>
> When we are on SSE instructions, it would be great to finally teach
> copy_by_pieces/store_by_pieces to use vector instructions (these are more
> compact and either equally fast or faster on some CPUs). I hope to get into
> this, but it would be great if someone beat me.
>
> Honza
>
I'm trying to implement it as a separate RTL pass which chooses a scalar or
vector mode for each 64-bit computation chain and performs the transformation
if we choose to use vectors. I also want to split DI instructions which are
going to be implemented on GPRs before RA (currently it is done in the second
split pass). A good metric for such a transformation is an open question, but
currently I can't even make it generate correct code when paradoxical subregs
are used. It works in simple cases but I run into trouble when spills appear.
I'm trying to beat the following testcase:

test (long long *arr)
{
  register unsigned long long tmp;

  tmp = arr[0] | arr[1] & arr[2];
  while (tmp)
    {
      counter (tmp);
      tmp = *(arr++) & tmp;
    }
}

The RTL I generate seems OK to me (ignoring the fact that it is not optimal):

(insn 6 3 50 2 (set (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ])
        (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
                (const_int 8 [0x8])) [2 MEM[(long long int *)arr_5(D) + 8B]+0 S8 A64])) pr65105-1.c:22 89 {*movdi_internal}
     (nil))
(insn 50 6 7 2 (set (reg:DI 104)
        (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
                (const_int 16 [0x10])) [2 MEM[(long long int *)arr_5(D) + 16B]+0 S8 A64])) pr65105-1.c:22 -1
     (nil))
(insn 7 50 51 2 (set (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
        (and:V2DI (subreg:V2DI (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ]) 0)
            (subreg:V2DI (reg:DI 104) 0))) pr65105-1.c:22 3487 {*andv2di3}
     (expr_list:REG_DEAD (subreg:V2DI (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ]) 0)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (expr_list:REG_EQUAL (and:DI (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
                            (const_int 8 [0x8])) [2 MEM[(long long int *)arr_5(D) + 8B]+0 S8 A64])
                    (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
                            (const_int 16 [0x10])) [2 MEM[(long long int *)arr_5(D) + 16B]+0 S8 A64]))
                (nil)))))
(insn 51 7 8 2 (set (reg:DI 105)
        (mem:DI (reg/v/f:SI 96 [ arr ]) [2 *arr_5(D)+0 S8 A64])) pr65105-1.c:22 -1
     (nil))
(insn 8 51 46 2 (set (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
        (ior:V2DI (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
            (subreg:V2DI (reg:DI 105) 0))) pr65105-1.c:22 3489 {*iorv2di3}
     (expr_list:REG_DEAD (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 46 8 47 2 (set (reg:V2DI 103)
        (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)) pr65105-1.c:22 -1
     (nil))
(insn 47 46 48 2 (set (subreg:SI (reg:DI 101) 0)
        (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
     (nil))
(insn 48 47 49 2 (set (reg:V2DI 103)
        (lshiftrt:V2DI (reg:V2DI 103)
            (const_int 32 [0x20]))) pr65105-1.c:22 -1
     (nil))
(insn 49 48 9 2 (set (subreg:SI (reg:DI 101) 4)
        (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
     (nil))
(note 9 49 10 2 NOTE_INSN_DELETED)
(insn 10 9 11 2 (parallel [
            (set (reg:CCZ 17 flags)
                (compare:CCZ (ior:SI (subreg:SI (reg:DI 101) 4)
                        (subreg:SI (reg:DI 101) 0))
                    (const_int 0 [0])))
            (clobber (scratch:SI))
        ]) pr65105-1.c:23 447 {*iorsi_3}
     (nil))
(jump_insn 11 10 37 2 (set (pc)
        (if_then_else (ne (reg:CCZ 17 flags)
                (const_int 0 [0]))
            (label_ref:SI 37)
            (pc))) pr65105-1.c:23 619 {*jcc_1}
     (expr_list:REG_DEAD (reg:CCZ 17 flags)
        (int_list:REG_BR_PROB 9100 (nil)))
 -> 37)
(code_label 37 11 36 3 11 "" [2 uses])
(note 36 37 18 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
(insn 18 36 19 3 (set (mem:DI (reg/f:SI 7 sp) [0 S8 A32])
        (reg/v:DI 87 [ tmp ])) pr65105-1.c:25 89 {*movdi_internal}
     (nil))
(call_insn 19 18 20 3 (call (mem:QI (symbol_ref:SI ("counter") [flags 0x3] <function_decl 0x7f94046ea798 counter>) [0 counter S1 A8])
        (const_int 8 [0x8])) pr65105-1.c:25 666 {*call}
     (expr_list:REG_CALL_DECL (symbol_ref:SI ("counter") [flags 0x3] <function_decl 0x7f94046ea798 counter>)
        (expr_list:REG_EH_REGION (const_int 0 [0])
            (nil)))
    (expr_list:DI (use (mem:DI (reg/f:SI 7 sp) [0 S8 A32]))
        (nil)))
(insn 20 19 52 3 (parallel [
            (set (reg/v/f:SI 96 [ arr ])
                (plus:SI (reg/v/f:SI 96 [ arr ])
                    (const_int 8 [0x8])))
            (clobber (reg:CC 17 flags))
        ]) pr65105-1.c:26 220 {*addsi_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))
(insn 52 20 21 3 (set (reg:DI 106)
        (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
                (const_int -8 [0xfffffffffffffff8])) [2 MEM[base: arr_14, offset: 4294967288B]+0 S8 A64])) pr65105-1.c:26 -1
     (nil))
(insn 21 52 42 3 (set (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
        (and:V2DI (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
            (subreg:V2DI (reg:DI 106) 0))) pr65105-1.c:26 3487 {*andv2di3}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))
(insn 42 21 43 3 (set (reg:V2DI 102)
        (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)) pr65105-1.c:26 -1
     (nil))
(insn 43 42 44 3 (set (subreg:SI (reg:DI 101) 0)
        (subreg:SI (reg:V2DI 102) 0)) pr65105-1.c:26 -1
     (nil))
(insn 44 43 45 3 (set (reg:V2DI 102)
        (lshiftrt:V2DI (reg:V2DI 102)
            (const_int 32 [0x20]))) pr65105-1.c:26 -1
     (nil))
(insn 45 44 23 3 (set (subreg:SI (reg:DI 101) 4)
        (subreg:SI (reg:V2DI 102) 0)) pr65105-1.c:26 -1
     (nil))
(note 23 45 24 3 NOTE_INSN_DELETED)
(insn 24 23 25 3 (parallel [
            (set (reg:CCZ 17 flags)
                (compare:CCZ (ior:SI (subreg:SI (reg:DI 101) 4)
                        (subreg:SI (reg:DI 101) 0))
                    (const_int 0 [0])))
            (clobber (scratch:SI))
        ]) pr65105-1.c:23 447 {*iorsi_3}
     (nil))
(jump_insn 25 24 30 3 (set (pc)
        (if_then_else (ne (reg:CCZ 17 flags)
                (const_int 0 [0]))
            (label_ref:SI 37)
            (pc))) pr65105-1.c:23 619 {*jcc_1}
     (expr_list:REG_DEAD (reg:CCZ 17 flags)
        (int_list:REG_BR_PROB 9100 (nil)))
 -> 37)

r87 [tmp] has one definition before the loop (insn 8) and one definition in
the loop (insn 21). But after reload I see that the result of insn 8 is stored
to the stack and this stored value is used in the loop. The value produced by
insn 21 is not stored to the stack, and therefore a stale value is used
starting from the second loop iteration.
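
To check the generated code at run time rather than by reading dumps, the
testcase can be wrapped in a small harness. The following is only a sketch and
not part of pr65105-1.c: the counter() body, run_check(), MAX_CALLS and the
input values are assumptions made for illustration. The testcase itself is
unchanged apart from an explicit return type, added parentheses and a noinline
attribute.

```c
#include <assert.h>

#define MAX_CALLS 8

/* Values passed to counter(), in call order. */
static unsigned long long seen[MAX_CALLS];
static int ncalls;

/* Hypothetical stand-in for the external counter() of the testcase:
   it only logs its argument. */
void counter(unsigned long long v)
{
    assert(ncalls < MAX_CALLS);   /* guard the log against overflow */
    seen[ncalls++] = v;
}

/* The testcase, with an explicit return type and explicit parentheses.
   noinline keeps the loop and the call in a function of their own. */
__attribute__((noinline)) void test(long long *arr)
{
    register unsigned long long tmp;

    tmp = arr[0] | (arr[1] & arr[2]);
    while (tmp) {
        counter(tmp);
        tmp = *(arr++) & tmp;
    }
}

/* Returns 0 if counter() saw the expected sequence, nonzero otherwise. */
int run_check(void)
{
    /* Assumed input: both 32-bit halves of every value matter, and the
       trailing 0 forces tmp to 0 by the fourth iteration at the latest. */
    long long arr[] = { 0xF0000000F0LL, 0x0F0000000FLL, 0x0F0000000FLL, 0 };

    ncalls = 0;
    test(arr);

    if (ncalls != 2)
        return 1;                       /* wrong number of iterations */
    if (seen[0] != 0xFF000000FFULL)
        return 2;                       /* wrong initial tmp */
    if (seen[1] != 0xF0000000F0ULL)
        return 3;                       /* tmp not updated in the loop */
    return 0;
}
```

With correct code run_check() returns 0. If the value defined inside the loop
is lost, as described above, counter() would be expected to see the initial
value again on the second iteration, so the check should fail.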
Here is the resulting assembler:

test:
.LFB10:
        .cfi_startproc
        pushl   %ebx
        .cfi_def_cfa_offset 8
        .cfi_offset 3, -8
        leal    -40(%esp), %esp
        .cfi_def_cfa_offset 48
        movl    48(%esp), %ebx
        movq    8(%ebx), %xmm1
        movq    16(%ebx), %xmm0
        pand    %xmm1, %xmm0
        movq    (%ebx), %xmm1
        movdqa  %xmm0, %xmm4
        por     %xmm1, %xmm4
        movdqa  %xmm4, %xmm0
        movd    %xmm4, %edx
        **movq    %xmm4, 16(%esp)**
        psrlq   $32, %xmm0
        movd    %xmm0, %eax
        orl     %edx, %eax
        je      .L7
        .p2align 4,,15
.L11:
        **movl    16(%esp), %eax**
        addl    $8, %ebx
        **movl    20(%esp), %edx**
        movl    %eax, (%esp)
        movl    %edx, 4(%esp)
        call    counter
        movq    -8(%ebx), %xmm0
        **movdqa  16(%esp), %xmm2**
        pand    %xmm0, %xmm2
        movdqa  %xmm2, %xmm0
        movd    %xmm2, %edx
        psrlq   $32, %xmm0
        movd    %xmm0, %eax
        orl     %edx, %eax
        jne     .L11
.L7:
        leal    40(%esp), %esp
        .cfi_def_cfa_offset 8
        popl    %ebx
        .cfi_restore 3
        .cfi_def_cfa_offset 4
        ret

Am I misusing paradoxical subregs? Is there any other way to mix scalar and
vector code and perform vector casts?

BTW, this test works OK with another option set, where r87 is not spilled to
memory but is kept in GPRs across the call instead.

Thanks,
Ilya