Re: [i386] Scalar DImode instructions on XMM registers
On 05/27/2015 07:20 AM, Ilya Enkovich wrote:
> I looked into the assign_stack_local_1 call for this spill. LRA correctly
> requests 16 bytes size with 16 bytes alignment. But assign_stack_local_1
> reduces alignment to 8 because estimated stack alignment before RA is 8
> and requested mode's (DI) alignment fits it. Probably LRA should pass
> biggest_mode of the reg when requesting a stack slot?

It's hard to say for sure. Within the lra_reg structure, biggest_mode refers to the largest mode in which a pseudo is referenced. So for a pseudo it might make sense. Presumably the biggest_mode for the pseudo in question is larger than DImode, right?

> I handled it by increasing stack_alignment_estimated when transforming
> some instructions to vector mode.

I haven't looked deeply, but if your pass runs after stack_alignment_estimated is initially computed, then this seems like a desirable way to fix the problem.

jeff
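The alignment reduction discussed above can be illustrated with a small model. This is only a sketch built from the description in this thread, not the code of assign_stack_local_1; the function name and the byte-based parameters are hypothetical.

```c
#include <assert.h>

/* Hypothetical, simplified model of the decision described in the thread.
   All values are in bytes.  This is NOT GCC's assign_stack_local_1.  */
unsigned
effective_slot_align (unsigned requested_align, unsigned mode_align,
                      unsigned stack_align_estimated)
{
  assert (requested_align >= mode_align);

  /* The request exceeds the estimated stack alignment, but the mode of
     the spilled pseudo fits the estimate: the slot only gets the
     estimated alignment.  */
  if (requested_align > stack_align_estimated
      && mode_align <= stack_align_estimated)
    return stack_align_estimated;

  /* Otherwise the request is kept.  Anything the real allocator does in
     this case (such as raising the estimate) is not modeled here.  */
  return requested_align;
}
```

With a 16-byte request for a DImode pseudo (mode alignment 8) and an estimate of 8, the model returns 8, which matches the under-aligned slot reported above. Raising the estimate to 16 before register allocation, as Ilya's pass does, makes the same request keep its 16-byte alignment.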
Re: [i386] Scalar DImode instructions on XMM registers
2015-05-27 6:31 GMT+03:00 Jeff Law l...@redhat.com:
> On 05/25/2015 09:27 AM, Ilya Enkovich wrote:
>> 2015-05-22 15:01 GMT+03:00 Ilya Enkovich enkovich@gmail.com:
>>> 2015-05-22 11:53 GMT+03:00 Ilya Enkovich enkovich@gmail.com:
>>>> 2015-05-21 22:08 GMT+03:00 Vladimir Makarov vmaka...@redhat.com:
>>>>> So, Ilya, to solve the problem you need to avoid sharing subregs for
>>>>> the correct LRA/reload work.
>>>>
>>>> Thanks a lot for your help! I'll fix it.
>>>>
>>>> Ilya
>>>
>>> I've fixed SUBREG sharing and got a missing spill. I added
>>> --enable-checking=rtl to check other possible bugs. Spill/fill code
>>> still seems incorrect because different sizes are used. Shouldn't
>>> block me though.
>>>
>>> .L5:
>>>         movl    16(%esp), %eax
>>>         addl    $8, %esi
>>>         movl    20(%esp), %edx
>>>         movl    %eax, (%esp)
>>>         movl    %edx, 4(%esp)
>>>         call    counter@PLT
>>>         movq    -8(%esi), %xmm0
>>>         **movdqa  16(%esp), %xmm2**
>>>         pand    %xmm0, %xmm2
>>>         movdqa  %xmm2, %xmm0
>>>         movd    %xmm2, %edx
>>>         **movq    %xmm2, 16(%esp)**
>>>         psrlq   $32, %xmm0
>>>         movd    %xmm0, %eax
>>>         orl     %edx, %eax
>>>         jne     .L5
>>>
>>> Thanks,
>>> Ilya
>>
>> I was wrong to assume that reloads with the wrong size shouldn't block me.
>> These reloads require memory to be aligned, which is not always true.
>> Here is what I have in RTL now:
>>
>> (insn 2 7 3 2 (set (reg/v:DI 93 [ l ])
>>         (mem/c:DI (reg/f:SI 16 argp) [1 l+0 S8 A32])) test.c:5 89 {*movdi_internal}
>>      (nil))
>> ...
>> (insn 27 26 52 6 (set (subreg:V2DI (reg:DI 87 [ D.1822 ]) 0)
>>         (ior:V2DI (subreg:V2DI (reg:DI 99 [ D.1822 ]) 0)
>>             (subreg:V2DI (reg/v:DI 93 [ l ]) 0))) test.c:11 3489 {*iorv2di3}
>>      (expr_list:REG_DEAD (reg:DI 99 [ D.1822 ])
>>         (expr_list:REG_DEAD (reg/v:DI 93 [ l ])
>>             (nil))))
>>
>> After reload I get:
>>
>> (insn 2 7 75 2 (set (reg/v:DI 0 ax [orig:93 l ] [93])
>>         (mem/c:DI (plus:SI (reg/f:SI 7 sp)
>>                 (const_int 24 [0x18])) [1 l+0 S8 A32])) test.c:5 89 {*movdi_internal}
>>      (nil))
>> (insn 75 2 3 2 (set (mem/c:DI (reg/f:SI 7 sp) [3 %sfp+-16 S8 A64])
>>         (reg/v:DI 0 ax [orig:93 l ] [93])) test.c:5 89 {*movdi_internal}
>>      (nil))
>> ...
>> (insn 27 26 52 6 (set (reg:V2DI 21 xmm0 [orig:87 D.1822 ] [87])
>>         (ior:V2DI (reg:V2DI 21 xmm0 [orig:99 D.1822 ] [99])
>>             (mem/c:V2DI (reg/f:SI 7 sp) [3 %sfp+-16 S16 A64]))) test.c:11 3489 {*iorv2di3}
>>
>> The 'por' instruction requires memory to be aligned and fails in a bigger
>> testcase. There is also a movdqa generated for esp by reload. Does it mean
>> I still have some inconsistencies in the produced RTL? Probably I should
>> somehow transform loads and stores?
>
> I'd start by looking at the AP-SP elimination step. What's the defined
> stack alignment and whether or not a dynamic stack realignment is needed.
> If you don't have all that set up properly prior to the allocators, then
> they're not going to know what objects to align nor how to align them.

I looked into the assign_stack_local_1 call for this spill. LRA correctly requests 16 bytes size with 16 bytes alignment. But assign_stack_local_1 reduces alignment to 8 because estimated stack alignment before RA is 8 and requested mode's (DI) alignment fits it. Probably LRA should pass biggest_mode of the reg when requesting a stack slot?

I handled it by increasing stack_alignment_estimated when transforming some instructions to vector mode.

Thanks for help!
Ilya

> jeff
Re: [i386] Scalar DImode instructions on XMM registers
On 05/25/2015 09:27 AM, Ilya Enkovich wrote:
> 2015-05-22 15:01 GMT+03:00 Ilya Enkovich enkovich@gmail.com:
>> 2015-05-22 11:53 GMT+03:00 Ilya Enkovich enkovich@gmail.com:
>>> 2015-05-21 22:08 GMT+03:00 Vladimir Makarov vmaka...@redhat.com:
>>>> So, Ilya, to solve the problem you need to avoid sharing subregs for
>>>> the correct LRA/reload work.
>>>
>>> Thanks a lot for your help! I'll fix it.
>>>
>>> Ilya
>>
>> I've fixed SUBREG sharing and got a missing spill. I added
>> --enable-checking=rtl to check other possible bugs. Spill/fill code
>> still seems incorrect because different sizes are used. Shouldn't
>> block me though.
>>
>> .L5:
>>         movl    16(%esp), %eax
>>         addl    $8, %esi
>>         movl    20(%esp), %edx
>>         movl    %eax, (%esp)
>>         movl    %edx, 4(%esp)
>>         call    counter@PLT
>>         movq    -8(%esi), %xmm0
>>         **movdqa  16(%esp), %xmm2**
>>         pand    %xmm0, %xmm2
>>         movdqa  %xmm2, %xmm0
>>         movd    %xmm2, %edx
>>         **movq    %xmm2, 16(%esp)**
>>         psrlq   $32, %xmm0
>>         movd    %xmm0, %eax
>>         orl     %edx, %eax
>>         jne     .L5
>>
>> Thanks,
>> Ilya
>
> I was wrong to assume that reloads with the wrong size shouldn't block me.
> These reloads require memory to be aligned, which is not always true.
> Here is what I have in RTL now:
>
> (insn 2 7 3 2 (set (reg/v:DI 93 [ l ])
>         (mem/c:DI (reg/f:SI 16 argp) [1 l+0 S8 A32])) test.c:5 89 {*movdi_internal}
>      (nil))
> ...
> (insn 27 26 52 6 (set (subreg:V2DI (reg:DI 87 [ D.1822 ]) 0)
>         (ior:V2DI (subreg:V2DI (reg:DI 99 [ D.1822 ]) 0)
>             (subreg:V2DI (reg/v:DI 93 [ l ]) 0))) test.c:11 3489 {*iorv2di3}
>      (expr_list:REG_DEAD (reg:DI 99 [ D.1822 ])
>         (expr_list:REG_DEAD (reg/v:DI 93 [ l ])
>             (nil))))
>
> After reload I get:
>
> (insn 2 7 75 2 (set (reg/v:DI 0 ax [orig:93 l ] [93])
>         (mem/c:DI (plus:SI (reg/f:SI 7 sp)
>                 (const_int 24 [0x18])) [1 l+0 S8 A32])) test.c:5 89 {*movdi_internal}
>      (nil))
> (insn 75 2 3 2 (set (mem/c:DI (reg/f:SI 7 sp) [3 %sfp+-16 S8 A64])
>         (reg/v:DI 0 ax [orig:93 l ] [93])) test.c:5 89 {*movdi_internal}
>      (nil))
> ...
> (insn 27 26 52 6 (set (reg:V2DI 21 xmm0 [orig:87 D.1822 ] [87])
>         (ior:V2DI (reg:V2DI 21 xmm0 [orig:99 D.1822 ] [99])
>             (mem/c:V2DI (reg/f:SI 7 sp) [3 %sfp+-16 S16 A64]))) test.c:11 3489 {*iorv2di3}
>
> The 'por' instruction requires memory to be aligned and fails in a bigger
> testcase. There is also a movdqa generated for esp by reload. Does it mean
> I still have some inconsistencies in the produced RTL? Probably I should
> somehow transform loads and stores?

I'd start by looking at the AP-SP elimination step. What's the defined stack alignment and whether or not a dynamic stack realignment is needed. If you don't have all that set up properly prior to the allocators, then they're not going to know what objects to align nor how to align them.

jeff
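The alignment problem in the quoted RTL (a V2DI access to a slot marked S16 A64) comes down to simple address arithmetic. The sketch below is only an illustration; both helper names are hypothetical and nothing here is taken from GCC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if ADDR is a multiple of ALIGN, where ALIGN is a power of two
   given in bytes.  */
bool
is_aligned (uintptr_t addr, uintptr_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return (addr & (align - 1)) == 0;
}

/* True if SP + OFFSET is NEEDED-byte aligned for every stack pointer
   value that is only known to be GUARANTEED-byte aligned.  Both
   alignments are powers of two in bytes.  */
bool
slot_always_aligned (uintptr_t offset, uintptr_t guaranteed,
                     uintptr_t needed)
{
  assert (guaranteed != 0 && (guaranteed & (guaranteed - 1)) == 0);
  return guaranteed >= needed && is_aligned (offset, needed);
}
```

For example, slot_always_aligned (0, 8, 16) is false: a stack pointer such as 0x1008 is 8-byte aligned but not 16-byte aligned, so a 16-byte aligned access at (%esp) can fault. Once the stack is known to be 16-byte aligned, the same slot at offset 0 is safe.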
Re: [i386] Scalar DImode instructions on XMM registers
2015-05-22 15:01 GMT+03:00 Ilya Enkovich enkovich@gmail.com:
> 2015-05-22 11:53 GMT+03:00 Ilya Enkovich enkovich@gmail.com:
>> 2015-05-21 22:08 GMT+03:00 Vladimir Makarov vmaka...@redhat.com:
>>> So, Ilya, to solve the problem you need to avoid sharing subregs for
>>> the correct LRA/reload work.
>>
>> Thanks a lot for your help! I'll fix it.
>>
>> Ilya
>
> I've fixed SUBREG sharing and got a missing spill. I added
> --enable-checking=rtl to check other possible bugs. Spill/fill code
> still seems incorrect because different sizes are used. Shouldn't
> block me though.
>
> .L5:
>         movl    16(%esp), %eax
>         addl    $8, %esi
>         movl    20(%esp), %edx
>         movl    %eax, (%esp)
>         movl    %edx, 4(%esp)
>         call    counter@PLT
>         movq    -8(%esi), %xmm0
>         **movdqa  16(%esp), %xmm2**
>         pand    %xmm0, %xmm2
>         movdqa  %xmm2, %xmm0
>         movd    %xmm2, %edx
>         **movq    %xmm2, 16(%esp)**
>         psrlq   $32, %xmm0
>         movd    %xmm0, %eax
>         orl     %edx, %eax
>         jne     .L5
>
> Thanks,
> Ilya

I was wrong to assume that reloads with the wrong size shouldn't block me. These reloads require memory to be aligned, which is not always true. Here is what I have in RTL now:

(insn 2 7 3 2 (set (reg/v:DI 93 [ l ])
        (mem/c:DI (reg/f:SI 16 argp) [1 l+0 S8 A32])) test.c:5 89 {*movdi_internal}
     (nil))
...
(insn 27 26 52 6 (set (subreg:V2DI (reg:DI 87 [ D.1822 ]) 0)
        (ior:V2DI (subreg:V2DI (reg:DI 99 [ D.1822 ]) 0)
            (subreg:V2DI (reg/v:DI 93 [ l ]) 0))) test.c:11 3489 {*iorv2di3}
     (expr_list:REG_DEAD (reg:DI 99 [ D.1822 ])
        (expr_list:REG_DEAD (reg/v:DI 93 [ l ])
            (nil))))

After reload I get:

(insn 2 7 75 2 (set (reg/v:DI 0 ax [orig:93 l ] [93])
        (mem/c:DI (plus:SI (reg/f:SI 7 sp)
                (const_int 24 [0x18])) [1 l+0 S8 A32])) test.c:5 89 {*movdi_internal}
     (nil))
(insn 75 2 3 2 (set (mem/c:DI (reg/f:SI 7 sp) [3 %sfp+-16 S8 A64])
        (reg/v:DI 0 ax [orig:93 l ] [93])) test.c:5 89 {*movdi_internal}
     (nil))
...
(insn 27 26 52 6 (set (reg:V2DI 21 xmm0 [orig:87 D.1822 ] [87])
        (ior:V2DI (reg:V2DI 21 xmm0 [orig:99 D.1822 ] [99])
            (mem/c:V2DI (reg/f:SI 7 sp) [3 %sfp+-16 S16 A64]))) test.c:11 3489 {*iorv2di3}

The 'por' instruction requires memory to be aligned and fails in a bigger testcase. There is also a movdqa generated for esp by reload. Does it mean I still have some inconsistencies in the produced RTL? Probably I should somehow transform loads and stores?

Thanks,
Ilya

ira.log
Description: Binary data

pr65105.patch
Description: Binary data

extern long long arr[];

long long
test (long long l, int i1, int i2)
{
  switch (i2)
    {
    case 1:
      return l | arr[i1];
    case 8:
      return l | arr[i1] & arr[i2];
    }
  return l;
}
Re: [i386] Scalar DImode instructions on XMM registers
2015-05-21 22:08 GMT+03:00 Vladimir Makarov vmaka...@redhat.com:
> On 05/21/2015 05:54 AM, Ilya Enkovich wrote:
>>> Thanks. For me it looks like an inheritance bug. It is really hard to
>>> fix the bug w/o the source code. Could you send me your patch so that
>>> I can debug RA with it to investigate more.
>>
>> Sure! Here is a patch and a testcase. I applied patch to r222125.
>>
>> Cmd to reproduce: gcc -m32 -msse4.2 -O2 pr65105.c -S -march=slm -fPIE
>
> The problem is in sharing a subreg in different insns. Pseudos should be
> shared but not their subregs.
>
> We have before inheritance:
>
>    28: r132:V2DI=r132:V2DI|r126:DI#0
>       REG_DEAD r126:DI
>       REG_DEAD r118:DI
>     Inserting insn reload before:
>    81: r132:V2DI=r118:DI#0
>     Inserting insn reload after:
>    82: r108:DI#0=r132:V2DI
> ...
>     Creating newreg=135, assigning class SSE_REGS to r135
>    42: r135:V2DI=r135:V2DI&r108:DI#0
>       REG_DEAD r127:DI
>     Inserting insn reload before:
>    85: r135:V2DI=r127:DI#0
>     Inserting insn reload after:
>    86: r108:DI#0=r135:V2DI
>
> As the subregs of 108 in original insns 28 and 42 are shared, the subregs
> of 108 in insns 82 and 86 are shared too. During the inheritance subpass
> we change r108 in insn 82 onto r137. This changes insn 86 too.
>
>     Creating newreg=137 from oldreg=108, assigning class NO_REX_SSE_REGS to inheritance r137
>     Original reg change 108->137 (bb2):
>    82: r137:DI#0=r132:V2DI
>       REG_DEAD r132:V2DI
>     Add original<-inheritance after:
>    88: r108:DI=r137:DI
>     Inheritance reuse change 108->137 (bb2):
>    68: r124:V2DI=r137:DI#0
>
> And now we are trying to do inheritance for insn #86:
>
>     Creating newreg=138 from oldreg=108, assigning class NO_REX_SSE_REGS to inheritance r138
>     Original reg change 108->138 (bb3):
>    86: r137:DI#0=r135:V2DI
>       REG_DEAD r135:V2DI
>     Add original<-inheritance after:
>    89: r108:DI=r138:DI
>     Inheritance reuse change 108->138 (bb3):
>    64: r123:V2DI=r137:DI#0
>
> and after that we have a complete mess. We are trying to change r108 onto
> r138, but r108 is already r137 because of sharing. Later we undo the
> second inheritance, creating even more mess.
>
> So, Ilya, to solve the problem you need to avoid sharing subregs for the
> correct LRA/reload work.

Thanks a lot for your help! I'll fix it.

Ilya
Re: [i386] Scalar DImode instructions on XMM registers
2015-05-22 11:53 GMT+03:00 Ilya Enkovich enkovich@gmail.com:
> 2015-05-21 22:08 GMT+03:00 Vladimir Makarov vmaka...@redhat.com:
>> So, Ilya, to solve the problem you need to avoid sharing subregs for
>> the correct LRA/reload work.
>
> Thanks a lot for your help! I'll fix it.
>
> Ilya

I've fixed SUBREG sharing and got a missing spill. I added --enable-checking=rtl to check other possible bugs. Spill/fill code still seems incorrect because different sizes are used. Shouldn't block me though.

.L5:
        movl    16(%esp), %eax
        addl    $8, %esi
        movl    20(%esp), %edx
        movl    %eax, (%esp)
        movl    %edx, 4(%esp)
        call    counter@PLT
        movq    -8(%esi), %xmm0
        **movdqa  16(%esp), %xmm2**
        pand    %xmm0, %xmm2
        movdqa  %xmm2, %xmm0
        movd    %xmm2, %edx
        **movq    %xmm2, 16(%esp)**
        psrlq   $32, %xmm0
        movd    %xmm0, %eax
        orl     %edx, %eax
        jne     .L5

Thanks,
Ilya
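The tail of the loop above (movd, psrlq $32, movd, orl, jne) tests a 64-bit value held in an XMM register for zero by pulling out its two 32-bit halves. A scalar C sketch of the same computation follows; the function name is hypothetical and this is only an illustration of what the instruction sequence computes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A DImode value is handled here as two SImode halves.  */
static_assert (sizeof (uint64_t) == 2 * sizeof (uint32_t),
               "64-bit value must split into two 32-bit halves");

/* Test a 64-bit value for non-zero the way the 32-bit code above does.  */
bool
di_nonzero_via_halves (uint64_t v)
{
  uint32_t lo = (uint32_t) v;          /* movd  %xmm2, %edx */
  uint32_t hi = (uint32_t) (v >> 32);  /* psrlq $32, %xmm0; movd %xmm0, %eax */
  return (lo | hi) != 0;               /* orl   %edx, %eax; jne .L5 */
}
```

The result is the same as a direct v != 0 comparison. The split is needed only because a -m32 target has no 64-bit general-purpose register in which to do the comparison.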
Re: [i386] Scalar DImode instructions on XMM registers
On 05/21/2015 01:08 PM, Vladimir Makarov wrote:
> On 05/21/2015 05:54 AM, Ilya Enkovich wrote:
>>> Thanks. For me it looks like an inheritance bug. It is really hard to
>>> fix the bug w/o the source code. Could you send me your patch in order
>>> I can debug RA with it to investigate more.
>>
>> Sure! Here is a patch and a testcase. I applied patch to r222125.
>>
>> Cmd to reproduce: gcc -m32 -msse4.2 -O2 pr65105.c -S -march=slm -fPIE
>
> The problem is in sharing a subreg in different insns. Pseudo should be
> shared but not their subregs.

[ ... ]

> So, Ilya, to solve the problem you need to avoid sharing subregs for the
> correct LRA/reload work.

If their code is sharing subregs, then most definitely that code is wrong. GCC has very well defined rtx sharing rules that are defined in the developer documentation.

jeff
Re: [i386] Scalar DImode instructions on XMM registers
On Thu, May 21, 2015 at 02:23:47PM -0600, Jeff Law wrote:
> On 05/21/2015 01:08 PM, Vladimir Makarov wrote:
>> On 05/21/2015 05:54 AM, Ilya Enkovich wrote:
>>>> Thanks. For me it looks like an inheritance bug. It is really hard to
>>>> fix the bug w/o the source code. Could you send me your patch in order
>>>> I can debug RA with it to investigate more.
>>>
>>> Sure! Here is a patch and a testcase. I applied patch to r222125.
>>>
>>> Cmd to reproduce: gcc -m32 -msse4.2 -O2 pr65105.c -S -march=slm -fPIE
>>
>> The problem is in sharing a subreg in different insns. Pseudo should be
>> shared but not their subregs.
>
> [ ... ]
>
>> So, Ilya, to solve the problem you need to avoid sharing subregs for the
>> correct LRA/reload work.
>
> If their code is sharing subregs, then most definitely that code is
> wrong. GCC has very well defined rtx sharing rules that are defined in
> the developer documentation.

Shouldn't --enable-checking=rtl catch such bugs?

Jakub
Re: [i386] Scalar DImode instructions on XMM registers
On 20 May 23:27, Vladimir Makarov wrote:
> On 20/05/15 04:17 AM, Ilya Enkovich wrote:
>> On 19 May 11:22, Vladimir Makarov wrote:
>>> On 05/18/2015 08:13 AM, Ilya Enkovich wrote:
>>>> 2015-05-06 17:18 GMT+03:00 Ilya Enkovich enkovich@gmail.com:
>>>>
>>>> Hi Vladimir,
>>>>
>>>> Could you please comment on this?
>>>
>>> Ilya, I think that the idea is worth trying but results might be mixed.
>>> It is hard to say until you actually try it (as an example, Jan
>>> implemented -mfpmath=both and it looks like a pretty good idea, at least
>>> to me, but when I checked SPEC2000 the results were not so good even
>>> with IRA/LRA).
>>>
>>> Long ago I did some experiments and found that spilling into SSE would
>>> be beneficial for Intel CPUs but not for AMD ones. As I remember, I also
>>> found that storing several scalar values into one SSE reg and extracting
>>> them when you need to do some (fp) arithmetic would be beneficial for
>>> AMD but not for Intel CPUs.
>>>
>>> In the literature, the more general approach is called a bitwise
>>> register allocator. Actually it would be a pretty big IRA/LRA project
>>> from which some targets might benefit.
>>
>> I suspect such things are not trivially done in IRA/LRA and want to make
>> it an independent optimization because its application seems to be
>> quite narrow.
>
> Yes, that is true. The complications and implementation complexity will
> probably be very high in this project and positive results are not
> certain. So the project might have small value.
>
>>> As for the wrong code, it is hard for me to say anything w/o RA dumps.
>>> If you send me the dump (-fira-verbose=16), I might say more about what
>>> is going on.
>>
>> Here are some dumps from my reproducer. The problematic register is r108.
>
> Thanks. For me it looks like an inheritance bug. It is really hard to fix
> the bug w/o the source code. Could you send me your patch so that I can
> debug RA with it to investigate more.

Sure! Here is a patch and a testcase. I applied patch to r222125.

Cmd to reproduce: gcc -m32 -msse4.2 -O2 pr65105.c -S -march=slm -fPIE

Thanks,
Ilya

void counter (long long l);

void
test (long long *arr)
{
  register unsigned long long tmp;

  tmp = arr[0] | arr[1] & arr[2];
  while (tmp)
    {
      counter (tmp);
      tmp = *(arr++) & tmp;
    }
}

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a607ef4..a9dbfea 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2554,6 +2554,789 @@ rest_of_handle_insert_vzeroupper (void)
   return 0;
 }
 
+static bool
+has_non_address_hard_reg (rtx_insn *insn)
+{
+  df_ref ref;
+  FOR_EACH_INSN_DEF (ref, insn)
+    if (HARD_REGISTER_P (DF_REF_REAL_REG (ref))
+        && !DF_REF_FLAGS_IS_SET (ref, DF_REF_MUST_CLOBBER))
+      return true;
+
+  FOR_EACH_INSN_USE (ref, insn)
+    if (!DF_REF_REG_MEM_P(ref) && HARD_REGISTER_P (DF_REF_REAL_REG (ref)))
+      return true;
+
+  return false;
+}
+
+static bool
+scalar_to_vector_candidate_p (rtx_insn *insn)
+{
+  rtx def_set = single_set (insn);
+
+  if (!def_set)
+    return false;
+
+  if (has_non_address_hard_reg (insn))
+    return false;
+
+  rtx src = SET_SRC (def_set);
+  rtx dst = SET_DEST (def_set);
+
+  /* We are interested in DImode -> V1DI promotion
+     only.  */
+  if (GET_MODE (src) != DImode
+      || GET_MODE (dst) != DImode)
+    return false;
+
+  if (!REG_P (dst) && !MEM_P (dst))
+    return false;
+
+  switch (GET_CODE (src))
+    {
+    case PLUS:
+    case MINUS:
+    case IOR:
+    case XOR:
+    case AND:
+      break;
+
+    default:
+      return false;
+    }
+
+  if (!REG_P (XEXP (src, 0)) && !MEM_P (XEXP (src, 0)))
+      return false;
+
+  if (!REG_P (XEXP (src, 1)) && !MEM_P (XEXP (src, 1)))
+      return false;
+
+  if (GET_MODE (XEXP (src, 0)) != DImode
+      || GET_MODE (XEXP (src, 1)) != DImode)
+    return false;
+
+  return true;
+}
+
+/* Remove regs having both convertible and
+   not convertible definitions.  */
+static void
+remove_non_convertible_regs (bitmap insns)
+{
+  bitmap_iterator bi;
+  unsigned id;
+  bitmap regs = BITMAP_ALLOC (NULL);
+
+  EXECUTE_IF_SET_IN_BITMAP (insns, 0, id, bi)
+    {
+      rtx def_set = single_set (DF_INSN_UID_GET (id)->insn);
+      rtx reg = SET_DEST (def_set);
+
+      if (!REG_P (reg) || bitmap_bit_p (regs, REGNO (reg)))
+        continue;
+
+      for (df_ref def = DF_REG_DEF_CHAIN (REGNO (reg));
+           def;
+           def = DF_REF_NEXT_REG (def))
+        {
+          if (!bitmap_bit_p (insns, DF_REF_INSN_UID (def)))
+            {
+              if (dump_file)
+                fprintf (dump_file,
+                         "r%d has non convertible definition in insn %d\n",
+                         REGNO (reg), DF_REF_INSN_UID (def));
+
+              bitmap_set_bit (regs, REGNO (reg));
+              break;
+            }
+        }
+    }
+
+  EXECUTE_IF_SET_IN_BITMAP (regs, 0, id, bi)
+    {
+      for (df_ref def = DF_REG_DEF_CHAIN (id);
+           def;
+           def = DF_REF_NEXT_REG (def))
+        if (bitmap_bit_p (insns, DF_REF_INSN_UID (def)))
+          {
+            if (dump_file)
+
Re: [i386] Scalar DImode instructions on XMM registers
On 05/21/2015 05:54 AM, Ilya Enkovich wrote:
>> Thanks. For me it looks like an inheritance bug. It is really hard to
>> fix the bug w/o the source code. Could you send me your patch so that
>> I can debug RA with it to investigate more.
>
> Sure! Here is a patch and a testcase. I applied patch to r222125.
>
> Cmd to reproduce: gcc -m32 -msse4.2 -O2 pr65105.c -S -march=slm -fPIE

The problem is in sharing a subreg in different insns. Pseudos should be shared but not their subregs.

We have before inheritance:

   28: r132:V2DI=r132:V2DI|r126:DI#0
      REG_DEAD r126:DI
      REG_DEAD r118:DI
    Inserting insn reload before:
   81: r132:V2DI=r118:DI#0
    Inserting insn reload after:
   82: r108:DI#0=r132:V2DI
...
    Creating newreg=135, assigning class SSE_REGS to r135
   42: r135:V2DI=r135:V2DI&r108:DI#0
      REG_DEAD r127:DI
    Inserting insn reload before:
   85: r135:V2DI=r127:DI#0
    Inserting insn reload after:
   86: r108:DI#0=r135:V2DI

As the subregs of 108 in original insns 28 and 42 are shared, the subregs of 108 in insns 82 and 86 are shared too. During the inheritance subpass we change r108 in insn 82 onto r137. This changes insn 86 too.

    Creating newreg=137 from oldreg=108, assigning class NO_REX_SSE_REGS to inheritance r137
    Original reg change 108->137 (bb2):
   82: r137:DI#0=r132:V2DI
      REG_DEAD r132:V2DI
    Add original<-inheritance after:
   88: r108:DI=r137:DI
    Inheritance reuse change 108->137 (bb2):
   68: r124:V2DI=r137:DI#0

And now we are trying to do inheritance for insn #86:

    Creating newreg=138 from oldreg=108, assigning class NO_REX_SSE_REGS to inheritance r138
    Original reg change 108->138 (bb3):
   86: r137:DI#0=r135:V2DI
      REG_DEAD r135:V2DI
    Add original<-inheritance after:
   89: r108:DI=r138:DI
    Inheritance reuse change 108->138 (bb3):
   64: r123:V2DI=r137:DI#0

and after that we have a complete mess. We are trying to change r108 onto r138, but r108 is already r137 because of sharing. Later we undo the second inheritance, creating even more mess.

So, Ilya, to solve the problem you need to avoid sharing subregs for the correct LRA/reload work.
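The effect Vladimir describes can be reproduced in miniature with plain pointers. The sketch below is only an analogy: the struct and function names are hypothetical stand-ins, not GCC's rtx types or LRA code. Two "insns" point at one shared "subreg" object, and an in-place register change made for one of them silently changes the other.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for RTL objects; these are not GCC's rtx types.  */
struct subreg { int regno; };
struct insn { int uid; struct subreg *dest; };

/* Rewrite the register inside INSN's destination in place, the way an
   inheritance-style substitution would.  */
void
change_dest_reg (struct insn *insn, int new_regno)
{
  insn->dest->regno = new_regno;
}

/* Give INSN a private copy of its destination, so that in-place edits
   made through other insns cannot reach it.  The caller frees the copy.  */
void
unshare_dest (struct insn *insn)
{
  struct subreg *copy = malloc (sizeof *copy);
  assert (copy != NULL);
  *copy = *insn->dest;
  insn->dest = copy;
}

/* Return the register that insn 86 refers to after only insn 82 has been
   changed from r108 to r137.  */
int
regno_seen_by_insn_86 (int unshare_first)
{
  struct subreg shared = { 108 };
  struct insn i82 = { 82, &shared };
  struct insn i86 = { 86, &shared };

  if (unshare_first)
    unshare_dest (&i86);

  change_dest_reg (&i82, 137);

  int seen = i86.dest->regno;
  if (unshare_first)
    free (i86.dest);
  return seen;
}
```

With sharing, regno_seen_by_insn_86 (0) yields 137 even though only insn 82 was edited, which mirrors insn 86 showing r137 in the dump above. After unsharing, regno_seen_by_insn_86 (1) yields 108, so a later change for insn 86 would start from the expected register.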
Re: [i386] Scalar DImode instructions on XMM registers
On 19 May 11:22, Vladimir Makarov wrote:
> On 05/18/2015 08:13 AM, Ilya Enkovich wrote:
>> 2015-05-06 17:18 GMT+03:00 Ilya Enkovich enkovich@gmail.com:
>>
>> Hi Vladimir,
>>
>> Could you please comment on this?
>
> Ilya, I think that the idea is worth trying but results might be mixed.
> It is hard to say until you actually try it (as an example, Jan
> implemented -mfpmath=both and it looks like a pretty good idea, at least
> to me, but when I checked SPEC2000 the results were not so good even
> with IRA/LRA).
>
> Long ago I did some experiments and found that spilling into SSE would
> be beneficial for Intel CPUs but not for AMD ones. As I remember, I also
> found that storing several scalar values into one SSE reg and extracting
> them when you need to do some (fp) arithmetic would be beneficial for
> AMD but not for Intel CPUs.
>
> In the literature, the more general approach is called a bitwise
> register allocator. Actually it would be a pretty big IRA/LRA project
> from which some targets might benefit.

I suspect such things are not trivially done in IRA/LRA and want to make it an independent optimization because its application seems to be quite narrow.

> As for the wrong code, it is hard for me to say anything w/o RA dumps.
> If you send me the dump (-fira-verbose=16), I might say more about what
> is going on.

Here are some dumps from my reproducer. The problematic register is r108.

Thanks,
Ilya

;; Function test (test, funcdef_no=0, decl_uid=1933, cgraph_uid=0, symbol_order=0)
scanning new insn with uid = 79.
starting the processing of deferred insns ending the processing of deferred insns df_analyze called df_worklist_dataflow_doublequeue:n_basic_blocks 5 n_edges 6 count 5 (1) starting the processing of deferred insns ending the processing of deferred insns df_analyze called Reg 119: local to bb 2 def dominates all uses has unique first use Reg 125 uninteresting Reg 118: local to bb 2 def dominates all uses has unique first use Reg 126 uninteresting Reg 127 uninteresting Found def insn 26 for 119 to be not moveable ;; 2 loops found ;; ;; Loop 0 ;; header 0, latch 1 ;; depth 0, outer -1 ;; nodes: 0 1 2 3 4 ;; ;; Loop 1 ;; header 3, latch 3 ;; depth 1, outer 0 ;; nodes: 3 ;; 2 succs { 3 4 } ;; 3 succs { 3 4 } ;; 4 succs { 1 } starting the processing of deferred insns ending the processing of deferred insns df_analyze called init_insns for 117: (insn_list:REG_DEP_TRUE 22 (nil)) test Dataflow summary: ;; invalidated by call 0 [ax] 1 [dx] 2 [cx] 8 [st] 9 [st(1)] 10 [st(2)] 11 [st(3)] 12 [st(4)] 13 [st(5)] 14 [st(6)] 15 [st(7)] 17 [flags] 18 [fpsr] 19 [fpcr] 21 [xmm0] 22 [xmm1] 23 [xmm2] 24 [xmm3] 25 [xmm4] 26 [xmm5] 27 [xmm6] 28 [xmm7] 29 [mm0] 30 [mm1] 31 [mm2] 32 [mm3] 33 [mm4] 34 [mm5] 35 [mm6] 36 [mm7] 37 [] 38 [] 39 [] 40 [] 41 [] 42 [] 43 [] 44 [] 45 [] 46 [] 47 [] 48 [] 49 [] 50 [] 51 [] 52 [] 53 [] 54 [] 55 [] 56 [] 57 [] 58 [] 59 [] 60 [] 61 [] 62 [] 63 [] 64 [] 65 [] 66 [] 67 [] 68 [] 69 [] 70 [] 71 [] 72 [] 73 [] 74 [] 75 [] 76 [] 77 [] 78 [] 79 [] 80 [] ;; hardware regs used 7 [sp] 16 [argp] 20 [frame] ;; regular block artificial uses6 [bp] 7 [sp] 16 [argp] 20 [frame] ;; eh block artificial uses 6 [bp] 7 [sp] 16 [argp] 20 [frame] ;; entry block defs 0 [ax] 1 [dx] 2 [cx] 6 [bp] 7 [sp] 16 [argp] 20 [frame] 21 [xmm0] 22 [xmm1] 23 [xmm2] 29 [mm0] 30 [mm1] 31 [mm2] ;; exit block uses 6 [bp] 7 [sp] 20 [frame] ;; regs ever live 3[bx] 7[sp] 17[flags] ;; ref usage r0={2d} r1={2d} r2={2d} r3={1d,1u} r6={1d,4u} r7={1d,7u} r8={1d} r9={1d} r10={1d} r11={1d} r12={1d} r13={1d} 
r14={1d} r15={1d} r16={1d,4u,1e} r17={5d,2u} r18={1d} r19={1d} r20={1d,4u} r21={2d} r22={2d} r23={2d} r24={1d} r25={1d} r26={1d} r27={1d} r28={1d} r29={2d} r30={2d} r31={2d} r32={1d} r33={1d} r34={1d} r35={1d} r36={1d} r37={1d} r38={1d} r39={1d} r40={1d} r41={1d} r42={1d} r43={1d} r44={1d} r45={1d} r46={1d} r47={1d} r48={1d} r49={1d} r50={1d} r51={1d} r52={1d} r53={1d} r54={1d} r55={1d} r56={1d} r57={1d} r58={1d} r59={1d} r60={1d} r61={1d} r62={1d} r63={1d} r64={1d} r65={1d} r66={1d} r67={1d} r68={1d} r69={1d} r70={1d} r71={1d} r72={1d} r73={1d} r74={1d} r75={1d} r76={1d} r77={1d} r78={1d} r79={1d} r80={1d} r107={1d,1u} r108={2d,4u} r117={2d,5u,2e} r118={1d,1u} r119={1d,1u} r123={2d,3u} r124={2d,3u} r125={1d,1u} r126={1d,1u} r127={1d,1u} r128={2d,2u} r129={2d,2u} ;;total ref usage 160{110d,47u,3e} in 25{24 regular + 1 call} insns. (note 21 0 24 NOTE_INSN_DELETED) (note 24 21 79 2 [bb 2] NOTE_INSN_BASIC_BLOCK) (insn/f 79 24 22 2 (parallel [ (set (reg:SI 107) (unspec:SI [ (const_int 0 [0]) ] UNSPEC_SET_GOT)) (clobber (reg:CC 17 flags)) ]) 694 {set_got} (expr_list:REG_UNUSED (reg:CC 17 flags) (expr_list:REG_EQUIV (unspec:SI [ (const_int 0 [0]) ] UNSPEC_SET_GOT) (expr_list:REG_CFA_FLUSH_QUEUE (nil) (nil) (insn
Re: [i386] Scalar DImode instructions on XMM registers
On 20/05/15 04:17 AM, Ilya Enkovich wrote:
> On 19 May 11:22, Vladimir Makarov wrote:
>> On 05/18/2015 08:13 AM, Ilya Enkovich wrote:
>>> 2015-05-06 17:18 GMT+03:00 Ilya Enkovich enkovich@gmail.com:
>>>
>>> Hi Vladimir,
>>>
>>> Could you please comment on this?
>>
>> Ilya, I think that the idea is worth trying but results might be mixed.
>> It is hard to say until you actually try it (as an example, Jan
>> implemented -mfpmath=both and it looks like a pretty good idea, at least
>> to me, but when I checked SPEC2000 the results were not so good even
>> with IRA/LRA).
>>
>> Long ago I did some experiments and found that spilling into SSE would
>> be beneficial for Intel CPUs but not for AMD ones. As I remember, I also
>> found that storing several scalar values into one SSE reg and extracting
>> them when you need to do some (fp) arithmetic would be beneficial for
>> AMD but not for Intel CPUs.
>>
>> In the literature, the more general approach is called a bitwise
>> register allocator. Actually it would be a pretty big IRA/LRA project
>> from which some targets might benefit.
>
> I suspect such things are not trivially done in IRA/LRA and want to make
> it an independent optimization because its application seems to be
> quite narrow.

Yes, that is true. The complications and implementation complexity will probably be very high in this project and positive results are not certain. So the project might have small value.

>> As for the wrong code, it is hard for me to say anything w/o RA dumps.
>> If you send me the dump (-fira-verbose=16), I might say more about what
>> is going on.
>
> Here are some dumps from my reproducer. The problematic register is r108.

Thanks. For me it looks like an inheritance bug. It is really hard to fix the bug w/o the source code. Could you send me your patch so that I can debug RA with it to investigate more.
Re: [i386] Scalar DImode instructions on XMM registers
On 05/18/2015 08:13 AM, Ilya Enkovich wrote:
> 2015-05-06 17:18 GMT+03:00 Ilya Enkovich enkovich@gmail.com:
>> 2015-04-25 4:32 GMT+03:00 Jan Hubicka hubi...@ucw.cz:
>>> Hi,
>>> I am adding Vladimir and Richard into CC. I tried to solve a similar
>>> problem with FP math years ago by having -mfpmath=sse,i387. The idea
>>> was to allow use of i387 registers when SSE ones run out and possibly
>>> also model the fact that Pentium4 had faster i387 additions than SSE
>>> additions. I also had some plans to extend this to mixed SSE/MMX/GPR
>>> integer arithmetic, but never got to that.
>>>
>>> This did not really fly because of the regalloc not really being able
>>> to understand it (I made a patch to regclass to propagate the classes
>>> and figure out what operations need to stay in i387 and what in SSE to
>>> avoid reloading, but that never got in). I believe Vladimir did some
>>> work on this with IRA (he is able to spill GPR regs into SSE and do a
>>> bit of other tricks).
>>>
>>> Also I believe it was kind of Richard's design decision to avoid use
>>> of (paradoxical) subregs for vector conversions because these have
>>> funny implications.
>>>
>>> The code for handling upper parts of paradoxical subregs is controlled
>>> by macros around SUBREG_PROMOTED_VAR_P but I do not think it will
>>> handle V1DI->V2DI conversions fluently without some middle-end hacking.
>>> (it will probably try to produce zero extensions)
>>>
>>> When we are on SSE instructions, it would be great to finally teach
>>> copy_by_pieces/store_by_pieces to use vector instructions (these are
>>> more compact and either equally fast or faster on some CPUs). I hope
>>> to get into this, but it would be great if someone beat me.
>>>
>>> Honza
>>
>> I'm trying to implement it as a separate RTL pass which chooses a
>> scalar/vector mode for each 64-bit computation chain and performs the
>> transformation if we choose to use vectors. I also want to split DI
>> instructions which are going to be implemented on GPRs before RA
>> (currently it is done on the second split).
>>
>> Good metrics for such a transformation is a big question but currently
>> I can't even make it generate correct code when paradoxical subregs are
>> used. It works in simple cases but I get troubles when spills appear.
>>
>> Trying to beat the following testcase:
>>
>> test (long long *arr)
>> {
>>   register unsigned long long tmp;
>>
>>   tmp = arr[0] | arr[1] & arr[2];
>>   while (tmp)
>>     {
>>       counter (tmp);
>>       tmp = *(arr++) & tmp;
>>     }
>> }
>>
>> RTL I generate seems OK to me (ignoring the fact that it is not optimal):
>>
>> (insn 6 3 50 2 (set (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ])
>>         (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
>>                 (const_int 8 [0x8])) [2 MEM[(long long int *)arr_5(D) + 8B]+0 S8 A64])) pr65105-1.c:22 89 {*movdi_internal}
>>      (nil))
>> (insn 50 6 7 2 (set (reg:DI 104)
>>         (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
>>                 (const_int 16 [0x10])) [2 MEM[(long long int *)arr_5(D) + 16B]+0 S8 A64])) pr65105-1.c:22 -1
>>      (nil))
>> (insn 7 50 51 2 (set (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
>>         (and:V2DI (subreg:V2DI (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ]) 0)
>>             (subreg:V2DI (reg:DI 104) 0))) pr65105-1.c:22 3487 {*andv2di3}
>>      (expr_list:REG_DEAD (subreg:V2DI (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ]) 0)
>>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>>             (expr_list:REG_EQUAL (and:DI (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
>>                             (const_int 8 [0x8])) [2 MEM[(long long int *)arr_5(D) + 8B]+0 S8 A64])
>>                     (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
>>                             (const_int 16 [0x10])) [2 MEM[(long long int *)arr_5(D) + 16B]+0 S8 A64]))
>>                 (nil)))))
>> (insn 51 7 8 2 (set (reg:DI 105)
>>         (mem:DI (reg/v/f:SI 96 [ arr ]) [2 *arr_5(D)+0 S8 A64])) pr65105-1.c:22 -1
>>      (nil))
>> (insn 8 51 46 2 (set (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
>>         (ior:V2DI (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
>>             (subreg:V2DI (reg:DI 105) 0))) pr65105-1.c:22 3489 {*iorv2di3}
>>      (expr_list:REG_DEAD (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
>>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>>             (nil))))
>> (insn 46 8 47 2 (set (reg:V2DI 103)
>>         (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)) pr65105-1.c:22 -1
>>      (nil))
>> (insn 47 46 48 2 (set (subreg:SI (reg:DI 101) 0)
>>         (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
>>      (nil))
>> (insn 48 47 49 2 (set (reg:V2DI 103)
>>         (lshiftrt:V2DI (reg:V2DI 103)
>>             (const_int 32 [0x20]))) pr65105-1.c:22 -1
>>      (nil))
>> (insn 49 48 9 2 (set (subreg:SI (reg:DI 101) 4)
>>         (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
>>      (nil))
>> (note 9 49 10 2 NOTE_INSN_DELETED)
>> (insn 10 9 11 2 (parallel [
>>             (set (reg:CCZ 17 flags)
>>                 (compare:CCZ (ior:SI (subreg:SI (reg:DI 101) 4)
>>                         (subreg:SI (reg:DI 101) 0))
>>                     (const_int 0 [0])))
>>             (clobber (scratch:SI))
>>         ]) pr65105-1.c:23 447 {*iorsi_3}
>>      (nil))
>> (jump_insn 11 10 37 2 (set (pc)
>>         (if_then_else
Re: [i386] Scalar DImode instructions on XMM registers
2015-05-06 17:18 GMT+03:00 Ilya Enkovich enkovich@gmail.com: 2015-04-25 4:32 GMT+03:00 Jan Hubicka hubi...@ucw.cz: Hi, I am adding Vladimir and Richard into CC. I tried to solve similar problem with FP math years ago by having -mfpmath=sse,i387. The idea was to allow use of i387 registers when SSE ones run out and possibly also model the fact that Pentium4 had faster i387 additions than SSE additions. I also had some plans to extend this one mixed SSE/MMX/GPR integer arithmetics, but never got to that. This did not really fly becuase of the regalloc not really being able to understnad it (I made path to regclass to propagate the classes and figure out what operations needs to stay in i387 and what in SSE to avoid reloading, but that never got in). I believe Vladimir did some work on this with IRA (he is able to spill GPR regs into SSE and do bit of other tricks). Also I believe it was kind of Richard's design deicsion to avoid use of (paradoxical) subregs for vector conversions because these have funny implications. The code for handling upper parts of paradoxical subregs is controlled by macros around SUBREG_PROMOTED_VAR_P but I do not think it will handle V1DI-V2DI conversions fluently without some middle-end hacking. (it will probably try to produce zero extensions) When we are on SSE instructions, it would be great to finally teach copy_by_pieces/store_by_pieces to use vector instructions (these are more compact and either equaly fast or faster on some CPUs). I hope to get into this, but it would be great if someone beat me. Honza I'm trying to implement it as separate RTL pass which chooses a scalar/vector mode for each 64bit computation chain and performs transformation if we choose to use vectors. I also want to split DI instructions which are going to be implemented on GPRs before RA (currently it is done on the second split). 
Good metrics for such a transformation are a big question, but currently I can't even make it generate correct code when paradoxical subregs are used. It works in simple cases, but I run into trouble when spills appear. Trying to beat the following testcase:

  test (long long *arr)
  {
    register unsigned long long tmp;

    tmp = arr[0] | arr[1] & arr[2];
    while (tmp)
      {
        counter (tmp);
        tmp = *(arr++) & tmp;
      }
  }

RTL I generate seems OK to me (ignoring the fact that it is not optimal):

(insn 6 3 50 2 (set (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ]) (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ]) (const_int 8 [0x8])) [2 MEM[(long long int *)arr_5(D) + 8B]+0 S8 A64])) pr65105-1.c:22 89 {*movdi_internal} (nil))
(insn 50 6 7 2 (set (reg:DI 104) (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ]) (const_int 16 [0x10])) [2 MEM[(long long int *)arr_5(D) + 16B]+0 S8 A64])) pr65105-1.c:22 -1 (nil))
(insn 7 50 51 2 (set (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0) (and:V2DI (subreg:V2DI (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ]) 0) (subreg:V2DI (reg:DI 104) 0))) pr65105-1.c:22 3487 {*andv2di3} (expr_list:REG_DEAD (subreg:V2DI (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ]) 0) (expr_list:REG_UNUSED (reg:CC 17 flags) (expr_list:REG_EQUAL (and:DI (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ]) (const_int 8 [0x8])) [2 MEM[(long long int *)arr_5(D) + 8B]+0 S8 A64]) (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ]) (const_int 16 [0x10])) [2 MEM[(long long int *)arr_5(D) + 16B]+0 S8 A64])) (nil)))))
(insn 51 7 8 2 (set (reg:DI 105) (mem:DI (reg/v/f:SI 96 [ arr ]) [2 *arr_5(D)+0 S8 A64])) pr65105-1.c:22 -1 (nil))
(insn 8 51 46 2 (set (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0) (ior:V2DI (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0) (subreg:V2DI (reg:DI 105) 0))) pr65105-1.c:22 3489 {*iorv2di3} (expr_list:REG_DEAD (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil))))
(insn 46 8 47 2 (set (reg:V2DI 103) (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)) pr65105-1.c:22 -1 (nil))
(insn 47 46 48 2 (set (subreg:SI (reg:DI 101) 0) (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1 (nil))
(insn 48 47 49 2 (set (reg:V2DI 103) (lshiftrt:V2DI (reg:V2DI 103) (const_int 32 [0x20]))) pr65105-1.c:22 -1 (nil))
(insn 49 48 9 2 (set (subreg:SI (reg:DI 101) 4) (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1 (nil))
(note 9 49 10 2 NOTE_INSN_DELETED)
(insn 10 9 11 2 (parallel [ (set (reg:CCZ 17 flags) (compare:CCZ (ior:SI (subreg:SI (reg:DI 101) 4) (subreg:SI (reg:DI 101) 0)) (const_int 0 [0]))) (clobber (scratch:SI)) ]) pr65105-1.c:23 447 {*iorsi_3} (nil))
(jump_insn 11 10 37 2 (set (pc)
Re: [i386] Scalar DImode instructions on XMM registers
On 05/07/2015 10:59 AM, Uros Bizjak wrote: If we consider SSE operations as DImode operations, we will lose the ability to precisely specify which operation (SSE vs. general reg) we want. I'm afraid that in DImode case, combine will choose FLAG-less pattern that will mandate moves from general regs to SSE regs and back. This was the reason to invent V1DImode/V1TImode vectors to avoid moving double-mode values to MMX/SSE regs for double-mode shifts. It would of course have to be a combined pattern. The problem being addressed by V1TImode is that SSE doesn't really support TImode arithmetic. We've got some logical operations and restricted shifting, but no addition, multiplication, or fully general shifting. The problem being addressed by V1DImode is MMX, about which I believe I need say nothing more, and the fact that lower-subreg produces better results than the current RA. r~
Re: [i386] Scalar DImode instructions on XMM registers
On 05/07/2015 09:24 AM, Richard Henderson wrote: I was wondering this morning about the possibility of a kind of constraint that would allow RA to generate pairs of registers via CONCAT. That is, the two hard registers within the CONCAT are collectively the double-word allocation, but need not be sequential like current multi-word allocations. A target using such a constraint is promising to handle the CONCAT either by splitting (and gen_lowpart et al), or print_operand letters (e.g. the m68k %R, for outputting the low part of a pair). With that, we get the best of both -- lower-subreg effectively happening in RA, and DImode arithmetic in SSE, no subregs required.

I forgot one issue that lower-subreg also cures -- describing the lifetime of the pair of registers. We wouldn't get that with a single bit saying that CONCAT is ok. E.g.

  di100 = di101 + di102

split to

  (flags, si200) = si201 + si202
  si300 = si301 + si302 + carry(flags)

If we split prior to RA, we can see that si200 cannot overlap si301 or si302. If we split after RA, we have to handle this ourselves in the backend, leading to additional matching-constraint alternatives and/or early-clobbers.

We'd need a couple of bits: one saying that concat is ok, the other saying whether all lows are consumed before all highs, when allocating a set of CONCATs across all of the operands. Or perhaps we don't need such a bit and we merely include high inputs not clobbered by low output as part of the contract with RA. r~
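The overlap constraint in the split add above can be modelled in plain C. This is only an illustrative sketch, not GCC code: the `pair32` type and all function names are hypothetical, and "sharing a hard register" is modelled by overwriting the high input with the low result before the high add reads it.

```c
#include <stdint.h>

/* A double-word value held as two independent 32-bit words
   (hypothetical model of a DImode value after splitting). */
typedef struct { uint32_t lo, hi; } pair32;

uint64_t pair_to_u64(pair32 p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* Correct split: the low add produces the carry, the high add consumes it. */
pair32 add_pair(pair32 a, pair32 b)
{
    pair32 r;
    r.lo = a.lo + b.lo;              /* (flags, si200) = si201 + si202 */
    uint32_t carry = r.lo < a.lo;    /* carry out of the low add */
    r.hi = a.hi + b.hi + carry;      /* si300 = si301 + si302 + carry(flags) */
    return r;
}

/* Broken allocation: the low result lands in the register that still
   holds the high input a.hi (si200 overlapping si301). */
pair32 add_pair_low_clobbers_high(pair32 a, pair32 b)
{
    pair32 r;
    r.lo = a.lo + b.lo;
    uint32_t carry = r.lo < a.lo;
    a.hi = r.lo;                     /* low output overwrites high input */
    r.hi = a.hi + b.hi + carry;      /* high add now reads the wrong value */
    return r;
}
```

With a = 0x1FFFFFFFF and b = 0x200000001 the correct version yields a high word of 4, while the clobbering version yields 3, which is the failure that matching constraints or early-clobbers would have to rule out when splitting after RA.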
Re: [i386] Scalar DImode instructions on XMM registers
On Thu, May 7, 2015 at 6:24 PM, Richard Henderson r...@redhat.com wrote: On 04/24/2015 06:32 PM, Jan Hubicka wrote: Also I believe it was kind of Richard's design decision to avoid use of (paradoxical) subregs for vector conversions because these have funny implications. Yes indeed. The code for handling upper parts of paradoxical subregs is controlled by macros around SUBREG_PROMOTED_VAR_P but I do not think it will handle V1DI-V2DI conversions fluently without some middle-end hacking (it will probably try to produce zero extensions). When we are on SSE instructions, it would be great to finally teach copy_by_pieces/store_by_pieces to use vector instructions (these are more compact and either equally fast or faster on some CPUs). I hope to get into this, but it would be great if someone beat me to it. Well, I think it would be worthwhile to teach the i386 backend how to do 64-bit vectors in SSE registers. First, this would aid portability with other targets who may have GCC generic vectors written only for 8 byte quantities. Since we do have zero-extending 8 byte load/store insns for SSE, we don't actually need paradoxical regs, just additional macro-ization of the existing patterns. If we consider SSE operations as DImode operations, we will lose the ability to precisely specify which operation (SSE vs. general reg) we want. I'm afraid that in DImode case, combine will choose FLAG-less pattern that will mandate moves from general regs to SSE regs and back. This was the reason to invent V1DImode/V1TImode vectors to avoid moving double-mode values to MMX/SSE regs for double-mode shifts. The alternative would be RA that is able to select between alternative instructions, not only between alternative register classes. Uros.
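The "zero-extending 8 byte load" mentioned in the quoted text can be observed from C with SSE2 intrinsics. The sketch below assumes an x86 target with SSE2 enabled; the function names are made up for illustration. `_mm_loadl_epi64` corresponds to a `movq` load, which fills the low 64 bits and clears the upper 64 bits of the XMM register.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Low 64-bit lane after an 8-byte load into an XMM register. */
uint64_t low_lane_after_movq(const uint64_t *p)
{
    __m128i v = _mm_loadl_epi64((const __m128i *)p);  /* movq mem -> xmm */
    uint64_t out;
    _mm_storel_epi64((__m128i *)&out, v);             /* movq xmm -> mem */
    return out;
}

/* High 64-bit lane after the same load; the load zeroes it. */
uint64_t high_lane_after_movq(const uint64_t *p)
{
    __m128i v = _mm_loadl_epi64((const __m128i *)p);
    __m128i hi = _mm_unpackhi_epi64(v, v);            /* bring lane 1 down to lane 0 */
    uint64_t out;
    _mm_storel_epi64((__m128i *)&out, hi);
    return out;
}
```

This only shows the hardware behaviour of the load; whether the compiler can exploit that knowledge for subregs is exactly the open question in the thread.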
Re: [i386] Scalar DImode instructions on XMM registers
On 04/24/2015 06:32 PM, Jan Hubicka wrote: Also I believe it was kind of Richard's design decision to avoid use of (paradoxical) subregs for vector conversions because these have funny implications. Yes indeed. The code for handling upper parts of paradoxical subregs is controlled by macros around SUBREG_PROMOTED_VAR_P but I do not think it will handle V1DI-V2DI conversions fluently without some middle-end hacking (it will probably try to produce zero extensions). When we are on SSE instructions, it would be great to finally teach copy_by_pieces/store_by_pieces to use vector instructions (these are more compact and either equally fast or faster on some CPUs). I hope to get into this, but it would be great if someone beat me to it. Well, I think it would be worthwhile to teach the i386 backend how to do 64-bit vectors in SSE registers. First, this would aid portability with other targets who may have GCC generic vectors written only for 8 byte quantities. Since we do have zero-extending 8 byte load/store insns for SSE, we don't actually need paradoxical regs, just additional macro-ization of the existing patterns. This almost certainly would conflict with the MMX code generation. But given the problems we've always had with that, perhaps it's time to kill that off. To a large extent we can preserve source compatibility with MMX builtins once we have 8-byte vectors implemented in SSE. As for the subject, we'd want to delay expansion of DImode arithmetic until after RA. That bypasses all of the good work done in lower-subreg.c, so we need some sort of replacement. I was wondering this morning about the possibility of a kind of constraint that would allow RA to generate pairs of registers via CONCAT. That is, the two hard registers within the CONCAT are collectively the double-word allocation, but need not be sequential like current multi-word allocations.
A target using such a constraint is promising to handle the CONCAT either by splitting (and gen_lowpart et al), or print_operand letters (e.g. the m68k %R, for outputting the low part of a pair). With that, we get the best of both -- lower-subreg effectively happening in RA, and DImode arithmetic in SSE, no subregs required. r~
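For readers unfamiliar with "GCC generic vectors written only for 8 byte quantities", here is a minimal sketch of such code using the `vector_size` attribute (a GCC extension, also accepted by Clang). The type name `v4hi_u` and the function `add_4x16` are illustrative assumptions; how the backend implements the operation (MMX, SSE, or scalar code) is the subject of the discussion above and is not asserted here.

```c
#include <stdint.h>
#include <string.h>

/* Four 16-bit lanes packed into an 8-byte generic vector. */
typedef uint16_t v4hi_u __attribute__((vector_size(8)));

/* Lane-wise add of two 8-byte quantities; each 16-bit lane wraps
   independently, so no carry crosses a lane boundary. */
uint64_t add_4x16(uint64_t x, uint64_t y)
{
    v4hi_u a, b, r;
    uint64_t out;

    memcpy(&a, &x, sizeof a);
    memcpy(&b, &y, sizeof b);
    r = a + b;                      /* generic vector addition */
    memcpy(&out, &r, sizeof out);
    return out;
}
```

Adding 0x0001FFFF00020003 and 0x0001000100010001 this way gives 0x0002000000030004, whereas a plain 64-bit add would carry out of the 0xFFFF lane and give 0x0003000000030004.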
Re: [i386] Scalar DImode instructions on XMM registers
2015-04-25 4:32 GMT+03:00 Jan Hubicka hubi...@ucw.cz: Hi, I am adding Vladimir and Richard into CC. I tried to solve a similar problem with FP math years ago by having -mfpmath=sse,i387. The idea was to allow use of i387 registers when SSE ones run out and possibly also model the fact that Pentium4 had faster i387 additions than SSE additions. I also had some plans to extend this to mixed SSE/MMX/GPR integer arithmetic, but never got to that. This did not really fly because of the regalloc not really being able to understand it (I made a patch to regclass to propagate the classes and figure out what operations need to stay in i387 and what in SSE to avoid reloading, but that never got in). I believe Vladimir did some work on this with IRA (he is able to spill GPR regs into SSE and do a bit of other tricks). Also I believe it was kind of Richard's design decision to avoid use of (paradoxical) subregs for vector conversions because these have funny implications. The code for handling upper parts of paradoxical subregs is controlled by macros around SUBREG_PROMOTED_VAR_P but I do not think it will handle V1DI-V2DI conversions fluently without some middle-end hacking (it will probably try to produce zero extensions). When we are on SSE instructions, it would be great to finally teach copy_by_pieces/store_by_pieces to use vector instructions (these are more compact and either equally fast or faster on some CPUs). I hope to get into this, but it would be great if someone beat me to it. Honza I'm trying to implement it as a separate RTL pass which chooses a scalar/vector mode for each 64bit computation chain and performs the transformation if we choose to use vectors. I also want to split DI instructions which are going to be implemented on GPRs before RA (currently it is done on the second split). Good metrics for such a transformation are a big question, but currently I can't even make it generate correct code when paradoxical subregs are used.
It works in simple cases, but I run into trouble when spills appear. Trying to beat the following testcase:

  test (long long *arr)
  {
    register unsigned long long tmp;

    tmp = arr[0] | arr[1] & arr[2];
    while (tmp)
      {
        counter (tmp);
        tmp = *(arr++) & tmp;
      }
  }

RTL I generate seems OK to me (ignoring the fact that it is not optimal):

(insn 6 3 50 2 (set (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ]) (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ]) (const_int 8 [0x8])) [2 MEM[(long long int *)arr_5(D) + 8B]+0 S8 A64])) pr65105-1.c:22 89 {*movdi_internal} (nil))
(insn 50 6 7 2 (set (reg:DI 104) (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ]) (const_int 16 [0x10])) [2 MEM[(long long int *)arr_5(D) + 16B]+0 S8 A64])) pr65105-1.c:22 -1 (nil))
(insn 7 50 51 2 (set (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0) (and:V2DI (subreg:V2DI (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ]) 0) (subreg:V2DI (reg:DI 104) 0))) pr65105-1.c:22 3487 {*andv2di3} (expr_list:REG_DEAD (subreg:V2DI (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ]) 0) (expr_list:REG_UNUSED (reg:CC 17 flags) (expr_list:REG_EQUAL (and:DI (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ]) (const_int 8 [0x8])) [2 MEM[(long long int *)arr_5(D) + 8B]+0 S8 A64]) (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ]) (const_int 16 [0x10])) [2 MEM[(long long int *)arr_5(D) + 16B]+0 S8 A64])) (nil)))))
(insn 51 7 8 2 (set (reg:DI 105) (mem:DI (reg/v/f:SI 96 [ arr ]) [2 *arr_5(D)+0 S8 A64])) pr65105-1.c:22 -1 (nil))
(insn 8 51 46 2 (set (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0) (ior:V2DI (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0) (subreg:V2DI (reg:DI 105) 0))) pr65105-1.c:22 3489 {*iorv2di3} (expr_list:REG_DEAD (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil))))
(insn 46 8 47 2 (set (reg:V2DI 103) (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)) pr65105-1.c:22 -1 (nil))
(insn 47 46 48 2 (set (subreg:SI (reg:DI 101) 0) (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1 (nil))
(insn 48 47 49 2 (set (reg:V2DI 103) (lshiftrt:V2DI (reg:V2DI 103) (const_int 32 [0x20]))) pr65105-1.c:22 -1 (nil))
(insn 49 48 9 2 (set (subreg:SI (reg:DI 101) 4) (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1 (nil))
(note 9 49 10 2 NOTE_INSN_DELETED)
(insn 10 9 11 2 (parallel [ (set (reg:CCZ 17 flags) (compare:CCZ (ior:SI (subreg:SI (reg:DI 101) 4) (subreg:SI (reg:DI 101) 0)) (const_int 0 [0]))) (clobber (scratch:SI)) ]) pr65105-1.c:23 447 {*iorsi_3} (nil))
(jump_insn 11 10 37 2 (set (pc) (if_then_else (ne (reg:CCZ 17 flags) (const_int 0 [0])) (label_ref:SI 37) (pc))) pr65105-1.c:23 619 {*jcc_1}
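Insns 46-49 and 10 in the dump above move the 64-bit value out of the vector register as two 32-bit halves (low word, logical shift right by 32, low word again) and OR the halves to test for zero. A rough C equivalent with SSE2 intrinsics, assuming an x86 target with SSE2 enabled and using hypothetical function names, could look like this:

```c
#include <emmintrin.h>
#include <stdint.h>

/* Load a 64-bit value into the low lane of an XMM register (movq). */
__m128i load_di(const uint64_t *p)
{
    return _mm_loadl_epi64((const __m128i *)p);
}

/* Test whether the low 64-bit lane is non-zero by extracting its two
   32-bit halves into general registers, mirroring insns 47-49 and 10. */
int di_in_xmm_is_nonzero(__m128i v)
{
    uint32_t lo = (uint32_t)_mm_cvtsi128_si32(v);        /* movd: bits 0..31 */
    __m128i shifted = _mm_srli_epi64(v, 32);             /* psrlq $32 */
    uint32_t hi = (uint32_t)_mm_cvtsi128_si32(shifted);  /* movd: former bits 32..63 */
    return (lo | hi) != 0;                               /* orl, then the jump */
}
```

Because `psrlq` shifts each 64-bit lane separately and `movd` reads only the low 32 bits, whatever sits in the upper lane of the register does not influence the result.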
Re: [i386] Scalar DImode instructions on XMM registers
On Fri, Apr 24, 2015 at 11:45 AM, Uros Bizjak ubiz...@gmail.com wrote: On Fri, Apr 24, 2015 at 11:22 AM, Ilya Enkovich enkovich@gmail.com wrote: I was looking into PR65105 and tried to generate SSE computation for a simple 64bit a + b + c sequence. Having no scalar integer instructions in SSE I have to use vector variants. Is this approach really better than having two add/addc instructions? FYI, V1DI mode was introduced because XMM shift insns were used to shift DImode values. The cost of moves from/to integer DImode reg pair was disastrous. Uros.
Re: [i386] Scalar DImode instructions on XMM registers
On Fri, Apr 24, 2015 at 11:22 AM, Ilya Enkovich enkovich@gmail.com wrote: I was looking into PR65105 and tried to generate SSE computation for a simple 64bit a + b + c sequence. Having no scalar integer instructions in SSE I have to use vector variants. Is this approach really better than having two add/addc instructions? Uros.
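To make the two alternatives concrete, here is a small sketch of the 64-bit a + b + c sequence done with vector instructions on the low lane of XMM registers, next to the plain scalar form that a 32-bit target would expand into add/adc pairs. It assumes an x86 target with SSE2 enabled; the function names are hypothetical and nothing here reflects what the proposed pass actually emits.

```c
#include <emmintrin.h>
#include <stdint.h>

/* a + b + c computed with the vector add paddq; only lane 0 is used. */
uint64_t add3_sse(const uint64_t *a, const uint64_t *b, const uint64_t *c)
{
    __m128i va = _mm_loadl_epi64((const __m128i *)a);   /* movq load */
    __m128i vb = _mm_loadl_epi64((const __m128i *)b);
    __m128i vc = _mm_loadl_epi64((const __m128i *)c);
    __m128i sum = _mm_add_epi64(_mm_add_epi64(va, vb), vc);  /* two paddq */
    uint64_t out;

    _mm_storel_epi64((__m128i *)&out, sum);              /* movq store */
    return out;
}

/* Scalar reference; on i386 each 64-bit add becomes an add/adc pair. */
uint64_t add3_scalar(uint64_t a, uint64_t b, uint64_t c)
{
    return a + b + c;
}
```

The vector form needs no carry flag and no GPR pairs, but every value that starts or ends in general registers has to be moved across, which is the cost concern raised in the replies.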
Re: [i386] Scalar DImode instructions on XMM registers
On Fri, Apr 24, 2015 at 12:14 PM, Uros Bizjak ubiz...@gmail.com wrote: I was looking into PR65105 and tried to generate SSE computation for a simple 64bit a + b + c sequence. Having no scalar integer instructions in SSE I have to use vector variants. Is this approach really better than having two add/addc instructions? FYI, V1DI mode was introduced because XMM shift insns were used to shift DImode values. The cost of moves from/to integer DImode reg pair was disastrous. Uros. Does it mean I have to add V1DI instructions for all opcodes I want to transform (add,sub,mul,or,and, etc.)? No. Please try to generate paradoxical subreg (V2DImode subreg of V1DImode pseudo). IIRC, there is some functionality in the compiler that is able to tell if the highpart of the paradoxical register is zeroed. Probably you can even generate paradoxical V2DImode subreg of DImode. I'm not sure whether in this case the register allocator degenerates the mode of the resulting hard register to DImode, but it is worth a try. Uros.
Re: [i386] Scalar DImode instructions on XMM registers
2015-04-24 12:49 GMT+03:00 Uros Bizjak ubiz...@gmail.com: On Fri, Apr 24, 2015 at 11:45 AM, Uros Bizjak ubiz...@gmail.com wrote: On Fri, Apr 24, 2015 at 11:22 AM, Ilya Enkovich enkovich@gmail.com wrote: I was looking into PR65105 and tried to generate SSE computation for a simple 64bit a + b + c sequence. Having no scalar integer instructions in SSE I have to use vector variants. Is this approach really better than having two add/addc instructions? FYI, V1DI mode was introduced because XMM shift insns were used to shift DImode values. The cost of moves from/to integer DImode reg pair was disastrous. Uros. Does it mean I have to add V1DI instructions for all opcodes I want to transform (add,sub,mul,or,and, etc.)? Ilya
Re: [i386] Scalar DImode instructions on XMM registers
On Fri, Apr 24, 2015 at 12:09 PM, Ilya Enkovich enkovich@gmail.com wrote: I was looking into PR65105 and tried to generate SSE computation for a simple 64bit a + b + c sequence. Having no scalar integer instructions in SSE I have to use vector variants. Is this approach really better than having two add/addc instructions? FYI, V1DI mode was introduced because XMM shift insns were used to shift DImode values. The cost of moves from/to integer DImode reg pair was disastrous. Uros. Does it mean I have to add V1DI instructions for all opcodes I want to transform (add,sub,mul,or,and, etc.)? No. Please try to generate paradoxical subreg (V2DImode subreg of V1DImode pseudo). IIRC, there is some functionality in the compiler that is able to tell if the highpart of the paradoxical register is zeroed. Uros.
Re: [i386] Scalar DImode instructions on XMM registers
On Fri, 24 Apr 2015, Uros Bizjak wrote: Please try to generate paradoxical subreg (V2DImode subreg of V1DImode pseudo). IIRC, there is some functionality in the compiler that is able to tell if the highpart of the paradoxical register is zeroed. Those are not currently legal (I tried to change that) https://gcc.gnu.org/ml/gcc-patches/2013-03/msg00745.html https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00769.html In this case, a subreg:V2DI of DImode should work. -- Marc Glisse
Re: [i386] Scalar DImode instructions on XMM registers
2015-04-24 12:45 GMT+03:00 Uros Bizjak ubiz...@gmail.com: On Fri, Apr 24, 2015 at 11:22 AM, Ilya Enkovich enkovich@gmail.com wrote: I was looking into PR65105 and tried to generate SSE computation for a simple 64bit a + b + c sequence. Having no scalar integer instructions in SSE I have to use vector variants. Is this approach really better than having two add/addc instructions? We surely shouldn't apply this to every DI instruction; we should compute transformation costs. It is profitable if not many conversions are required, it helps to relax GPR pressure, and we expect it to be profitable for mul. Performance tests will show if this is useful. I want to make a small prototype and try it. Ilya Uros.
Re: [i386] Scalar DImode instructions on XMM registers
2015-04-24 13:27 GMT+03:00 Marc Glisse marc.gli...@inria.fr: On Fri, 24 Apr 2015, Uros Bizjak wrote: Please try to generate paradoxical subreg (V2DImode subreg of V1DImode pseudo). IIRC, there is some functionality in the compiler that is able to tell if the highpart of the paradoxical register is zeroed. Those are not currently legal (I tried to change that) https://gcc.gnu.org/ml/gcc-patches/2013-03/msg00745.html https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00769.html In this case, a subreg:V2DI of DImode should work. -- Marc Glisse Thank you for your tips! It seems to work, will try and see what it gives us for i386. Thanks, Ilya
Re: [i386] Scalar DImode instructions on XMM registers
Hi, I am adding Vladimir and Richard into CC. I tried to solve a similar problem with FP math years ago by having -mfpmath=sse,i387. The idea was to allow use of i387 registers when SSE ones run out and possibly also model the fact that Pentium4 had faster i387 additions than SSE additions. I also had some plans to extend this to mixed SSE/MMX/GPR integer arithmetic, but never got to that. This did not really fly because of the regalloc not really being able to understand it (I made a patch to regclass to propagate the classes and figure out what operations need to stay in i387 and what in SSE to avoid reloading, but that never got in). I believe Vladimir did some work on this with IRA (he is able to spill GPR regs into SSE and do a bit of other tricks). Also I believe it was kind of Richard's design decision to avoid use of (paradoxical) subregs for vector conversions because these have funny implications. The code for handling upper parts of paradoxical subregs is controlled by macros around SUBREG_PROMOTED_VAR_P but I do not think it will handle V1DI-V2DI conversions fluently without some middle-end hacking (it will probably try to produce zero extensions). When we are on SSE instructions, it would be great to finally teach copy_by_pieces/store_by_pieces to use vector instructions (these are more compact and either equally fast or faster on some CPUs). I hope to get into this, but it would be great if someone beat me to it. Honza 2015-04-24 13:27 GMT+03:00 Marc Glisse marc.gli...@inria.fr: On Fri, 24 Apr 2015, Uros Bizjak wrote: Please try to generate paradoxical subreg (V2DImode subreg of V1DImode pseudo). IIRC, there is some functionality in the compiler that is able to tell if the highpart of the paradoxical register is zeroed. Those are not currently legal (I tried to change that) https://gcc.gnu.org/ml/gcc-patches/2013-03/msg00745.html https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00769.html In this case, a subreg:V2DI of DImode should work.
-- Marc Glisse Thank you for your tips! It seems to work, will try and see what it gives us for i386. Thanks, Ilya