Re: [i386] Scalar DImode instructions on XMM registers

2015-06-03 Thread Jeff Law

On 05/27/2015 07:20 AM, Ilya Enkovich wrote:


I looked into the assign_stack_local_1 call for this spill. LRA correctly
requests a 16-byte size with 16-byte alignment. But
assign_stack_local_1 reduces the alignment to 8 because the estimated
stack alignment before RA is 8 and the requested mode's (DI) alignment
fits it. Probably LRA should pass biggest_mode of the reg when
requesting a stack slot?

It's hard to say for sure.  Within the lra_reg structure, biggest_mode
refers to the largest mode in which a pseudo is referenced.  So for a
pseudo it might make sense.  Presumably the biggest_mode for the pseudo
in question is larger than DImode, right?




I handled it by increasing stack_alignment_estimated when transforming
some instructions to vector mode.

I haven't looked deeply, but if your pass runs after
stack_alignment_estimated is initially computed, then this seems like a
desirable way to fix the problem.


jeff
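
[Editor's sketch] The biggest_mode idea discussed above can be illustrated
with a small toy model. This is not GCC code: the toy_* names and the
byte-sized mode enum are assumptions made up for illustration. It only shows
why a pseudo created in DImode but also referenced as V2DI needs a 16-byte
slot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for machine modes; each value is a size in bytes.  */
enum toy_mode { TOY_DImode = 8, TOY_V2DImode = 16 };

/* Toy per-pseudo record, loosely inspired by the idea behind lra_reg.  */
struct toy_reg_info
{
  enum toy_mode decl_mode;     /* mode the pseudo was created with */
  enum toy_mode biggest_mode;  /* widest mode seen in any reference */
};

/* Start tracking a pseudo created in MODE.  */
void
toy_init_reg (struct toy_reg_info *info, enum toy_mode mode)
{
  info->decl_mode = mode;
  info->biggest_mode = mode;
}

/* Record one reference to the pseudo in MODE, e.g. through a subreg.  */
void
toy_note_reference (struct toy_reg_info *info, enum toy_mode mode)
{
  if ((size_t) mode > (size_t) info->biggest_mode)
    info->biggest_mode = mode;
}

/* Size in bytes a spill slot needs in order to serve every reference.  */
size_t
toy_spill_slot_size (const struct toy_reg_info *info)
{
  assert ((size_t) info->biggest_mode >= (size_t) info->decl_mode);
  return (size_t) info->biggest_mode;
}
```

Calling toy_init_reg with TOY_DImode and then toy_note_reference with
TOY_V2DImode makes toy_spill_slot_size report 16 rather than 8, which is the
situation Jeff asks about.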


Re: [i386] Scalar DImode instructions on XMM registers

2015-05-27 Thread Ilya Enkovich
2015-05-27 6:31 GMT+03:00 Jeff Law l...@redhat.com:
 On 05/25/2015 09:27 AM, Ilya Enkovich wrote:

 2015-05-22 15:01 GMT+03:00 Ilya Enkovich enkovich@gmail.com:

 2015-05-22 11:53 GMT+03:00 Ilya Enkovich enkovich@gmail.com:

 2015-05-21 22:08 GMT+03:00 Vladimir Makarov vmaka...@redhat.com:

 So, Ilya, to solve the problem you need to avoid sharing subregs for
 the
 correct LRA/reload work.



 Thanks a lot for your help! I'll fix it.

 Ilya


 I've fixed SUBREG sharing and got a missing spill. I added
 --enable-checking=rtl to check other possible bugs. Spill/fill code
 still seems incorrect because different sizes are used.  Shouldn't
 block me though.

 .L5:
  movl    16(%esp), %eax
  addl    $8, %esi
  movl    20(%esp), %edx
  movl    %eax, (%esp)
  movl    %edx, 4(%esp)
  call    counter@PLT
  movq    -8(%esi), %xmm0
  **movdqa  16(%esp), %xmm2**
  pand    %xmm0, %xmm2
  movdqa  %xmm2, %xmm0
  movd    %xmm2, %edx
  **movq    %xmm2, 16(%esp)**
  psrlq   $32, %xmm0
  movd    %xmm0, %eax
  orl     %edx, %eax
  jne     .L5

 Thanks,
 Ilya


 I was wrong to assume that reloads with the wrong size wouldn't block me.
 These reloads require memory to be aligned, which is not always true. Here is
 what I have in RTL now:

 (insn 2 7 3 2 (set (reg/v:DI 93 [ l ])
  (mem/c:DI (reg/f:SI 16 argp) [1 l+0 S8 A32])) test.c:5 89
 {*movdi_internal}
   (nil))
 ...
 (insn 27 26 52 6 (set (subreg:V2DI (reg:DI 87 [ D.1822 ]) 0)
  (ior:V2DI (subreg:V2DI (reg:DI 99 [ D.1822 ]) 0)
  (subreg:V2DI (reg/v:DI 93 [ l ]) 0))) test.c:11 3489
 {*iorv2di3}
   (expr_list:REG_DEAD (reg:DI 99 [ D.1822 ])
  (expr_list:REG_DEAD (reg/v:DI 93 [ l ])
  (nil))))

 After reload I get:

 (insn 2 7 75 2 (set (reg/v:DI 0 ax [orig:93 l ] [93])
  (mem/c:DI (plus:SI (reg/f:SI 7 sp)
  (const_int 24 [0x18])) [1 l+0 S8 A32])) test.c:5 89
 {*movdi_internal}
   (nil))
 (insn 75 2 3 2 (set (mem/c:DI (reg/f:SI 7 sp) [3 %sfp+-16 S8 A64])
  (reg/v:DI 0 ax [orig:93 l ] [93])) test.c:5 89 {*movdi_internal}
   (nil))
 ...
 (insn 27 26 52 6 (set (reg:V2DI 21 xmm0 [orig:87 D.1822 ] [87])
  (ior:V2DI (reg:V2DI 21 xmm0 [orig:99 D.1822 ] [99])
  (mem/c:V2DI (reg/f:SI 7 sp) [3 %sfp+-16 S16 A64])))
 test.c:11 3489 {*iorv2di3}


 The 'por' instruction requires its memory operand to be aligned and fails in
 a bigger testcase. Reload also generates a movdqa from an esp-relative slot.
 Does this mean I still have some inconsistencies in the produced RTL? Probably
 I should somehow transform loads and stores?

 I'd start by looking at the AP-SP elimination step: what the defined
 stack alignment is, and whether or not a dynamic stack realignment is needed.
 If you don't have all that set up properly prior to the allocators, then
 they're not going to know which objects to align nor how to align them.

I looked into the assign_stack_local_1 call for this spill. LRA correctly
requests a 16-byte size with 16-byte alignment. But
assign_stack_local_1 reduces the alignment to 8 because the estimated
stack alignment before RA is 8 and the requested mode's (DI) alignment
fits it. Probably LRA should pass biggest_mode of the reg when
requesting a stack slot?

I handled it by increasing stack_alignment_estimated when transforming
some instructions to vector mode.

Thanks for help!

Ilya
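
[Editor's sketch] The alignment reduction described in this message can be
pictured with the toy model below. It is only a sketch of the behaviour as
described in the email, not the logic of assign_stack_local_1 itself; the
toy_* names and the simplified rule are assumptions. Alignments are in bytes,
as in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if ADDR is a multiple of ALIGN, which must be a power of two.  */
bool
toy_is_aligned (uintptr_t addr, unsigned align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return (addr & (align - 1)) == 0;
}

/* Toy version of the clamping described above: when the requested
   alignment exceeds the estimated stack alignment but the mode's own
   alignment fits the estimate, the slot only gets the mode's alignment.  */
unsigned
toy_slot_alignment (unsigned requested, unsigned mode_align,
                    unsigned estimated)
{
  if (requested > estimated && mode_align <= estimated)
    return mode_align;
  return requested;
}
```

With requested = 16, mode_align = 8 and estimated = 8 the model returns 8, so
a slot at an address such as 0x1008 is acceptable to the allocator but not to
an instruction that needs 16-byte alignment. Raising the estimate to 16, as
the workaround in this message does, makes the model return 16.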


 jeff



Re: [i386] Scalar DImode instructions on XMM registers

2015-05-26 Thread Jeff Law

On 05/25/2015 09:27 AM, Ilya Enkovich wrote:

2015-05-22 15:01 GMT+03:00 Ilya Enkovich enkovich@gmail.com:

2015-05-22 11:53 GMT+03:00 Ilya Enkovich enkovich@gmail.com:

2015-05-21 22:08 GMT+03:00 Vladimir Makarov vmaka...@redhat.com:

So, Ilya, to solve the problem you need to avoid sharing subregs for the
correct LRA/reload work.




Thanks a lot for your help! I'll fix it.

Ilya


I've fixed SUBREG sharing and got a missing spill. I added
--enable-checking=rtl to check other possible bugs. Spill/fill code
still seems incorrect because different sizes are used.  Shouldn't
block me though.

.L5:
 movl    16(%esp), %eax
 addl    $8, %esi
 movl    20(%esp), %edx
 movl    %eax, (%esp)
 movl    %edx, 4(%esp)
 call    counter@PLT
 movq    -8(%esi), %xmm0
 **movdqa  16(%esp), %xmm2**
 pand    %xmm0, %xmm2
 movdqa  %xmm2, %xmm0
 movd    %xmm2, %edx
 **movq    %xmm2, 16(%esp)**
 psrlq   $32, %xmm0
 movd    %xmm0, %eax
 orl     %edx, %eax
 jne     .L5

Thanks,
Ilya


I was wrong to assume that reloads with the wrong size wouldn't block me.
These reloads require memory to be aligned, which is not always true. Here is
what I have in RTL now:

(insn 2 7 3 2 (set (reg/v:DI 93 [ l ])
 (mem/c:DI (reg/f:SI 16 argp) [1 l+0 S8 A32])) test.c:5 89
{*movdi_internal}
  (nil))
...
(insn 27 26 52 6 (set (subreg:V2DI (reg:DI 87 [ D.1822 ]) 0)
 (ior:V2DI (subreg:V2DI (reg:DI 99 [ D.1822 ]) 0)
 (subreg:V2DI (reg/v:DI 93 [ l ]) 0))) test.c:11 3489 {*iorv2di3}
  (expr_list:REG_DEAD (reg:DI 99 [ D.1822 ])
 (expr_list:REG_DEAD (reg/v:DI 93 [ l ])
 (nil))))

After reload I get:

(insn 2 7 75 2 (set (reg/v:DI 0 ax [orig:93 l ] [93])
 (mem/c:DI (plus:SI (reg/f:SI 7 sp)
 (const_int 24 [0x18])) [1 l+0 S8 A32])) test.c:5 89
{*movdi_internal}
  (nil))
(insn 75 2 3 2 (set (mem/c:DI (reg/f:SI 7 sp) [3 %sfp+-16 S8 A64])
 (reg/v:DI 0 ax [orig:93 l ] [93])) test.c:5 89 {*movdi_internal}
  (nil))
...
(insn 27 26 52 6 (set (reg:V2DI 21 xmm0 [orig:87 D.1822 ] [87])
 (ior:V2DI (reg:V2DI 21 xmm0 [orig:99 D.1822 ] [99])
 (mem/c:V2DI (reg/f:SI 7 sp) [3 %sfp+-16 S16 A64])))
test.c:11 3489 {*iorv2di3}


The 'por' instruction requires its memory operand to be aligned and fails in
a bigger testcase. Reload also generates a movdqa from an esp-relative slot.
Does this mean I still have some inconsistencies in the produced RTL? Probably
I should somehow transform loads and stores?

I'd start by looking at the AP-SP elimination step: what the defined
stack alignment is, and whether or not a dynamic stack realignment is
needed.  If you don't have all that set up properly prior to the
allocators, then they're not going to know which objects to align nor
how to align them.


jeff
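
[Editor's sketch] For readers unfamiliar with dynamic stack realignment: on
x86 the stack grows downward, so a prologue can realign by rounding the stack
pointer down to a multiple of the wanted alignment (typically an and with a
negative power-of-two mask). The helper below shows only that arithmetic; the
name toy_realign_stack is hypothetical and it does not model how GCC decides
whether realignment is needed.

```c
#include <assert.h>
#include <stdint.h>

/* Round SP down to a multiple of ALIGN (a power of two).  Rounding down
   is the safe direction for a stack that grows toward lower addresses.  */
uintptr_t
toy_realign_stack (uintptr_t sp, uintptr_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return sp & ~(align - 1);
}
```

For instance, an incoming stack pointer of 0xffffd018 realigned to 16 bytes
becomes 0xffffd010, while an already aligned value is left unchanged.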



Re: [i386] Scalar DImode instructions on XMM registers

2015-05-25 Thread Ilya Enkovich
2015-05-22 15:01 GMT+03:00 Ilya Enkovich enkovich@gmail.com:
 2015-05-22 11:53 GMT+03:00 Ilya Enkovich enkovich@gmail.com:
 2015-05-21 22:08 GMT+03:00 Vladimir Makarov vmaka...@redhat.com:
 So, Ilya, to solve the problem you need to avoid sharing subregs for the
 correct LRA/reload work.



 Thanks a lot for your help! I'll fix it.

 Ilya

 I've fixed SUBREG sharing and got a missing spill. I added
 --enable-checking=rtl to check other possible bugs. Spill/fill code
 still seems incorrect because different sizes are used.  Shouldn't
 block me though.

 .L5:
 movl    16(%esp), %eax
 addl    $8, %esi
 movl    20(%esp), %edx
 movl    %eax, (%esp)
 movl    %edx, 4(%esp)
 call    counter@PLT
 movq    -8(%esi), %xmm0
 **movdqa  16(%esp), %xmm2**
 pand    %xmm0, %xmm2
 movdqa  %xmm2, %xmm0
 movd    %xmm2, %edx
 **movq    %xmm2, 16(%esp)**
 psrlq   $32, %xmm0
 movd    %xmm0, %eax
 orl     %edx, %eax
 jne     .L5

 Thanks,
 Ilya

I was wrong to assume that reloads with the wrong size wouldn't block me.
These reloads require memory to be aligned, which is not always true. Here is
what I have in RTL now:

(insn 2 7 3 2 (set (reg/v:DI 93 [ l ])
(mem/c:DI (reg/f:SI 16 argp) [1 l+0 S8 A32])) test.c:5 89
{*movdi_internal}
 (nil))
...
(insn 27 26 52 6 (set (subreg:V2DI (reg:DI 87 [ D.1822 ]) 0)
(ior:V2DI (subreg:V2DI (reg:DI 99 [ D.1822 ]) 0)
(subreg:V2DI (reg/v:DI 93 [ l ]) 0))) test.c:11 3489 {*iorv2di3}
 (expr_list:REG_DEAD (reg:DI 99 [ D.1822 ])
(expr_list:REG_DEAD (reg/v:DI 93 [ l ])
(nil))))

After reload I get:

(insn 2 7 75 2 (set (reg/v:DI 0 ax [orig:93 l ] [93])
(mem/c:DI (plus:SI (reg/f:SI 7 sp)
(const_int 24 [0x18])) [1 l+0 S8 A32])) test.c:5 89
{*movdi_internal}
 (nil))
(insn 75 2 3 2 (set (mem/c:DI (reg/f:SI 7 sp) [3 %sfp+-16 S8 A64])
(reg/v:DI 0 ax [orig:93 l ] [93])) test.c:5 89 {*movdi_internal}
 (nil))
...
(insn 27 26 52 6 (set (reg:V2DI 21 xmm0 [orig:87 D.1822 ] [87])
(ior:V2DI (reg:V2DI 21 xmm0 [orig:99 D.1822 ] [99])
(mem/c:V2DI (reg/f:SI 7 sp) [3 %sfp+-16 S16 A64])))
test.c:11 3489 {*iorv2di3}


The 'por' instruction requires its memory operand to be aligned and fails in
a bigger testcase. Reload also generates a movdqa from an esp-relative slot.
Does this mean I still have some inconsistencies in the produced RTL? Probably
I should somehow transform loads and stores?

Thanks,
Ilya


ira.log
Description: Binary data


pr65105.patch
Description: Binary data
extern long long arr[];

long long
test (long long l, int i1, int i2)
{
  switch (i2)
    {
    case 1:
      return l | arr[i1];
    case 8:
      return l | arr[i1] & arr[i2];
    }
  return l;
}


Re: [i386] Scalar DImode instructions on XMM registers

2015-05-22 Thread Ilya Enkovich
2015-05-21 22:08 GMT+03:00 Vladimir Makarov vmaka...@redhat.com:
 On 05/21/2015 05:54 AM, Ilya Enkovich wrote:

 Thanks.  For me it looks like an inheritance bug.  It is really hard
 to fix the bug w/o the source code.  Could you send me your patch in
 order I can debug RA with it to investigate more.
 

 Sure! Here is a patch and a testcase.  I applied patch to r222125.  Cmd to
 reproduce:

 gcc -m32 -msse4.2 -O2 pr65105.c -S -march=slm -fPIE

 The problem is in sharing a subreg in different insns.  Pseudos should be
 shared, but not their subregs.

 We have before inheritance:

28: r132:V2DI=r132:V2DI|r126:DI#0
   REG_DEAD r126:DI
   REG_DEAD r118:DI
 Inserting insn reload before:
81: r132:V2DI=r118:DI#0
 Inserting insn reload after:
82: r108:DI#0=r132:V2DI
 ...
   Creating newreg=135, assigning class SSE_REGS to r135
42: r135:V2DI=r135:V2DI&r108:DI#0
   REG_DEAD r127:DI
 Inserting insn reload before:
85: r135:V2DI=r127:DI#0
 Inserting insn reload after:
86: r108:DI#0=r135:V2DI

 As the subregs of r108 in the original insns 28 and 42 are shared, the
 subregs of r108 in insns 82 and 86 are shared too.  During the inheritance
 subpass we change r108 in insn 82 to r137.  This changes insn 86 too.

   Creating newreg=137 from oldreg=108, assigning class NO_REX_SSE_REGS
 to inheritance r137
 Original reg change 108->137 (bb2):
82: r137:DI#0=r132:V2DI
   REG_DEAD r132:V2DI
 Add original-inheritance after:
88: r108:DI=r137:DI

 Inheritance reuse change 108->137 (bb2):
68: r124:V2DI=r137:DI#0

 And now we are trying to do inheritance for insn #86:

  Creating newreg=138 from oldreg=108, assigning class NO_REX_SSE_REGS to
 inheritance r138
 Original reg change 108->138 (bb3):
86: r137:DI#0=r135:V2DI
   REG_DEAD r135:V2DI
 Add original-inheritance after:
89: r108:DI=r138:DI

 Inheritance reuse change 108->138 (bb3):
64: r123:V2DI=r137:DI#0

 and after that we have a complete mess.  We are trying to change r108 to
 r138, but r108 is already r137 because of sharing.  Later we undo the second
 inheritance, creating even more mess.

 So, Ilya, to solve the problem you need to avoid sharing subregs for the
 correct LRA/reload work.



Thanks a lot for your help! I'll fix it.

Ilya


Re: [i386] Scalar DImode instructions on XMM registers

2015-05-22 Thread Ilya Enkovich
2015-05-22 11:53 GMT+03:00 Ilya Enkovich enkovich@gmail.com:
 2015-05-21 22:08 GMT+03:00 Vladimir Makarov vmaka...@redhat.com:
 So, Ilya, to solve the problem you need to avoid sharing subregs for the
 correct LRA/reload work.



 Thanks a lot for your help! I'll fix it.

 Ilya

I've fixed SUBREG sharing and got a missing spill. I added
--enable-checking=rtl to check other possible bugs. Spill/fill code
still seems incorrect because different sizes are used.  Shouldn't
block me though.

.L5:
        movl    16(%esp), %eax
        addl    $8, %esi
        movl    20(%esp), %edx
        movl    %eax, (%esp)
        movl    %edx, 4(%esp)
        call    counter@PLT
        movq    -8(%esi), %xmm0
        **movdqa  16(%esp), %xmm2**
        pand    %xmm0, %xmm2
        movdqa  %xmm2, %xmm0
        movd    %xmm2, %edx
        **movq    %xmm2, 16(%esp)**
        psrlq   $32, %xmm0
        movd    %xmm0, %eax
        orl     %edx, %eax
        jne     .L5

Thanks,
Ilya
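
[Editor's sketch] The size mismatch highlighted in the assembly above (a
16-byte movdqa load paired with an 8-byte movq store to the same slot) can be
mimicked in plain C. The toy_* helpers below are hypothetical and model only
the number of bytes transferred, not the alignment requirement of movdqa.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy 128-bit register: two 64-bit lanes, lane[0] being the low half.  */
struct toy_xmm { uint64_t lane[2]; };

/* Models "movq %xmm, mem": only the low 64 bits reach memory.  */
void
toy_store_q (unsigned char *slot, const struct toy_xmm *reg)
{
  assert (slot && reg);
  memcpy (slot, &reg->lane[0], 8);
}

/* Models "movdqa mem, %xmm": all 128 bits are read from memory.  */
void
toy_load_dq (struct toy_xmm *reg, const unsigned char *slot)
{
  assert (slot && reg);
  memcpy (reg->lane, slot, 16);
}
```

After toy_store_q followed by toy_load_dq on the same 16-byte buffer, the low
lane round-trips but the high lane holds whatever bytes were already in the
upper half of the slot. That is harmless as long as only the low DImode part
is used, which matches the expectation in this message that the mismatch need
not block progress; the alignment requirement is a separate issue.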


Re: [i386] Scalar DImode instructions on XMM registers

2015-05-21 Thread Jeff Law

On 05/21/2015 01:08 PM, Vladimir Makarov wrote:

On 05/21/2015 05:54 AM, Ilya Enkovich wrote:

Thanks.  For me it looks like an inheritance bug.  It is really hard
to fix the bug w/o the source code.  Could you send me your patch in
order I can debug RA with it to investigate more.


Sure! Here is a patch and a testcase.  I applied patch to r222125.
Cmd to reproduce:

gcc -m32 -msse4.2 -O2 pr65105.c -S -march=slm -fPIE

The problem is in sharing a subreg in different insns.  Pseudo should be
shared but not their subregs.

[ ... ]


So, Ilya, to solve the problem you need to avoid sharing subregs for the
correct LRA/reload work.

If their code is sharing subregs, then most definitely that code is
wrong.  GCC has very well defined rtx sharing rules, which are described in
the developer documentation.


jeff


Re: [i386] Scalar DImode instructions on XMM registers

2015-05-21 Thread Jakub Jelinek
On Thu, May 21, 2015 at 02:23:47PM -0600, Jeff Law wrote:
 On 05/21/2015 01:08 PM, Vladimir Makarov wrote:
 On 05/21/2015 05:54 AM, Ilya Enkovich wrote:
 Thanks.  For me it looks like an inheritance bug.  It is really hard
 to fix the bug w/o the source code.  Could you send me your patch in
 order I can debug RA with it to investigate more.
 
 Sure! Here is a patch and a testcase.  I applied patch to r222125.
 Cmd to reproduce:
 
 gcc -m32 -msse4.2 -O2 pr65105.c -S -march=slm -fPIE
 The problem is in sharing a subreg in different insns.  Pseudo should be
 shared but not their subregs.
 [ ... ]
 
 So, Ilya, to solve the problem you need to avoid sharing subregs for the
 correct LRA/reload work.
 If their code is sharing subregs, then most definitely that code is wrong.
 GCC has very well defined rtx sharing rules that are defined in the
 developer documentation.

Shouldn't --enable-checking=rtl catch such bugs?

Jakub


Re: [i386] Scalar DImode instructions on XMM registers

2015-05-21 Thread Ilya Enkovich
On 20 May 23:27, Vladimir Makarov wrote:
 
 
 On 20/05/15 04:17 AM, Ilya Enkovich wrote:
 On 19 May 11:22, Vladimir Makarov wrote:
 On 05/18/2015 08:13 AM, Ilya Enkovich wrote:
 2015-05-06 17:18 GMT+03:00 Ilya Enkovich enkovich@gmail.com:
 Hi Vladimir,
 
 Could you please comment on this?
 
 
 Ilya, I think that the idea is worth trying but results might be
 mixed.  It is hard to say until you actually try it (as an example, Jan
 implemented -mfpmath=both and it looks like a pretty good idea, at least
 to me, but when I checked SPEC2000 the results were not so good even
 with IRA/LRA).
 
 Long ago I did some experiments and found that spilling into SSE
 would be beneficial for Intel CPUs but not for AMD ones.  As I remember,
 I also found that storing several scalar values into one SSE reg and
 extracting them when you need to do some (fp) arithmetic would be
 beneficial for AMD but not for Intel CPUs.  In the literature the more
 general approach is called a bitwise register allocator.  Actually it
 would be a pretty big IRA/LRA project from which some targets might
 benefit.
 I suspect such things are not trivially done in IRA/LRA and want to make it 
 as an independent optimization because its application seems to be quite 
 narrow.
 Yes, that is true.  The complications and implementation complexity
 will be probably very high in this project and the positive results
 are not sure.  So the project might have a small value.
 
 As for the wrong code, it is hard for me to say anything w/o RA
 dumps.  If you send me the dump (-fira-verbose=16), i might say more
 what is going on.
 
 
 Here are some dumps from my reproducer.  The problematic register is r108.
 
 Thanks.  For me it looks like an inheritance bug.  It is really hard
 to fix the bug w/o the source code.  Could you send me your patch in
 order I can debug RA with it to investigate more.
 

Sure! Here is a patch and a testcase.  I applied patch to r222125.  Cmd to 
reproduce:

gcc -m32 -msse4.2 -O2 pr65105.c -S -march=slm -fPIE

Thanks,
Ilya
void
counter (long long l);

void
test (long long *arr)
{
  register unsigned long long tmp;

  tmp = arr[0] | arr[1] & arr[2];
  while (tmp)
    {
      counter (tmp);
      tmp = *(arr++) & tmp;
    }
}
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a607ef4..a9dbfea 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2554,6 +2554,789 @@ rest_of_handle_insert_vzeroupper (void)
   return 0;
 }
 
+static bool
+has_non_address_hard_reg (rtx_insn *insn)
+{
+  df_ref ref;
+  FOR_EACH_INSN_DEF (ref, insn)
+    if (HARD_REGISTER_P (DF_REF_REAL_REG (ref))
+        && !DF_REF_FLAGS_IS_SET (ref, DF_REF_MUST_CLOBBER))
+      return true;
+
+  FOR_EACH_INSN_USE (ref, insn)
+    if (!DF_REF_REG_MEM_P (ref) && HARD_REGISTER_P (DF_REF_REAL_REG (ref)))
+      return true;
+
+  return false;
+}
+
+static bool
+scalar_to_vector_candidate_p (rtx_insn *insn)
+{
+  rtx def_set = single_set (insn);
+
+  if (!def_set)
+    return false;
+
+  if (has_non_address_hard_reg (insn))
+    return false;
+
+  rtx src = SET_SRC (def_set);
+  rtx dst = SET_DEST (def_set);
+
+  /* We are interested in DImode -> V1DI promotion
+     only.  */
+  if (GET_MODE (src) != DImode
+      || GET_MODE (dst) != DImode)
+    return false;
+
+  if (!REG_P (dst) && !MEM_P (dst))
+    return false;
+
+  switch (GET_CODE (src))
+    {
+    case PLUS:
+    case MINUS:
+    case IOR:
+    case XOR:
+    case AND:
+      break;
+
+    default:
+      return false;
+    }
+
+  if (!REG_P (XEXP (src, 0)) && !MEM_P (XEXP (src, 0)))
+    return false;
+
+  if (!REG_P (XEXP (src, 1)) && !MEM_P (XEXP (src, 1)))
+    return false;
+
+  if (GET_MODE (XEXP (src, 0)) != DImode
+      || GET_MODE (XEXP (src, 1)) != DImode)
+    return false;
+
+  return true;
+}
+
+/* Remove regs having both convertible and
+   not convertible definitions.  */
+static void
+remove_non_convertible_regs (bitmap insns)
+{
+  bitmap_iterator bi;
+  unsigned id;
+  bitmap regs = BITMAP_ALLOC (NULL);
+
+  EXECUTE_IF_SET_IN_BITMAP (insns, 0, id, bi)
+    {
+      rtx def_set = single_set (DF_INSN_UID_GET (id)->insn);
+      rtx reg = SET_DEST (def_set);
+
+      if (!REG_P (reg) || bitmap_bit_p (regs, REGNO (reg)))
+        continue;
+
+      for (df_ref def = DF_REG_DEF_CHAIN (REGNO (reg));
+           def;
+           def = DF_REF_NEXT_REG (def))
+        {
+          if (!bitmap_bit_p (insns, DF_REF_INSN_UID (def)))
+            {
+              if (dump_file)
+                fprintf (dump_file,
+                         "r%d has non convertible definition in insn %d\n",
+                         REGNO (reg), DF_REF_INSN_UID (def));
+
+              bitmap_set_bit (regs, REGNO (reg));
+              break;
+            }
+        }
+    }
+
+  EXECUTE_IF_SET_IN_BITMAP (regs, 0, id, bi)
+    {
+      for (df_ref def = DF_REG_DEF_CHAIN (id);
+           def;
+           def = DF_REF_NEXT_REG (def))
+        if (bitmap_bit_p (insns, DF_REF_INSN_UID (def)))
+          {
+   if (dump_file)
+ 

Re: [i386] Scalar DImode instructions on XMM registers

2015-05-21 Thread Vladimir Makarov

On 05/21/2015 05:54 AM, Ilya Enkovich wrote:

Thanks.  For me it looks like an inheritance bug.  It is really hard
to fix the bug w/o the source code.  Could you send me your patch in
order I can debug RA with it to investigate more.


Sure! Here is a patch and a testcase.  I applied patch to r222125.  Cmd to 
reproduce:

gcc -m32 -msse4.2 -O2 pr65105.c -S -march=slm -fPIE

The problem is in sharing a subreg in different insns.  Pseudos should be
shared, but not their subregs.


We have before inheritance:

   28: r132:V2DI=r132:V2DI|r126:DI#0
  REG_DEAD r126:DI
  REG_DEAD r118:DI
Inserting insn reload before:
   81: r132:V2DI=r118:DI#0
Inserting insn reload after:
   82: r108:DI#0=r132:V2DI
...
  Creating newreg=135, assigning class SSE_REGS to r135
   42: r135:V2DI=r135:V2DI&r108:DI#0
  REG_DEAD r127:DI
Inserting insn reload before:
   85: r135:V2DI=r127:DI#0
Inserting insn reload after:
   86: r108:DI#0=r135:V2DI

As the subregs of r108 in the original insns 28 and 42 are shared, the
subregs of r108 in insns 82 and 86 are shared too.  During the inheritance
subpass we change r108 in insn 82 to r137.  This changes insn 86 too.


  Creating newreg=137 from oldreg=108, assigning class NO_REX_SSE_REGS to inheritance r137

Original reg change 108->137 (bb2):
   82: r137:DI#0=r132:V2DI
  REG_DEAD r132:V2DI
Add original-inheritance after:
   88: r108:DI=r137:DI

Inheritance reuse change 108->137 (bb2):
   68: r124:V2DI=r137:DI#0

And now we are trying to do inheritance for insn #86:

 Creating newreg=138 from oldreg=108, assigning class NO_REX_SSE_REGS to inheritance r138

Original reg change 108->138 (bb3):
   86: r137:DI#0=r135:V2DI
  REG_DEAD r135:V2DI
Add original-inheritance after:
   89: r108:DI=r138:DI

Inheritance reuse change 108->138 (bb3):
   64: r123:V2DI=r137:DI#0

and after that we have a complete mess.  We are trying to change r108
to r138, but r108 is already r137 because of sharing.  Later we undo
the second inheritance, creating even more mess.


So, Ilya, to solve the problem you need to avoid sharing subregs for the 
correct LRA/reload work.
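
[Editor's sketch] The failure mode described above is ordinary pointer
aliasing: two insns hold the same subreg node, so an in-place register
substitution made for one insn silently rewrites the other. The toy model
below uses made-up types (toy_subreg, toy_insn) rather than real rtx
structures. In GCC itself the usual way to obtain an unshared operand is
copy_rtx; whether the eventual fix used it is not stated in this thread.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy expression node standing in for (subreg:MODE (reg N) 0).  */
struct toy_subreg { int regno; };

/* Toy insn that refers to its destination operand through a pointer.  */
struct toy_insn { int uid; struct toy_subreg *dest; };

/* Substitute the register inside INSN's destination in place, as an
   inheritance-style rewrite would.  */
void
toy_change_reg (struct toy_insn *insn, int new_regno)
{
  assert (insn && insn->dest);
  insn->dest->regno = new_regno;
}

/* Return a fresh copy of NODE so that an insn can own its operand.
   The caller is responsible for freeing the result.  */
struct toy_subreg *
toy_copy_subreg (const struct toy_subreg *node)
{
  struct toy_subreg *copy = malloc (sizeof *copy);
  assert (copy);
  *copy = *node;
  return copy;
}
```

If toy insns 82 and 86 both point at one node for r108, changing insn 82 to
r137 makes insn 86 read r137 as well, which mirrors the dump above. When
insn 86 is given its own copy first, it keeps r108 and can later be changed
to r138 independently.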





Re: [i386] Scalar DImode instructions on XMM registers

2015-05-20 Thread Ilya Enkovich
On 19 May 11:22, Vladimir Makarov wrote:
 On 05/18/2015 08:13 AM, Ilya Enkovich wrote:
 2015-05-06 17:18 GMT+03:00 Ilya Enkovich enkovich@gmail.com:
 Hi Vladimir,
 
 Could you please comment on this?
 
 
 Ilya, I think that the idea is worth trying but results might be
 mixed.  It is hard to say until you actually try it (as an example, Jan
 implemented -mfpmath=both and it looks like a pretty good idea, at least
 to me, but when I checked SPEC2000 the results were not so good even
 with IRA/LRA).
 
 Long ago I did some experiments and found that spilling into SSE
 would be beneficial for Intel CPUs but not for AMD ones.  As I remember,
 I also found that storing several scalar values into one SSE reg and
 extracting them when you need to do some (fp) arithmetic would be
 beneficial for AMD but not for Intel CPUs.  In the literature the more
 general approach is called a bitwise register allocator.  Actually it
 would be a pretty big IRA/LRA project from which some targets might
 benefit.

I suspect such things are not trivially done in IRA/LRA and want to make it as 
an independent optimization because its application seems to be quite narrow.

 
 
 As for the wrong code, it is hard for me to say anything w/o RA
 dumps.  If you send me the dump (-fira-verbose=16), i might say more
 what is going on.
 
 

Here are some dumps from my reproducer.  The problematic register is r108.

Thanks,
Ilya

;; Function test (test, funcdef_no=0, decl_uid=1933, cgraph_uid=0, 
symbol_order=0)

scanning new insn with uid = 79.
starting the processing of deferred insns
ending the processing of deferred insns
df_analyze called
df_worklist_dataflow_doublequeue:n_basic_blocks 5 n_edges 6 count 5 (1)
starting the processing of deferred insns
ending the processing of deferred insns
df_analyze called
Reg 119: local to bb 2 def dominates all uses has unique first use
Reg 125 uninteresting
Reg 118: local to bb 2 def dominates all uses has unique first use
Reg 126 uninteresting
Reg 127 uninteresting
Found def insn 26 for 119 to be not moveable
;; 2 loops found
;;
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 2 3 4
;;
;; Loop 1
;;  header 3, latch 3
;;  depth 1, outer 0
;;  nodes: 3
;; 2 succs { 3 4 }
;; 3 succs { 3 4 }
;; 4 succs { 1 }
starting the processing of deferred insns
ending the processing of deferred insns
df_analyze called
init_insns for 117: (insn_list:REG_DEP_TRUE 22 (nil))


test

Dataflow summary:
;;  invalidated by call  0 [ax] 1 [dx] 2 [cx] 8 [st] 9 [st(1)] 10 
[st(2)] 11 [st(3)] 12 [st(4)] 13 [st(5)] 14 [st(6)] 15 [st(7)] 17 [flags] 18 
[fpsr] 19 [fpcr] 21 [xmm0] 22 [xmm1] 23 [xmm2] 24 [xmm3] 25 [xmm4] 26 [xmm5] 27 
[xmm6] 28 [xmm7] 29 [mm0] 30 [mm1] 31 [mm2] 32 [mm3] 33 [mm4] 34 [mm5] 35 [mm6] 
36 [mm7] 37 [] 38 [] 39 [] 40 [] 41 [] 42 [] 43 [] 44 [] 45 [] 46 [] 47 [] 48 
[] 49 [] 50 [] 51 [] 52 [] 53 [] 54 [] 55 [] 56 [] 57 [] 58 [] 59 [] 60 [] 61 
[] 62 [] 63 [] 64 [] 65 [] 66 [] 67 [] 68 [] 69 [] 70 [] 71 [] 72 [] 73 [] 74 
[] 75 [] 76 [] 77 [] 78 [] 79 [] 80 []
;;  hardware regs used   7 [sp] 16 [argp] 20 [frame]
;;  regular block artificial uses6 [bp] 7 [sp] 16 [argp] 20 [frame]
;;  eh block artificial uses 6 [bp] 7 [sp] 16 [argp] 20 [frame]
;;  entry block defs 0 [ax] 1 [dx] 2 [cx] 6 [bp] 7 [sp] 16 [argp] 20 
[frame] 21 [xmm0] 22 [xmm1] 23 [xmm2] 29 [mm0] 30 [mm1] 31 [mm2]
;;  exit block uses  6 [bp] 7 [sp] 20 [frame]
;;  regs ever live   3[bx] 7[sp] 17[flags]
;;  ref usage   r0={2d} r1={2d} r2={2d} r3={1d,1u} r6={1d,4u} r7={1d,7u} 
r8={1d} r9={1d} r10={1d} r11={1d} r12={1d} r13={1d} r14={1d} r15={1d} 
r16={1d,4u,1e} r17={5d,2u} r18={1d} r19={1d} r20={1d,4u} r21={2d} r22={2d} 
r23={2d} r24={1d} r25={1d} r26={1d} r27={1d} r28={1d} r29={2d} r30={2d} 
r31={2d} r32={1d} r33={1d} r34={1d} r35={1d} r36={1d} r37={1d} r38={1d} 
r39={1d} r40={1d} r41={1d} r42={1d} r43={1d} r44={1d} r45={1d} r46={1d} 
r47={1d} r48={1d} r49={1d} r50={1d} r51={1d} r52={1d} r53={1d} r54={1d} 
r55={1d} r56={1d} r57={1d} r58={1d} r59={1d} r60={1d} r61={1d} r62={1d} 
r63={1d} r64={1d} r65={1d} r66={1d} r67={1d} r68={1d} r69={1d} r70={1d} 
r71={1d} r72={1d} r73={1d} r74={1d} r75={1d} r76={1d} r77={1d} r78={1d} 
r79={1d} r80={1d} r107={1d,1u} r108={2d,4u} r117={2d,5u,2e} r118={1d,1u} 
r119={1d,1u} r123={2d,3u} r124={2d,3u} r125={1d,1u} r126={1d,1u} r127={1d,1u} 
r128={2d,2u} r129={2d,2u} 
;;total ref usage 160{110d,47u,3e} in 25{24 regular + 1 call} insns.
(note 21 0 24 NOTE_INSN_DELETED)
(note 24 21 79 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn/f 79 24 22 2 (parallel [
(set (reg:SI 107)
(unspec:SI [
(const_int 0 [0])
] UNSPEC_SET_GOT))
(clobber (reg:CC 17 flags))
]) 694 {set_got}
 (expr_list:REG_UNUSED (reg:CC 17 flags)
(expr_list:REG_EQUIV (unspec:SI [
(const_int 0 [0])
] UNSPEC_SET_GOT)
(expr_list:REG_CFA_FLUSH_QUEUE (nil)
(nil)
(insn 

Re: [i386] Scalar DImode instructions on XMM registers

2015-05-20 Thread Vladimir Makarov



On 20/05/15 04:17 AM, Ilya Enkovich wrote:

On 19 May 11:22, Vladimir Makarov wrote:

On 05/18/2015 08:13 AM, Ilya Enkovich wrote:

2015-05-06 17:18 GMT+03:00 Ilya Enkovich enkovich@gmail.com:
Hi Vladimir,

Could you please comment on this?



Ilya, I think that the idea is worth trying but results might be
mixed.  It is hard to say until you actually try it (as an example, Jan
implemented -mfpmath=both and it looks like a pretty good idea, at least
to me, but when I checked SPEC2000 the results were not so good even
with IRA/LRA).

Long ago I did some experiments and found that spilling into SSE
would be beneficial for Intel CPUs but not for AMD ones.  As I remember,
I also found that storing several scalar values into one SSE reg and
extracting them when you need to do some (fp) arithmetic would be
beneficial for AMD but not for Intel CPUs.  In the literature the more
general approach is called a bitwise register allocator.  Actually it
would be a pretty big IRA/LRA project from which some targets might
benefit.

I suspect such things are not trivially done in IRA/LRA, and I want to make it
an independent optimization because its application seems to be quite narrow.

Yes, that is true.  The complications and implementation complexity will
probably be very high in this project and positive results are not
certain.  So the project might be of little value.


As for the wrong code, it is hard for me to say anything w/o RA
dumps.  If you send me the dump (-fira-verbose=16), I might be able to say
more about what is going on.



Here are some dumps from my reproducer.  The problematic register is r108.

Thanks.  To me it looks like an inheritance bug.  It is really hard to
fix the bug w/o the source code.  Could you send me your patch so that
I can debug RA with it and investigate more?




Re: [i386] Scalar DImode instructions on XMM registers

2015-05-19 Thread Vladimir Makarov

On 05/18/2015 08:13 AM, Ilya Enkovich wrote:

2015-05-06 17:18 GMT+03:00 Ilya Enkovich enkovich@gmail.com:

2015-04-25 4:32 GMT+03:00 Jan Hubicka hubi...@ucw.cz:

Hi,
I am adding Vladimir and Richard into CC. I tried to solve a similar problem
with FP math years ago by having -mfpmath=sse,387. The idea was to allow
use of i387 registers when SSE ones run out and possibly also model the fact
that Pentium4 had faster i387 additions than SSE additions. I also had some
plans to extend this to mixed SSE/MMX/GPR integer arithmetic, but never
got to that.

This did not really fly because the regalloc was not really able to
understand it (I made a patch to regclass to propagate the classes and figure
out which operations need to stay in i387 and which in SSE to avoid reloading,
but that never got in).

I believe Vladimir did some work on this with IRA (he is able to spill GPR
regs into SSE and do a bit of other tricks).

Also I believe it was kind of Richard's design decision to avoid use of
(paradoxical) subregs for vector conversions because these have funny
implications.

The code for handling upper parts of paradoxical subregs is controlled by
macros around SUBREG_PROMOTED_VAR_P but I do not think it will handle
V1DI->V2DI conversions fluently without some middle-end hacking. (it will
probably try to produce zero extensions)

When we are on SSE instructions, it would be great to finally teach
copy_by_pieces/store_by_pieces to use vector instructions (these are more
compact and either equally fast or faster on some CPUs). I hope to get into
this, but it would be great if someone beat me to it.

Honza


I'm trying to implement it as a separate RTL pass which chooses a
scalar/vector mode for each 64-bit computation chain and performs the
transformation if we choose to use vectors. I also want to split DI
instructions which are going to be implemented on GPRs before RA
(currently it is done in the second split). A good metric for such a
transformation is a big question, but currently I can't even make it
generate correct code when paradoxical subregs are used. It works in
simple cases but I run into trouble when spills appear.

Trying to beat the following testcase:

test (long long *arr)
{
   register unsigned long long tmp;
   tmp = arr[0] | arr[1] & arr[2];
   while (tmp)
     {
       counter (tmp);
       tmp = *(arr++) & tmp;
     }
}
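
[Editor's sketch] To check the expected behaviour of this testcase
independently of the compiler pass, it can be paired with a trivial counter
that records its calls. Everything outside the loop body is an assumption
added for illustration: the function is renamed pr65105_test, and
toy_calls/toy_last are hypothetical bookkeeping variables. The input array
must eventually mask tmp to zero, otherwise the loop would run past the end
of the data.

```c
#include <assert.h>

/* Hypothetical bookkeeping for the stand-in counter below.  */
int toy_calls;
long long toy_last;

/* Stand-in for the external counter () used by the testcase.  */
void
counter (long long l)
{
  assert (l != 0);   /* the loop only calls counter with a nonzero tmp */
  toy_calls++;
  toy_last = l;
}

/* The testcase from the message, renamed to avoid clashing with other
   symbols called "test".  */
void
pr65105_test (long long *arr)
{
  register unsigned long long tmp;

  tmp = arr[0] | arr[1] & arr[2];
  while (tmp)
    {
      counter (tmp);
      tmp = *(arr++) & tmp;
    }
}
```

With the input { 0xF0, 0xFF, 0x0F, 0 } the initial value is 0xFF, and the
loop calls counter three times (with 0xFF, 0xF0 and 0xF0) before the mask
with 0x0F clears tmp. A correctly compiled vectorized version should produce
the same sequence of calls.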

RTL I generate seems OK to me (ignoring the fact that it is not optimal):

(insn 6 3 50 2 (set (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ])
 (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
 (const_int 8 [0x8])) [2 MEM[(long long int *)arr_5(D)
+ 8B]+0 S8 A64])) pr65105-1.c:22 89 {*movdi_internal}
  (nil))
(insn 50 6 7 2 (set (reg:DI 104)
 (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
 (const_int 16 [0x10])) [2 MEM[(long long int
*)arr_5(D) + 16B]+0 S8 A64])) pr65105-1.c:22 -1
  (nil))
(insn 7 50 51 2 (set (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
 (and:V2DI (subreg:V2DI (reg:DI 98 [ MEM[(long long int
*)arr_5(D) + 8B] ]) 0)
 (subreg:V2DI (reg:DI 104) 0))) pr65105-1.c:22 3487 {*andv2di3}
  (expr_list:REG_DEAD (subreg:V2DI (reg:DI 98 [ MEM[(long long int
*)arr_5(D) + 8B] ]) 0)
 (expr_list:REG_UNUSED (reg:CC 17 flags)
 (expr_list:REG_EQUAL (and:DI (mem:DI (plus:SI (reg/v/f:SI
96 [ arr ])
 (const_int 8 [0x8])) [2 MEM[(long long int
*)arr_5(D) + 8B]+0 S8 A64])
 (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
 (const_int 16 [0x10])) [2 MEM[(long long
int *)arr_5(D) + 16B]+0 S8 A64]))
 (nil)
(insn 51 7 8 2 (set (reg:DI 105)
 (mem:DI (reg/v/f:SI 96 [ arr ]) [2 *arr_5(D)+0 S8 A64]))
pr65105-1.c:22 -1
  (nil))
(insn 8 51 46 2 (set (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
 (ior:V2DI (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
 (subreg:V2DI (reg:DI 105) 0))) pr65105-1.c:22 3489 {*iorv2di3}
  (expr_list:REG_DEAD (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
 (expr_list:REG_UNUSED (reg:CC 17 flags)
 (nil
(insn 46 8 47 2 (set (reg:V2DI 103)
 (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)) pr65105-1.c:22 -1
  (nil))
(insn 47 46 48 2 (set (subreg:SI (reg:DI 101) 0)
 (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
  (nil))
(insn 48 47 49 2 (set (reg:V2DI 103)
 (lshiftrt:V2DI (reg:V2DI 103)
 (const_int 32 [0x20]))) pr65105-1.c:22 -1
  (nil))
(insn 49 48 9 2 (set (subreg:SI (reg:DI 101) 4)
 (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
  (nil))
(note 9 49 10 2 NOTE_INSN_DELETED)
(insn 10 9 11 2 (parallel [
 (set (reg:CCZ 17 flags)
 (compare:CCZ (ior:SI (subreg:SI (reg:DI 101) 4)
 (subreg:SI (reg:DI 101) 0))
 (const_int 0 [0])))
 (clobber (scratch:SI))
 ]) pr65105-1.c:23 447 {*iorsi_3}
  (nil))
(jump_insn 11 10 37 2 (set (pc)
 (if_then_else 

Re: [i386] Scalar DImode instructions on XMM registers

2015-05-18 Thread Ilya Enkovich
2015-05-06 17:18 GMT+03:00 Ilya Enkovich enkovich@gmail.com:
 2015-04-25 4:32 GMT+03:00 Jan Hubicka hubi...@ucw.cz:
 Hi,
 I am adding Vladimir and Richard to CC. I tried to solve a similar problem
 with FP math years ago by adding -mfpmath=sse,i387. The idea was to allow
 use of i387 registers when the SSE ones run out and possibly also to model
 the fact that Pentium 4 had faster i387 additions than SSE additions. I also
 had some plans to extend this to mixed SSE/MMX/GPR integer arithmetic, but
 never got to that.

 This did not really fly because the register allocator was not really able
 to understand it (I made a patch to regclass to propagate the classes and
 figure out which operations need to stay in i387 and which in SSE to avoid
 reloading, but that never got in).

 I believe Vladimir did some work on this with IRA (he is able to spill GPR
 regs into SSE and do a bit of other tricks).

 Also I believe it was kind of Richard's design decision to avoid the use of
 (paradoxical) subregs for vector conversions because these have funny
 implications.

 The code for handling upper parts of paradoxical subregs is controlled by
 macros around SUBREG_PROMOTED_VAR_P, but I do not think it will handle
 V1DI->V2DI conversions fluently without some middle-end hacking (it will
 probably try to produce zero extensions).

 When we are on SSE instructions, it would be great to finally teach
 copy_by_pieces/store_by_pieces to use vector instructions (these are more
 compact and either equally fast or faster on some CPUs). I hope to get into
 this, but it would be great if someone beat me to it.

 Honza


 I'm trying to implement it as a separate RTL pass which chooses a
 scalar/vector mode for each 64-bit computation chain and performs the
 transformation if we choose to use vectors. I also want to split DI
 instructions which are going to be implemented on GPRs before RA
 (currently it is done in the second split pass). Choosing good metrics
 for such a transformation is an open question, but currently I can't
 even make it generate correct code when paradoxical subregs are used.
 It works in simple cases but I run into trouble when spills appear.

 Trying to beat the following testcase:

 test (long long *arr)
 {
   register unsigned long long tmp;
   tmp = arr[0] | arr[1] & arr[2];
   while (tmp)
     {
       counter (tmp);
       tmp = *(arr++) & tmp;
     }
 }

 RTL I generate seems OK to me (ignoring the fact that it is not optimal):

 (insn 6 3 50 2 (set (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ])
 (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
 (const_int 8 [0x8])) [2 MEM[(long long int *)arr_5(D)
 + 8B]+0 S8 A64])) pr65105-1.c:22 89 {*movdi_internal}
  (nil))
 (insn 50 6 7 2 (set (reg:DI 104)
 (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
 (const_int 16 [0x10])) [2 MEM[(long long int
 *)arr_5(D) + 16B]+0 S8 A64])) pr65105-1.c:22 -1
  (nil))
 (insn 7 50 51 2 (set (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
 (and:V2DI (subreg:V2DI (reg:DI 98 [ MEM[(long long int
 *)arr_5(D) + 8B] ]) 0)
 (subreg:V2DI (reg:DI 104) 0))) pr65105-1.c:22 3487 {*andv2di3}
  (expr_list:REG_DEAD (subreg:V2DI (reg:DI 98 [ MEM[(long long int
 *)arr_5(D) + 8B] ]) 0)
 (expr_list:REG_UNUSED (reg:CC 17 flags)
 (expr_list:REG_EQUAL (and:DI (mem:DI (plus:SI (reg/v/f:SI
 96 [ arr ])
 (const_int 8 [0x8])) [2 MEM[(long long int
 *)arr_5(D) + 8B]+0 S8 A64])
 (mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
 (const_int 16 [0x10])) [2 MEM[(long long
 int *)arr_5(D) + 16B]+0 S8 A64]))
 (nil)
 (insn 51 7 8 2 (set (reg:DI 105)
 (mem:DI (reg/v/f:SI 96 [ arr ]) [2 *arr_5(D)+0 S8 A64]))
 pr65105-1.c:22 -1
  (nil))
 (insn 8 51 46 2 (set (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
 (ior:V2DI (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
 (subreg:V2DI (reg:DI 105) 0))) pr65105-1.c:22 3489 {*iorv2di3}
  (expr_list:REG_DEAD (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
 (expr_list:REG_UNUSED (reg:CC 17 flags)
 (nil
 (insn 46 8 47 2 (set (reg:V2DI 103)
 (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)) pr65105-1.c:22 -1
  (nil))
 (insn 47 46 48 2 (set (subreg:SI (reg:DI 101) 0)
 (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
  (nil))
 (insn 48 47 49 2 (set (reg:V2DI 103)
 (lshiftrt:V2DI (reg:V2DI 103)
 (const_int 32 [0x20]))) pr65105-1.c:22 -1
  (nil))
 (insn 49 48 9 2 (set (subreg:SI (reg:DI 101) 4)
 (subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
  (nil))
 (note 9 49 10 2 NOTE_INSN_DELETED)
 (insn 10 9 11 2 (parallel [
 (set (reg:CCZ 17 flags)
 (compare:CCZ (ior:SI (subreg:SI (reg:DI 101) 4)
 (subreg:SI (reg:DI 101) 0))
 (const_int 0 [0])))
 (clobber (scratch:SI))
 ]) pr65105-1.c:23 447 {*iorsi_3}
  (nil))
 (jump_insn 11 10 37 2 (set (pc)
 

Re: [i386] Scalar DImode instructions on XMM registers

2015-05-07 Thread Richard Henderson
On 05/07/2015 10:59 AM, Uros Bizjak wrote:
 If we consider SSE operations as DImode operations, we will lose the
 ability to precisely specify which operation (SSE vs. general reg) we
 want. I'm afraid that in DImode case, combine will choose FLAG-less
 pattern that will mandate moves from general regs to SSE regs and
 back. This was the reason to invent V1DImode/V1TImode vectors to
 avoid moving double-mode values to MMX/SSE regs for double-mode
 shifts.

It would of course have to be a combined pattern.

The problem being addressed by V1TImode is that SSE doesn't really support
TImode arithmetic.  We've got some logical operations and restricted shifting,
but no addition, multiplication, or fully general shifting.

The problem being addressed by V1DImode is MMX, about which I believe I need
say nothing more, and the fact that lower-subreg produces better results than
the current RA.


r~


Re: [i386] Scalar DImode instructions on XMM registers

2015-05-07 Thread Richard Henderson
On 05/07/2015 09:24 AM, Richard Henderson wrote:
 I was wondering this morning about the possibility of a kind of constraint
 that would allow RA to generate pairs of registers via CONCAT.  That is, the
 two hard registers within the CONCAT are collectively the double-word
 allocation, but need not be sequential like current multi-word allocations.
 A target using such a constraint is promising to handle the CONCAT either by
 splitting (and gen_lowpart et al), or print_operand letters (e.g. the m68k
 %R, for outputting the low part of a pair).

 With that, we get the best of both -- lower-subreg effectively happening in
 RA, and DImode arithmetic in SSE, no subregs required.

I forgot one issue that lower-subreg also cures -- describing the lifetime of
the pair of registers.  We wouldn't get that with a single bit saying that
CONCAT is ok.

E.g.

di100 = di101 + di102

split to

(flags, si200) = si201 + si202
si300  = si301 + si302 + carry(flags)

If we split prior to RA, we can see that si200 cannot overlap si301 or si302.
If we split after RA, we have to handle this ourselves in the backend, leading
to additional matching-constraint alternatives and/or early-clobbers.
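
As a rough illustration of the overlap problem described above, here is a sketch in plain C. It is not GCC code: pointers stand in for 32-bit hard registers, and the names add_split, add_no_overlap, add_with_overlap and check_add_split are made up for this example. It only shows why the low output must not share a register with a high input when the low add is emitted first.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: each pointer stands for a 32-bit hard register.
   The double-word add is split into a low add that produces the carry
   and a high add that consumes it, as in the example above.  */
static void
add_split (uint32_t *out_lo, uint32_t *out_hi,
           const uint32_t *a_lo, const uint32_t *a_hi,
           const uint32_t *b_lo, const uint32_t *b_hi)
{
  uint32_t lo = *a_lo + *b_lo;
  uint32_t carry = lo < *a_lo;      /* unsigned wrap-around means carry out */

  *out_lo = lo;                     /* the low result is written first ...  */
  *out_hi = *a_hi + *b_hi + carry;  /* ... then the high inputs are read    */
}

/* All six "registers" are distinct: the result is the true 64-bit sum.  */
static uint64_t
add_no_overlap (uint64_t a, uint64_t b)
{
  uint32_t a_lo = (uint32_t) a, a_hi = (uint32_t) (a >> 32);
  uint32_t b_lo = (uint32_t) b, b_hi = (uint32_t) (b >> 32);
  uint32_t r_lo, r_hi;

  add_split (&r_lo, &r_hi, &a_lo, &a_hi, &b_lo, &b_hi);
  return ((uint64_t) r_hi << 32) | r_lo;
}

/* The low output shares a "register" with the high half of the first
   input (si200 overlapping si301 in the notation above).  */
static uint64_t
add_with_overlap (uint64_t a, uint64_t b)
{
  uint32_t a_lo = (uint32_t) a, a_hi = (uint32_t) (a >> 32);
  uint32_t b_lo = (uint32_t) b, b_hi = (uint32_t) (b >> 32);
  uint32_t r_hi;

  add_split (&a_hi, &r_hi, &a_lo, &a_hi, &b_lo, &b_hi);
  return ((uint64_t) r_hi << 32) | a_hi;   /* a_hi now holds the low result */
}

static void
check_add_split (void)
{
  uint64_t a = 0x00000001ffffffffULL, b = 0x0000000200000001ULL;

  assert (add_no_overlap (a, b) == a + b);
  /* The high input was clobbered before it was read.  */
  assert (add_with_overlap (a, b) != a + b);
}
```

Splitting before RA lets the allocator see this conflict directly; splitting after RA means the pattern itself has to rule out the second case, for example with an early-clobber on the low output.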

We'd need a couple of bits: one saying that concat is ok, the other saying
whether all lows are consumed before all highs, when allocating a set of
CONCATs across all of the operands.

Or perhaps we don't need such a bit and we merely include high inputs not
clobbered by low output as part of the contract with RA.



r~


Re: [i386] Scalar DImode instructions on XMM registers

2015-05-07 Thread Uros Bizjak
On Thu, May 7, 2015 at 6:24 PM, Richard Henderson r...@redhat.com wrote:
 On 04/24/2015 06:32 PM, Jan Hubicka wrote:
 Also I believe it was kind of Richard's design decision to avoid the use of
 (paradoxical) subregs for vector conversions because these have funny
 implications.

 Yes indeed.

 The code for handling upper parts of paradoxical subregs is controlled by
 macros around SUBREG_PROMOTED_VAR_P, but I do not think it will handle
 V1DI->V2DI conversions fluently without some middle-end hacking (it will
 probably try to produce zero extensions).

 When we are on SSE instructions, it would be great to finally teach
 copy_by_pieces/store_by_pieces to use vector instructions (these are more
 compact and either equally fast or faster on some CPUs). I hope to get into
 this, but it would be great if someone beat me to it.

 Well, I think it would be worthwhile to teach the i386 backend how to do
 64-bit vectors in SSE registers.  First, this would aid portability with
 other targets who may have GCC generic vectors written only for 8 byte
 quantities.  Since we do have zero-extending 8 byte load/store insns for
 SSE, we don't actually need paradoxical regs, just additional macro-ization
 of the existing patterns.

If we consider SSE operations as DImode operations, we will lose the
ability to precisely specify which operation (SSE vs. general reg) we
want. I'm afraid that in DImode case, combine will choose FLAG-less
pattern that will mandate moves from general regs to SSE regs and
back. This was the reason to invent V1DImode/V1TImode vectors to
avoid moving double-mode values to MMX/SSE regs for double-mode
shifts.

The alternative would be RA that is able to select between alternative
instructions, not only between alternative register classes.

Uros.


Re: [i386] Scalar DImode instructions on XMM registers

2015-05-07 Thread Richard Henderson
On 04/24/2015 06:32 PM, Jan Hubicka wrote:
 Also I believe it was kind of Richard's design decision to avoid the use of
 (paradoxical) subregs for vector conversions because these have funny
 implications.

Yes indeed.

 The code for handling upper parts of paradoxical subregs is controlled by
 macros around SUBREG_PROMOTED_VAR_P, but I do not think it will handle
 V1DI->V2DI conversions fluently without some middle-end hacking (it will
 probably try to produce zero extensions).

 When we are on SSE instructions, it would be great to finally teach
 copy_by_pieces/store_by_pieces to use vector instructions (these are more
 compact and either equally fast or faster on some CPUs). I hope to get into
 this, but it would be great if someone beat me to it.

Well, I think it would be worthwhile to teach the i386 backend how to do 64-bit
vectors in SSE registers.  First, this would aid portability with other targets
who may have GCC generic vectors written only for 8 byte quantities.  Since we
do have zero-extending 8 byte load/store insns for SSE, we don't actually need
paradoxical regs, just additional macro-ization of the existing patterns.
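
For readers unfamiliar with the "generic vectors written only for 8 byte quantities" mentioned above, a minimal source-level sketch follows. It uses the GCC vector_size extension; the type names v2si and v1di and the functions add_v2si, and_v1di and check_vec8 are chosen for this example only. How the i386 backend lowers such code (MMX, SSE, or GPRs) is exactly what is under discussion here, so the sketch says nothing about the generated instructions.

```c
#include <assert.h>
#include <stdint.h>

/* GCC generic vector types that are only 8 bytes wide.  */
typedef int32_t v2si __attribute__ ((vector_size (8)));   /* 2 x 32-bit */
typedef int64_t v1di __attribute__ ((vector_size (8)));   /* 1 x 64-bit */

/* Lane-wise add of two 8-byte vectors.  */
static v2si
add_v2si (v2si a, v2si b)
{
  return a + b;
}

/* A 64-bit logical AND expressed on a one-element vector.  */
static v1di
and_v1di (v1di a, v1di b)
{
  return a & b;
}

static void
check_vec8 (void)
{
  v2si x = { 1, 2 }, y = { 10, 20 };
  v2si s = add_v2si (x, y);
  assert (s[0] == 11 && s[1] == 22);

  v1di m = { 0x00ff00ff00ff00ffLL }, n = { 0x0f0f0f0f0f0f0f0fLL };
  v1di r = and_v1di (m, n);
  assert (r[0] == 0x000f000f000f000fLL);
}
```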

This almost certainly would conflict with the MMX code generation.  But given
the problems we've always had with that, perhaps it's time to kill that off.
To a large extent we can preserve source compatibility with MMX builtins once
we have 8-byte vectors implemented in SSE.

As for the subject, we'd want to delay expansion of DImode arithmetic until
after RA.  That bypasses all of the good work done in lower-subreg.c, so we
need some sort of replacement.

I was wondering this morning about the possibility of a kind of constraint that
would allow RA to generate pairs of registers via CONCAT.  That is, the two
hard registers within the CONCAT are collectively the double-word allocation,
but need not be sequential like current multi-word allocations.  A target using
such a constraint is promising to handle the CONCAT either by splitting (and
gen_lowpart et al), or print_operand letters (e.g. the m68k %R, for outputting
the low part of a pair).

With that, we get the best of both -- lower-subreg effectively happening in RA,
and DImode arithmetic in SSE, no subregs required.


r~


Re: [i386] Scalar DImode instructions on XMM registers

2015-05-06 Thread Ilya Enkovich
2015-04-25 4:32 GMT+03:00 Jan Hubicka hubi...@ucw.cz:
 Hi,
 I am adding Vladimir and Richard to CC. I tried to solve a similar problem
 with FP math years ago by adding -mfpmath=sse,i387. The idea was to allow
 use of i387 registers when the SSE ones run out and possibly also to model
 the fact that Pentium 4 had faster i387 additions than SSE additions. I also
 had some plans to extend this to mixed SSE/MMX/GPR integer arithmetic, but
 never got to that.

 This did not really fly because the register allocator was not really able
 to understand it (I made a patch to regclass to propagate the classes and
 figure out which operations need to stay in i387 and which in SSE to avoid
 reloading, but that never got in).

 I believe Vladimir did some work on this with IRA (he is able to spill GPR
 regs into SSE and do a bit of other tricks).

 Also I believe it was kind of Richard's design decision to avoid the use of
 (paradoxical) subregs for vector conversions because these have funny
 implications.

 The code for handling upper parts of paradoxical subregs is controlled by
 macros around SUBREG_PROMOTED_VAR_P, but I do not think it will handle
 V1DI->V2DI conversions fluently without some middle-end hacking (it will
 probably try to produce zero extensions).

 When we are on SSE instructions, it would be great to finally teach
 copy_by_pieces/store_by_pieces to use vector instructions (these are more
 compact and either equally fast or faster on some CPUs). I hope to get into
 this, but it would be great if someone beat me to it.

 Honza


I'm trying to implement it as a separate RTL pass which chooses a
scalar/vector mode for each 64-bit computation chain and performs the
transformation if we choose to use vectors. I also want to split DI
instructions which are going to be implemented on GPRs before RA
(currently it is done in the second split pass). Choosing good metrics
for such a transformation is an open question, but currently I can't
even make it generate correct code when paradoxical subregs are used.
It works in simple cases but I run into trouble when spills appear.

Trying to beat the following testcase:

test (long long *arr)
{
  register unsigned long long tmp;
  tmp = arr[0] | arr[1] & arr[2];
  while (tmp)
    {
      counter (tmp);
      tmp = *(arr++) & tmp;
    }
}

RTL I generate seems OK to me (ignoring the fact that it is not optimal):

(insn 6 3 50 2 (set (reg:DI 98 [ MEM[(long long int *)arr_5(D) + 8B] ])
(mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
(const_int 8 [0x8])) [2 MEM[(long long int *)arr_5(D)
+ 8B]+0 S8 A64])) pr65105-1.c:22 89 {*movdi_internal}
 (nil))
(insn 50 6 7 2 (set (reg:DI 104)
(mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
(const_int 16 [0x10])) [2 MEM[(long long int
*)arr_5(D) + 16B]+0 S8 A64])) pr65105-1.c:22 -1
 (nil))
(insn 7 50 51 2 (set (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
(and:V2DI (subreg:V2DI (reg:DI 98 [ MEM[(long long int
*)arr_5(D) + 8B] ]) 0)
(subreg:V2DI (reg:DI 104) 0))) pr65105-1.c:22 3487 {*andv2di3}
 (expr_list:REG_DEAD (subreg:V2DI (reg:DI 98 [ MEM[(long long int
*)arr_5(D) + 8B] ]) 0)
(expr_list:REG_UNUSED (reg:CC 17 flags)
(expr_list:REG_EQUAL (and:DI (mem:DI (plus:SI (reg/v/f:SI
96 [ arr ])
(const_int 8 [0x8])) [2 MEM[(long long int
*)arr_5(D) + 8B]+0 S8 A64])
(mem:DI (plus:SI (reg/v/f:SI 96 [ arr ])
(const_int 16 [0x10])) [2 MEM[(long long
int *)arr_5(D) + 16B]+0 S8 A64]))
(nil)
(insn 51 7 8 2 (set (reg:DI 105)
(mem:DI (reg/v/f:SI 96 [ arr ]) [2 *arr_5(D)+0 S8 A64]))
pr65105-1.c:22 -1
 (nil))
(insn 8 51 46 2 (set (subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)
(ior:V2DI (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
(subreg:V2DI (reg:DI 105) 0))) pr65105-1.c:22 3489 {*iorv2di3}
 (expr_list:REG_DEAD (subreg:V2DI (reg:DI 97 [ D.2586 ]) 0)
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil
(insn 46 8 47 2 (set (reg:V2DI 103)
(subreg:V2DI (reg/v:DI 87 [ tmp ]) 0)) pr65105-1.c:22 -1
 (nil))
(insn 47 46 48 2 (set (subreg:SI (reg:DI 101) 0)
(subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
 (nil))
(insn 48 47 49 2 (set (reg:V2DI 103)
(lshiftrt:V2DI (reg:V2DI 103)
(const_int 32 [0x20]))) pr65105-1.c:22 -1
 (nil))
(insn 49 48 9 2 (set (subreg:SI (reg:DI 101) 4)
(subreg:SI (reg:V2DI 103) 0)) pr65105-1.c:22 -1
 (nil))
(note 9 49 10 2 NOTE_INSN_DELETED)
(insn 10 9 11 2 (parallel [
(set (reg:CCZ 17 flags)
(compare:CCZ (ior:SI (subreg:SI (reg:DI 101) 4)
(subreg:SI (reg:DI 101) 0))
(const_int 0 [0])))
(clobber (scratch:SI))
]) pr65105-1.c:23 447 {*iorsi_3}
 (nil))
(jump_insn 11 10 37 2 (set (pc)
(if_then_else (ne (reg:CCZ 17 flags)
(const_int 0 [0]))
(label_ref:SI 37)
(pc))) pr65105-1.c:23 619 {*jcc_1}
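
For orientation, insns 46-49 and 10 in the dump above move the 64-bit value out of the vector register as two 32-bit halves and then test it for zero. The following plain C sketch models only that data flow on a single 64-bit lane; it is an illustration under that assumption, and the function names split_di, di_nonzero and check_split_di are invented for it.

```c
#include <assert.h>
#include <stdint.h>

/* Models insns 47-49 on one 64-bit lane: take the low 32 bits, shift
   the lane right by 32, then take the low 32 bits again.  */
static void
split_di (uint64_t lane, uint32_t *lo, uint32_t *hi)
{
  *lo = (uint32_t) lane;   /* insn 47: low SImode part of reg 103   */
  lane >>= 32;             /* insn 48: lshiftrt by 32               */
  *hi = (uint32_t) lane;   /* insn 49: low SImode part after shift  */
}

/* Models insn 10: the loop condition ORs both halves and compares the
   result with zero.  */
static int
di_nonzero (uint64_t lane)
{
  uint32_t lo, hi;

  split_di (lane, &lo, &hi);
  return (hi | lo) != 0;
}

static void
check_split_di (void)
{
  uint32_t lo, hi;

  split_di (0x1122334455667788ULL, &lo, &hi);
  assert (lo == 0x55667788u && hi == 0x11223344u);
  assert (!di_nonzero (0));
  assert (di_nonzero (1ULL << 40));
}
```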

Re: [i386] Scalar DImode instructions on XMM registers

2015-04-24 Thread Uros Bizjak
On Fri, Apr 24, 2015 at 11:45 AM, Uros Bizjak ubiz...@gmail.com wrote:
 On Fri, Apr 24, 2015 at 11:22 AM, Ilya Enkovich enkovich@gmail.com 
 wrote:

 I was looking into PR65105 and tried to generate SSE computation for a
 simple 64bit  a + b + c sequence. Having no scalar integer instructions in
 SSE I have to use vector variants.

 Is this approach really better than having two add/addc instructions?

FYI, V1DI mode was introduced because XMM shift insns were used to
shift DImode values. The cost of moves from/to integer DImode reg pair
was disastrous.

Uros.


Re: [i386] Scalar DImode instructions on XMM registers

2015-04-24 Thread Uros Bizjak
On Fri, Apr 24, 2015 at 11:22 AM, Ilya Enkovich enkovich@gmail.com wrote:

 I was looking into PR65105 and tried to generate SSE computation for a
 simple 64bit  a + b + c sequence. Having no scalar integer instructions in
 SSE I have to use vector variants.

Is this approach really better than having two add/addc instructions?

Uros.


Re: [i386] Scalar DImode instructions on XMM registers

2015-04-24 Thread Uros Bizjak
On Fri, Apr 24, 2015 at 12:14 PM, Uros Bizjak ubiz...@gmail.com wrote:

 I was looking into PR65105 and tried to generate SSE computation for a
 simple 64bit  a + b + c sequence. Having no scalar integer instructions in
 SSE I have to use vector variants.

 Is this approach really better than having two add/addc instructions?

 FYI, V1DI mode was introduced because XMM shift insns were used to
 shift DImode values. The cost of moves from/to integer DImode reg pair
 was disastrous.

 Uros.

 Does it mean I have to add V1DI instructions for all opcodes I want to
 transform (add,sub,mul,or,and, etc.)?

 No.

 Please try to generate paradoxical subreg (V2DImode subreg of V1DImode
 pseudo). IIRC, there is some functionality in the compiler that is
 able to tell if the highpart of the paradoxical register is zeroed.

You can probably even generate a paradoxical V2DImode subreg of DImode.
I'm not sure whether in this case the register allocator degenerates the
mode of the resulting hard register to DImode, but it is worth a try.

Uros.


Re: [i386] Scalar DImode instructions on XMM registers

2015-04-24 Thread Ilya Enkovich
2015-04-24 12:49 GMT+03:00 Uros Bizjak ubiz...@gmail.com:
 On Fri, Apr 24, 2015 at 11:45 AM, Uros Bizjak ubiz...@gmail.com wrote:
 On Fri, Apr 24, 2015 at 11:22 AM, Ilya Enkovich enkovich@gmail.com 
 wrote:

 I was looking into PR65105 and tried to generate SSE computation for a
 simple 64bit  a + b + c sequence. Having no scalar integer instructions in
 SSE I have to use vector variants.

 Is this approach really better than having two add/addc instructions?

 FYI, V1DI mode was introduced because XMM shift insns were used to
 shift DImode values. The cost of moves from/to integer DImode reg pair
 was disastrous.

 Uros.

Does it mean I have to add V1DI instructions for all opcodes I want to
transform (add,sub,mul,or,and, etc.)?

Ilya


Re: [i386] Scalar DImode instructions on XMM registers

2015-04-24 Thread Uros Bizjak
On Fri, Apr 24, 2015 at 12:09 PM, Ilya Enkovich enkovich@gmail.com wrote:

 I was looking into PR65105 and tried to generate SSE computation for a
 simple 64bit  a + b + c sequence. Having no scalar integer instructions in
 SSE I have to use vector variants.

 Is this approach really better than having two add/addc instructions?

 FYI, V1DI mode was introduced because XMM shift insns were used to
 shift DImode values. The cost of moves from/to integer DImode reg pair
 was disastrous.

 Uros.

 Does it mean I have to add V1DI instructions for all opcodes I want to
 transform (add,sub,mul,or,and, etc.)?

No.

Please try to generate paradoxical subreg (V2DImode subreg of V1DImode
pseudo). IIRC, there is some functionality in the compiler that is
able to tell if the highpart of the paradoxical register is zeroed.

Uros.


Re: [i386] Scalar DImode instructions on XMM registers

2015-04-24 Thread Marc Glisse

On Fri, 24 Apr 2015, Uros Bizjak wrote:


Please try to generate paradoxical subreg (V2DImode subreg of V1DImode
pseudo). IIRC, there is some functionality in the compiler that is
able to tell if the highpart of the paradoxical register is zeroed.


Those are not currently legal (I tried to change that)
https://gcc.gnu.org/ml/gcc-patches/2013-03/msg00745.html
https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00769.html

In this case, a subreg:V2DI of DImode should work.

--
Marc Glisse


Re: [i386] Scalar DImode instructions on XMM registers

2015-04-24 Thread Ilya Enkovich
2015-04-24 12:45 GMT+03:00 Uros Bizjak ubiz...@gmail.com:
 On Fri, Apr 24, 2015 at 11:22 AM, Ilya Enkovich enkovich@gmail.com 
 wrote:

 I was looking into PR65105 and tried to generate SSE computation for a
 simple 64bit  a + b + c sequence. Having no scalar integer instructions in
 SSE I have to use vector variants.

 Is this approach really better than having two add/addc instructions?

We surely shouldn't apply this to every DI instruction; we should compute
transformation costs first. It is profitable when few conversions are
required, it helps to relieve GPR pressure, and we expect it to be
profitable for mul. Performance tests will show whether this is useful. I
want to make a small prototype and try it.

Ilya


 Uros.


Re: [i386] Scalar DImode instructions on XMM registers

2015-04-24 Thread Ilya Enkovich
2015-04-24 13:27 GMT+03:00 Marc Glisse marc.gli...@inria.fr:
 On Fri, 24 Apr 2015, Uros Bizjak wrote:

 Please try to generate paradoxical subreg (V2DImode subreg of V1DImode
 pseudo). IIRC, there is some functionality in the compiler that is
 able to tell if the highpart of the paradoxical register is zeroed.


 Those are not currently legal (I tried to change that)
 https://gcc.gnu.org/ml/gcc-patches/2013-03/msg00745.html
 https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00769.html

 In this case, a subreg:V2DI of DImode should work.

 --
 Marc Glisse

Thank you for your tips! It seems to work; I will try it and see what it
gives us for i386.

Thanks,
Ilya


Re: [i386] Scalar DImode instructions on XMM registers

2015-04-24 Thread Jan Hubicka
Hi,
I am adding Vladimir and Richard to CC. I tried to solve a similar problem
with FP math years ago by adding -mfpmath=sse,i387. The idea was to allow
use of i387 registers when the SSE ones run out and possibly also to model
the fact that Pentium 4 had faster i387 additions than SSE additions. I also
had some plans to extend this to mixed SSE/MMX/GPR integer arithmetic, but
never got to that.

This did not really fly because the register allocator was not really able
to understand it (I made a patch to regclass to propagate the classes and
figure out which operations need to stay in i387 and which in SSE to avoid
reloading, but that never got in).

I believe Vladimir did some work on this with IRA (he is able to spill GPR
regs into SSE and do a bit of other tricks).

Also I believe it was kind of Richard's design decision to avoid the use of
(paradoxical) subregs for vector conversions because these have funny
implications.

The code for handling upper parts of paradoxical subregs is controlled by
macros around SUBREG_PROMOTED_VAR_P, but I do not think it will handle
V1DI->V2DI conversions fluently without some middle-end hacking (it will
probably try to produce zero extensions).

When we are on SSE instructions, it would be great to finally teach
copy_by_pieces/store_by_pieces to use vector instructions (these are more
compact and either equally fast or faster on some CPUs). I hope to get into
this, but it would be great if someone beat me to it.

Honza

 2015-04-24 13:27 GMT+03:00 Marc Glisse marc.gli...@inria.fr:
  On Fri, 24 Apr 2015, Uros Bizjak wrote:
 
  Please try to generate paradoxical subreg (V2DImode subreg of V1DImode
  pseudo). IIRC, there is some functionality in the compiler that is
  able to tell if the highpart of the paradoxical register is zeroed.
 
 
  Those are not currently legal (I tried to change that)
  https://gcc.gnu.org/ml/gcc-patches/2013-03/msg00745.html
  https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00769.html
 
  In this case, a subreg:V2DI of DImode should work.
 
  --
  Marc Glisse
 
 Thank you for your tips! It seems to work; I will try it and see what it
 gives us for i386.
 
 Thanks,
 Ilya