[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0

2016-10-21 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |ASSIGNED

--- Comment #13 from ktkachov at gcc dot gnu.org ---
.

[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0

2016-10-21 Thread edlinger at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041

Bernd Edlinger  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |wdijkstr at arm dot com

--- Comment #12 from Bernd Edlinger  ---
(In reply to Wilco from comment #11)
> (In reply to ktkachov from comment #10)
> > Confirmed then. Wilco, if you're working on this can you please assign it to
> > yourself?
> 
> Unfortunately the form doesn't allow me to do anything with the headers...

I know that happens.

[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0

2016-10-21 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041

--- Comment #11 from Wilco  ---
(In reply to ktkachov from comment #10)
> Confirmed then. Wilco, if you're working on this can you please assign it to
> yourself?

Unfortunately the form doesn't allow me to do anything with the headers...

[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0

2016-10-21 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2016-10-21
 CC||ktkachov at gcc dot gnu.org
 Ever confirmed|0   |1
  Known to fail||4.9.4, 5.4.1, 6.2.1, 7.0

--- Comment #10 from ktkachov at gcc dot gnu.org ---
Confirmed then. Wilco, if you're working on this can you please assign it to
yourself?

[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0

2016-10-20 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041

--- Comment #9 from Bernd Edlinger  ---
(In reply to Wilco from comment #8)
> 
> I've got a patch that fixes it, it's being tested.
> 
> While looking at how DI mode operations get expanded, I noticed there is a
> CQ issue with your shift change. Shifts that are expanded early now use
> extra registers due to the DI mode write of zero. Given all other DI mode
> operations are expanded after reload, it may be better to do the same for
> shifts too.

Interesting idea.  After reload there is no need anymore to zero the
DI mode register at all, so that could become completely unnecessary.

[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0

2016-10-20 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041

--- Comment #8 from Wilco  ---
(In reply to Bernd Edlinger from comment #7)
> (In reply to Richard Earnshaw from comment #6)
> > (In reply to Bernd Edlinger from comment #5)
> > > (In reply to Wilco from comment #4)
> > > > However dealing with partial overlaps is complex so maybe the best 
> > > > option
> > > > would be to add alternatives to di3_neon to either allow full 
> > > > overlap
> > > > "r 0 X X X" or no overlap " r X  X X". The shift code works with full
> > > > overlap.
> > > 
> > > That sounds like a good idea.
> > > 
> > > Then this condition in di3_neon could go away too:
> > > 
> > > && (!reg_overlap_mentioned_p (operands[0], operands[1])
> > > || REGNO (operands[0]) == REGNO (operands[1])))
> > 
> > Note that we don't want to restrict complete overlaps, only partial
> > overlaps.  Restricting complete overlaps leads to significant increase in
> > register pressure and a lot of redundant copying.
> 
> Yes.
> 
> That is Wilco's idea: instead of =r 0r X X X
> use =r 0 X X X and = r X X X, that should ensure that
> no partial overlap happens, just full overlap or nothing.
> 
> That's what arm_emit_coreregs_64bit_shift
> and arm_ashldi3_1bit can handle.
> 
> Who will do it?

I've got a patch that fixes it, it's being tested.

While looking at how DI mode operations get expanded, I noticed there is a CQ
issue with your shift change. Shifts that are expanded early now use extra
registers due to the DI mode write of zero. Given all other DI mode operations
are expanded after reload, it may be better to do the same for shifts too.

[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0

2016-10-20 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041

--- Comment #7 from Bernd Edlinger  ---
(In reply to Richard Earnshaw from comment #6)
> (In reply to Bernd Edlinger from comment #5)
> > (In reply to Wilco from comment #4)
> > > However dealing with partial overlaps is complex so maybe the best option
> > > would be to add alternatives to di3_neon to either allow full 
> > > overlap
> > > "r 0 X X X" or no overlap " r X  X X". The shift code works with full
> > > overlap.
> > 
> > That sounds like a good idea.
> > 
> > Then this condition in di3_neon could go away too:
> > 
> > && (!reg_overlap_mentioned_p (operands[0], operands[1])
> > || REGNO (operands[0]) == REGNO (operands[1])))
> 
> Note that we don't want to restrict complete overlaps, only partial
> overlaps.  Restricting complete overlaps leads to significant increase in
> register pressure and a lot of redundant copying.

Yes.

That is Wilco's idea: instead of =r 0r X X X
use =r 0 X X X and = r X X X, that should ensure that
no partial overlap happens, just full overlap or nothing.

That's what arm_emit_coreregs_64bit_shift
and arm_ashldi3_1bit can handle.

Who will do it?

[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0

2016-10-20 Thread rearnsha at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041

--- Comment #6 from Richard Earnshaw  ---
(In reply to Bernd Edlinger from comment #5)
> (In reply to Wilco from comment #4)
> > However dealing with partial overlaps is complex so maybe the best option
> > would be to add alternatives to di3_neon to either allow full overlap
> > "r 0 X X X" or no overlap " r X  X X". The shift code works with full
> > overlap.
> 
> That sounds like a good idea.
> 
> Then this condition in di3_neon could go away too:
> 
> && (!reg_overlap_mentioned_p (operands[0], operands[1])
> || REGNO (operands[0]) == REGNO (operands[1])))

Note that we don't want to restrict complete overlaps, only partial overlaps. 
Restricting complete overlaps leads to significant increase in register
pressure and a lot of redundant copying.

[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0

2016-10-19 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041

--- Comment #5 from Bernd Edlinger  ---
(In reply to Wilco from comment #4)
> However dealing with partial overlaps is complex so maybe the best option
> would be to add alternatives to di3_neon to either allow full overlap
> "r 0 X X X" or no overlap " r X  X X". The shift code works with full
> overlap.

That sounds like a good idea.

Then this condition in di3_neon could go away too:

&& (!reg_overlap_mentioned_p (operands[0], operands[1])
|| REGNO (operands[0]) == REGNO (operands[1])))

[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0

2016-10-19 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041

--- Comment #4 from Wilco  ---
(In reply to Bernd Edlinger from comment #3)
> (In reply to Wilco from comment #2)
> > (In reply to Bernd Edlinger from comment #1)
> > > some background about this bug can be found here:
> > > 
> > > https://gcc.gnu.org/ml/gcc-patches/2016-10/msg01561.html
> > 
> > The di3_neon pattern does not constrain the input not to overlap with
> > the output for immediate shifts, and the arm_emit_coreregs_64bit_shift code
> > does not handle partial overlaps. However it is feasible to fix that by
> > swapping the order for the various cases.
> 
> from di3_neon:
> 
> if (INTVAL (operands[2]) < 1)
>   {
> emit_insn (gen_movdi (operands[0], operands[1]));
> DONE;
>   }
> 
> Will the movdi pattern work with partial overlaps?
> It does basically this:
> 
>   emit_move_insn (gen_lowpart (SImode, operands[0]),
>   gen_lowpart (SImode, operands[1]));
>   emit_move_insn (gen_highpart (SImode, operands[0]),
>   gen_highpart (SImode, operands[1]));

I think it's OK - that code only triggers if a movdi has a physical register
that is not a valid DI register which is not the case we're dealing with. movdi
has a split that does check for partial overlap around line 5900 in arm.md:

  /* Handle a partial overlap.  */
  if (rtx_equal_p (operands[0], operands[3]))
 ...

However dealing with partial overlaps is complex so maybe the best option would
be to add alternatives to di3_neon to either allow full overlap "r 0 X X
X" or no overlap " r X  X X". The shift code works with full overlap.

[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0

2016-10-19 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041

--- Comment #3 from Bernd Edlinger  ---
(In reply to Wilco from comment #2)
> (In reply to Bernd Edlinger from comment #1)
> > some background about this bug can be found here:
> > 
> > https://gcc.gnu.org/ml/gcc-patches/2016-10/msg01561.html
> 
> The di3_neon pattern does not constrain the input not to overlap with
> the output for immediate shifts, and the arm_emit_coreregs_64bit_shift code
> does not handle partial overlaps. However it is feasible to fix that by
> swapping the order for the various cases.

from di3_neon:

if (INTVAL (operands[2]) < 1)
  {
emit_insn (gen_movdi (operands[0], operands[1]));
DONE;
  }

Will the movdi pattern work with partial overlaps?
It does basically this:

  emit_move_insn (gen_lowpart (SImode, operands[0]),
  gen_lowpart (SImode, operands[1]));
  emit_move_insn (gen_highpart (SImode, operands[0]),
  gen_highpart (SImode, operands[1]));

[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0

2016-10-19 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #2 from Wilco  ---
(In reply to Bernd Edlinger from comment #1)
> some background about this bug can be found here:
> 
> https://gcc.gnu.org/ml/gcc-patches/2016-10/msg01561.html

The di3_neon pattern does not constrain the input not to overlap with
the output for immediate shifts, and the arm_emit_coreregs_64bit_shift code
does not handle partial overlaps. However it is feasible to fix that by
swapping the order for the various cases.

[Bug rtl-optimization/78041] Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0

2016-10-19 Thread bernd.edlinger at hotmail dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78041

--- Comment #1 from Bernd Edlinger  ---
some background about this bug can be found here:

https://gcc.gnu.org/ml/gcc-patches/2016-10/msg01561.html