Re: Tighten LRA test for reloading the inner reg of a paradoxical subreg

2018-06-11 Thread Jeff Law
On 05/30/2018 06:31 AM, Richard Sandiford wrote:
> Kyrill  Tkachov  writes:
>> Hi Richard,
>>
>> On 29/05/18 15:26, Richard Sandiford wrote:
>>> Kyrill  Tkachov  writes:
 Hi all,

 The recent changes to aarch64_expand_vector_init cause an ICE in the
 attached testcase.  The register allocator "ICEs with Max. number of
 generated reload insns per insn is achieved (90)"

 That is because aarch64_expand_vector_init creates a paradoxical
 subreg to move a DImode value
 into a V2DI vector:
 (insn 74 72 76 8 (set (reg:V2DI 287 [ _166 ])
   (subreg:V2DI (reg/v/f:DI 112 [ d ]) 0)) 1050
 {*aarch64_simd_movv2di}

 This is done because we want to express that the whole of the V2DI
 vector will be written so that init-regs doesn't try to
 zero-initialise it before we overwrite each lane individually anyway.

 This can go bad for because if the DImode value is allocated in, say,
 x30: the last register in that register class, the V2DI subreg of that
 isn't valid or meaningful and that seems to cause the trouble.

 It's kinda hard to know what the right solution for this is.
 We could emit a duplicate of the value into all the lanes of the
 vector, but we have tests that test against that
 (we're trying to avoid unnecessary duplicates)

 What this patch does is it defines a pattern for moving a scalar into
 lane 0 of a vector using a simple FMOV or LDR and represents that as a
 merging with a vector of zeroes.  That way, the instruction represents
 a full write of the destination vector but doesn't "read" more bits
 from the source than necessary. The zeroing effect is also a more
 accurate representation of the semantics of FMOV.
>>> This feels like a hack.  Either the paradoxical subreg of the pseudo
>>> is invalid for some reason (in which case we should ICE when it's formed,
>>> not just in the case of x30 being allocated) or the subreg is valid,
>>> in which case the RA should handle it correctly (and the backend should
>>> give it the information it needs to do that).
>>>
>>> I could see the argument for ignoring the problem for expediency if the
>>> patch was a clean-up in its own right, but I think it's wrong to add so
>>> much code to paper over a bug.
>>
>> I see what you mean. Do you have any thoughts on where in RA we'd go
>> about fixing this?  Since I don't know my way around RA I tried in the
>> place I was most comfortable changing :)
> 
> Ah, but everyone who's patched the RA had to patch it for the
> first time. :-)  And it's not that scary.  But anyway...
> 
> The original insn was:
> 
> (insn 74 72 76 8 (set (reg:V2DI 287 [ _166 ])
> (subreg:V2DI (reg/v/f:DI 112 [ d ]) 0)) 1060 {*aarch64_simd_movv2di}
>  (nil))
> 
> which IRA converted to:
> 
> (insn 74 72 580 8 (set (reg:V2DI 287 [ _166 ])
> (subreg:V2DI (reg/v/f:DI 517 [orig:112 d ] [112]) 0)) 1060 
> {*aarch64_simd_movv2di}
>  (nil))
> 
> after creating loop allocnos.  It happens that the ALLOCNO_WMODEs for
> both 112 and 517 were not set to V2DI due to another bug that I'll post
> a separate patch for, but we nevertheless got a valid allocation of
> register 1.
> 
> LRA's first try at constraining the instruction gave:
> 
>  Choosing alt 5 in insn 74:  (0) ?w  (1) r {*aarch64_simd_movv2di}
> 
> at which point all was good.  But LRA later decided it needed
> to spill r517:
> 
> Spill r517 after risky transformations
> 
> so the next constraint attempt gave:
> 
>  Choosing alt 0 in insn 74:  (0) =w  (1) m {*aarch64_simd_movv2di}
> 
> which was still good.  Then during inheritance we had:
> 
>   Creating newreg=672 from oldreg=517, assigning class GENERAL_REGS to 
> inheritance r672
> Original reg change 517->672 (bb8):
>74: r287:V2DI=r672:DI#0
> Add inheritance<-original before:
>   939: r672:DI=r517:DI
> 
> Inheritance reuse change 517->672 (bb8):
>   620: r572:DI=r672:DI
>   REG_DEAD r672:DI
> 
> Use smallest class of POINTER_REGS and GENERAL_REGS
>   Creating newreg=673 from oldreg=517, assigning class POINTER_REGS to 
> inheritance r673
> Original reg change 517->673 (bb8):
>   936: r669:DI=r673:DI
> Add inheritance<-original before:
>   940: r673:DI=r517:DI
> 
> ("Use smallest class of POINTER_REGS and GENERAL_REGS" ought to
> give GENERAL_REGS.  That might be a missed optimisation, and probably
> due to both classes having the same number of allocatable registers.
> I'll look at that as a follow-on.)
> 
> Thus LRA created two inheritance registers for r517, one (r673)
> that included the unallocatable x31 and another (r672) that didn't.
> The r672 references included the paradoxical subreg in insn 74 but the
> r673 ones didn't.  LRA then allocated x30 to r673, which was a valid
> choice.
> 
> Later LRA decided to "undo" the inheritance for insn 620, but because
> of the double inheritance, it got confused as to what the 

Tighten LRA test for reloading the inner reg of a paradoxical subreg

2018-05-30 Thread Richard Sandiford
Kyrill  Tkachov  writes:
> Hi Richard,
>
> On 29/05/18 15:26, Richard Sandiford wrote:
>> Kyrill  Tkachov  writes:
>>> Hi all,
>>>
>>> The recent changes to aarch64_expand_vector_init cause an ICE in the
>>> attached testcase.  The register allocator "ICEs with Max. number of
>>> generated reload insns per insn is achieved (90)"
>>>
>>> That is because aarch64_expand_vector_init creates a paradoxical
>>> subreg to move a DImode value
>>> into a V2DI vector:
>>> (insn 74 72 76 8 (set (reg:V2DI 287 [ _166 ])
>>>   (subreg:V2DI (reg/v/f:DI 112 [ d ]) 0)) 1050
>>> {*aarch64_simd_movv2di}
>>>
>>> This is done because we want to express that the whole of the V2DI
>>> vector will be written so that init-regs doesn't try to
>>> zero-initialise it before we overwrite each lane individually anyway.
>>>
>>> This can go bad for because if the DImode value is allocated in, say,
>>> x30: the last register in that register class, the V2DI subreg of that
>>> isn't valid or meaningful and that seems to cause the trouble.
>>>
>>> It's kinda hard to know what the right solution for this is.
>>> We could emit a duplicate of the value into all the lanes of the
>>> vector, but we have tests that test against that
>>> (we're trying to avoid unnecessary duplicates)
>>>
>>> What this patch does is it defines a pattern for moving a scalar into
>>> lane 0 of a vector using a simple FMOV or LDR and represents that as a
>>> merging with a vector of zeroes.  That way, the instruction represents
>>> a full write of the destination vector but doesn't "read" more bits
>>> from the source than necessary. The zeroing effect is also a more
>>> accurate representation of the semantics of FMOV.
>> This feels like a hack.  Either the paradoxical subreg of the pseudo
>> is invalid for some reason (in which case we should ICE when it's formed,
>> not just in the case of x30 being allocated) or the subreg is valid,
>> in which case the RA should handle it correctly (and the backend should
>> give it the information it needs to do that).
>>
>> I could see the argument for ignoring the problem for expediency if the
>> patch was a clean-up in its own right, but I think it's wrong to add so
>> much code to paper over a bug.
>
> I see what you mean. Do you have any thoughts on where in RA we'd go
> about fixing this?  Since I don't know my way around RA I tried in the
> place I was most comfortable changing :)

Ah, but everyone who's patched the RA had to patch it for the
first time. :-)  And it's not that scary.  But anyway...

The original insn was:

(insn 74 72 76 8 (set (reg:V2DI 287 [ _166 ])
(subreg:V2DI (reg/v/f:DI 112 [ d ]) 0)) 1060 {*aarch64_simd_movv2di}
 (nil))

which IRA converted to:

(insn 74 72 580 8 (set (reg:V2DI 287 [ _166 ])
(subreg:V2DI (reg/v/f:DI 517 [orig:112 d ] [112]) 0)) 1060 
{*aarch64_simd_movv2di}
 (nil))

after creating loop allocnos.  It happens that the ALLOCNO_WMODEs for
both 112 and 517 were not set to V2DI due to another bug that I'll post
a separate patch for, but we nevertheless got a valid allocation of
register 1.

LRA's first try at constraining the instruction gave:

 Choosing alt 5 in insn 74:  (0) ?w  (1) r {*aarch64_simd_movv2di}

at which point all was good.  But LRA later decided it needed
to spill r517:

Spill r517 after risky transformations

so the next constraint attempt gave:

 Choosing alt 0 in insn 74:  (0) =w  (1) m {*aarch64_simd_movv2di}

which was still good.  Then during inheritance we had:

  Creating newreg=672 from oldreg=517, assigning class GENERAL_REGS to 
inheritance r672
Original reg change 517->672 (bb8):
   74: r287:V2DI=r672:DI#0
Add inheritance<-original before:
  939: r672:DI=r517:DI

Inheritance reuse change 517->672 (bb8):
  620: r572:DI=r672:DI
  REG_DEAD r672:DI

Use smallest class of POINTER_REGS and GENERAL_REGS
  Creating newreg=673 from oldreg=517, assigning class POINTER_REGS to 
inheritance r673
Original reg change 517->673 (bb8):
  936: r669:DI=r673:DI
Add inheritance<-original before:
  940: r673:DI=r517:DI

("Use smallest class of POINTER_REGS and GENERAL_REGS" ought to
give GENERAL_REGS.  That might be a missed optimisation, and probably
due to both classes having the same number of allocatable registers.
I'll look at that as a follow-on.)

Thus LRA created two inheritance registers for r517, one (r673)
that included the unallocatable x31 and another (r672) that didn't.
The r672 references included the paradoxical subreg in insn 74 but the
r673 ones didn't.  LRA then allocated x30 to r673, which was a valid
choice.

Later LRA decided to "undo" the inheritance for insn 620, but because
of the double inheritance, it got confused as to what the original
situation was, and made insn 74 use the other inheritance register
instead of r517:

** Undoing inheritance #2: **

Inherit 11 out of 12 (91.67%)
   Insn after restoring regs:
  620: r572:DI=r517:DI
  REG_DEAD