Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-07-02 Thread Francesco Petrogalli
Dear all,

I just want to let you know that we have just published the final version of
the Vector Function ABI specification. The call-clobbered and call-preserved
register lists have been updated (see section 2.1).

The document is located at the same address:

https://developer.arm.com/products/software-development-tools/hpc/arm-compiler-for-hpc/vector-function-abi

Kind regards,

Francesco

> On May 16, 2018, at 11:21 AM, Steve Ellcey  wrote:
> 
> On Tue, 2018-05-15 at 18:29 +, Francesco Petrogalli wrote:
> 
>> Hi Steve,
>> 
>> I am happy to let you know that the Vector Function ABI for AArch64
>> is now public and available via the link at [1].
>> 
>> Don’t hesitate to contact me in case you have any questions.
>> 
>> Kind regards,
>> 
>> Francesco
>> 
>> [1] https://developer.arm.com/products/software-development-tools/hpc
>> /arm-compiler-for-hpc/vector-function-abi
>> 
>>> 
>>> Steve Ellcey
>>> sell...@cavium.com
> 
> Thanks for publishing this Francesco, it looks like the main issue for
> GCC is that the Vector Function ABI has different caller saved / callee
> saved register conventions than the standard ARM calling convention.
> 
> If I understand things correctly, in the standard calling convention
> the callee will only save the bottom 64 bits of V8-V15 and so the
> caller needs to save those registers if it is using the top half.  In
> the Vector calling convention the callee will save all 128 bits of
> these registers (and possibly more registers) so the caller does not
> have to save these registers at all, even if it is using all 128 bits
> of them.
> 
> It doesn't look like GCC has any existing mechanism for having different
> sets of caller saved/callee saved registers depending on the function
> attributes of the calling or called function.
> 
> Changing what registers a callee function saves and restores shouldn't
> be too difficult since that can be done when generating the prologue
> and epilogue code but changing what registers a caller saves/restores
> when doing the call seems trickier.  The macro
> TARGET_HARD_REGNO_CALL_PART_CLOBBERED doesn't know anything about the
> function being called.  It returns true/false depending on just the
> register number and mode.
> 
> Steve Ellcey
> sell...@cavium.com



Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-06-11 Thread Jeff Law
On 05/31/2018 04:39 AM, Alan Hayward wrote:
> (Missed this thread initially due to incorrect email address)
Sorry.  Good to hear you're still interested in figuring this out.

> 
>> On 29 May 2018, at 11:05, Richard Sandiford  
>> wrote:
>>
>> Jeff Law  writes:
>>> Now that we're in stage1 I do want to revisit the CLOBBER_HIGH stuff.
>>> When we left things I think we were trying to decide between
>>> CLOBBER_HIGH and clobbering the appropriate subreg.  The problem with
>>> the latter is the dataflow we compute is inaccurate (overly pessimistic)
>>> so that'd have to be fixed.
> 
> Yes, I want to get back to looking at this again, however I’ve been busy
> elsewhere.
Similarly.

> 
>>
>> The clobbered part of the register in this case is a high-part subreg,
>> which is ill-formed for single registers.  It would also be difficult
>> to represent in terms of the mode, since there are no defined modes for
>> what can be stored in the high part of an SVE register.  For 128-bit
>> SVE that mode would have zero bits. :-)
>>
>> I thought the alternative suggestion was instead to have:
>>
>>   (set (reg:M X) (reg:M X))
>>
>> when X is preserved in mode M but not in wider modes.  But that seems
>> like too much of a special case to me, both in terms of the source and
>> the destination:
> 
> Agreed. When I looked at doing it that way back in Jan, my conclusion was
> that if we did it that way we end up with more or less the same code but
> instead of:
> 
> if (GET_CODE (setter) == CLOBBER_HIGH
>     && reg_is_clobbered_by_clobber_high (REGNO (dest),
>                                          GET_MODE (rsp->last_set_value)))
> 
> it now becomes something like:
> 
> if (GET_CODE (setter) == SET
>     && REG_P (dest) && HARD_REGISTER_P (dest)
>     && REG_P (src) && REGNO (dest) == REGNO (src)
>     && reg_is_clobbered_by_self_set (REGNO (dest),
>                                      GET_MODE (rsp->last_set_value)))
> 
> Ok, some of that code can go into a macro, but it feels much clearer to
> explicitly check for CLOBBER_HIGH rather than applying some special semantics
> to a specific SET case.
Then let's return to the CLOBBER_HIGH approach.  The hope was that most
of the places where you had to introduce CLOBBER_HIGH would "just work"
with the self-set approach.  If that's not the case, then there's really
nothing to be gained with self-set.

I suggest you get the patch updated for the trunk and repost now that
we're in broad agreement that self-set is a rathole.

jeff


Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-06-11 Thread Jeff Law
On 05/29/2018 04:05 AM, Richard Sandiford wrote:
> Jeff Law  writes:
>> Now that we're in stage1 I do want to revisit the CLOBBER_HIGH stuff.
>> When we left things I think we were trying to decide between
>> CLOBBER_HIGH and clobbering the appropriate subreg.  The problem with
>> the latter is the dataflow we compute is inaccurate (overly pessimistic)
>> so that'd have to be fixed.
> 
> The clobbered part of the register in this case is a high-part subreg,
> which is ill-formed for single registers.  It would also be difficult
> to represent in terms of the mode, since there are no defined modes for
> what can be stored in the high part of an SVE register.  For 128-bit
> SVE that mode would have zero bits. :-)
> 
> I thought the alternative suggestion was instead to have:
> 
>(set (reg:M X) (reg:M X))
You're right.  I mis-remembered.  It happens far too often these days.

> 
> when X is preserved in mode M but not in wider modes.  But that seems
> like too much of a special case to me, both in terms of the source and
> the destination:
Well, the hope was this would "just work" without having to introduce a
new RTX code and teach all the RTL passes about it.  The self-assignment
has the right semantics, but I believe Alan showed that the DF
infrastructure pessimized it horribly.  At which point the question
became how painful would it be to fix DF and compare that to the pain of
adding a new RTX code.




> 
> - On the destination side, a SET normally provides something for later
>   instructions to use, whereas here the effect is intended to be the
>   opposite: the instruction has no effect at all on a value of mode M
>   in X.  As you say, this would pessimise df without specific handling.
>   But I think all optimisations that look for the definition of a value
>   would need to be taught to "look through" this set to find the real
>   definition of (reg:M X) (or any value of a mode no larger than M in X).
>   Very few passes use the df def-uses chains for this due to its high cost.
But how often do we really need to look for the REG in a larger mode than
M?  Yea, it happens occasionally, but I don't think it's pervasive and
the cases where we do probably aren't *that* important performance-wise.

Though at a conceptual level I agree.  SET is meant to provide something
for later consumption, we'd be abusing it.


> 
>   More fundamentally, it should be possible in RTL to express an
>   instruction J that *does* read X in mode M and clobbers its high part.
>   If we use the SET above to represent the clobber, and treat the rhs use
>   as special, then presumably J would need two uses of X, one "dummy" one
>   on the no-op SET and one "real" one on some other SET (or perhaps in a
>   top-level USE).  Having the number of uses determine this seems
>   a bit awkward.
> 
> IMO CLOBBER and SET have different semantics for good reason: CLOBBER
> represents an optimisation barrier for things that care about the value
> of a certain rtx object, while SET represents a productive effect or
> side-effect.  The effect we want here is the same as a normal clobber,
> except that the clobber is mode-dependent.
I largely agree.  It was really a matter of whether or not using the
self-set would simplify the implementation in a significant way.

jeff



Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-05-31 Thread Alan Hayward
(Missed this thread initially due to incorrect email address)

> On 29 May 2018, at 11:05, Richard Sandiford  
> wrote:
> 
> Jeff Law  writes:
>> Now that we're in stage1 I do want to revisit the CLOBBER_HIGH stuff.
>> When we left things I think we were trying to decide between
>> CLOBBER_HIGH and clobbering the appropriate subreg.  The problem with
>> the latter is the dataflow we compute is inaccurate (overly pessimistic)
>> so that'd have to be fixed.

Yes, I want to get back to looking at this again, however I’ve been busy
elsewhere.

> 
> The clobbered part of the register in this case is a high-part subreg,
> which is ill-formed for single registers.  It would also be difficult
> to represent in terms of the mode, since there are no defined modes for
> what can be stored in the high part of an SVE register.  For 128-bit
> SVE that mode would have zero bits. :-)
> 
> I thought the alternative suggestion was instead to have:
> 
>   (set (reg:M X) (reg:M X))
> 
> when X is preserved in mode M but not in wider modes.  But that seems
> like too much of a special case to me, both in terms of the source and
> the destination:

Agreed. When I looked at doing it that way back in Jan, my conclusion was
that if we did it that way we end up with more or less the same code but
instead of:

if (GET_CODE (setter) == CLOBBER_HIGH
    && reg_is_clobbered_by_clobber_high (REGNO (dest),
                                         GET_MODE (rsp->last_set_value)))

it now becomes something like:

if (GET_CODE (setter) == SET
    && REG_P (dest) && HARD_REGISTER_P (dest)
    && REG_P (src) && REGNO (dest) == REGNO (src)
    && reg_is_clobbered_by_self_set (REGNO (dest),
                                     GET_MODE (rsp->last_set_value)))

Ok, some of that code can go into a macro, but it feels much clearer to
explicitly check for CLOBBER_HIGH rather than applying some special semantics
to a specific SET case.
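
To make the comparison concrete, here is a minimal sketch of the two checks on a toy model. It is not GCC code: the toy_pattern type, the TOY_* codes and the helper names are all hypothetical stand-ins, and modes are reduced to a byte count.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for RTL; none of these types or helpers are GCC's.  */
enum toy_code { TOY_SET, TOY_CLOBBER, TOY_CLOBBER_HIGH };

struct toy_pattern
{
  enum toy_code code;
  unsigned dest_regno;   /* register written or clobbered */
  int src_regno;         /* source register of a TOY_SET, or -1 */
  unsigned mode_bytes;   /* size of the mode M named in the pattern */
};

/* A value of VALUE_BYTES bytes is lost when only the low PRESERVED_BYTES
   bytes of its register are kept and the value does not fit in them.  */
static bool
high_part_kills (unsigned preserved_bytes, unsigned value_bytes)
{
  return value_bytes > preserved_bytes;
}

/* Explicit encoding: one code check, then the size comparison.  */
static bool
killed_by_clobber_high (const struct toy_pattern *p, unsigned regno,
                        unsigned value_bytes)
{
  assert (p != 0);
  return p->code == TOY_CLOBBER_HIGH
         && p->dest_regno == regno
         && high_part_kills (p->mode_bytes, value_bytes);
}

/* Self-set encoding: the no-op SET has to be recognised first.  */
static bool
killed_by_self_set (const struct toy_pattern *p, unsigned regno,
                    unsigned value_bytes)
{
  assert (p != 0);
  return p->code == TOY_SET
         && p->src_regno >= 0
         && (unsigned) p->src_regno == p->dest_regno
         && p->dest_regno == regno
         && high_part_kills (p->mode_bytes, value_bytes);
}
```

Both predicates end in the same size comparison; the self-set variant only adds the pattern matching needed to tell a no-op SET apart from an ordinary move, which is the extra special-casing described above.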

Alan.



Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-05-29 Thread Richard Sandiford
Jeff Law  writes:
> Now that we're in stage1 I do want to revisit the CLOBBER_HIGH stuff.
> When we left things I think we were trying to decide between
> CLOBBER_HIGH and clobbering the appropriate subreg.  The problem with
> the latter is the dataflow we compute is inaccurate (overly pessimistic)
> so that'd have to be fixed.

The clobbered part of the register in this case is a high-part subreg,
which is ill-formed for single registers.  It would also be difficult
to represent in terms of the mode, since there are no defined modes for
what can be stored in the high part of an SVE register.  For 128-bit
SVE that mode would have zero bits. :-)

I thought the alternative suggestion was instead to have:

   (set (reg:M X) (reg:M X))

when X is preserved in mode M but not in wider modes.  But that seems
like too much of a special case to me, both in terms of the source and
the destination:

- On the destination side, a SET normally provides something for later
  instructions to use, whereas here the effect is intended to be the
  opposite: the instruction has no effect at all on a value of mode M
  in X.  As you say, this would pessimise df without specific handling.
  But I think all optimisations that look for the definition of a value
  would need to be taught to "look through" this set to find the real
  definition of (reg:M X) (or any value of a mode no larger than M in X).
  Very few passes use the df def-uses chains for this due to its high cost.

- On the source side, the instruction doesn't actually care what's in X,
  but nevertheless appears to use it.  This means that most passes would
  need to be taught that a use of X on the rhs of a no-op SET is special
  and should usually be ignored.

  More fundamentally, it should be possible in RTL to express an
  instruction J that *does* read X in mode M and clobbers its high part.
  If we use the SET above to represent the clobber, and treat the rhs use
  as special, then presumably J would need two uses of X, one "dummy" one
  on the no-op SET and one "real" one on some other SET (or perhaps in a
  top-level USE).  Having the number of uses determine this seems
  a bit awkward.

IMO CLOBBER and SET have different semantics for good reason: CLOBBER
represents an optimisation barrier for things that care about the value
of a certain rtx object, while SET represents a productive effect or
side-effect.  The effect we want here is the same as a normal clobber,
except that the clobber is mode-dependent.

Thanks,
Richard


Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-05-27 Thread Jeff Law
On 05/26/2018 04:09 AM, Richard Sandiford wrote:
> Steve Ellcey  writes:
>> On Wed, 2018-05-16 at 22:11 +0100, Richard Sandiford wrote:
>>>  
>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED is the only current way
>>> of saying that an rtl instruction preserves the low part of a
>>> register but clobbers the high part.  We would need something like
>>> Alan H's CLOBBER_HIGH patches to do it using explicit clobbers.
>>>
>>> Another approach would be to piggy-back on the -fipa-ra
>>> infrastructure
>>> and record that vector PCS functions only clobber Q0-Q7.  If -fipa-ra
>>> knows that a function doesn't clobber Q8-Q15 then that should
>>> override
>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  (I'm not sure whether it does
>>> in practice, but it should :-)  And if it doesn't that's a bug that's
>>> worth fixing for its own sake.)
>>>
>>> Thanks,
>>> Richard
>>
>> Alan,
>>
>> I have been looking at your CLOBBER_HIGH patches to see if they
>> might be helpful in implementing the ARM SIMD Vector ABI in GCC.
>> I have also been looking at the -fipa-ra flag and how it works.
>>
>> I was wondering if you considered using the ipa-ra infrastructure
>> for the SVE work that you are currently trying to support with 
>> the CLOBBER_HIGH macro?
>>
>> My current thought for the ABI work is to mark all the floating
>> point / vector registers as caller saved (the lower half of V8-V15
>> are currently callee saved) and remove
>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED.
>> This should work but would be inefficient.
>>
>> The next step would be to split get_call_reg_set_usage up into
>> two functions so that I don't have to pass in a default set of
>> registers.  One function would return call_used_reg_set by
>> default (but could return a smaller set if it had actual used
>> register information) and the other would return
>> regs_invalidated_by_call by default (but could also return a smaller set).
>>
>> Next I would add a 'largest mode used' array to call_cgraph_rtl_info
>> structure in addition to the current function_used_regs register
>> set.
>>
>> Then I could turn the get_call_reg_set_usage replacement functions
>> into target specific functions and with the information in the
>> call_cgraph_rtl_info structure and any simd attribute information on
>> a function I could modify what registers are really being used/invalidated
>> without being saved.
>>
>> If the called function only uses the bottom half of a register it would not
>> be marked as used/invalidated.  If it uses the entire register and the
>> function is not marked as simd, then the register would be marked as
>> used/invalidated.  If the function was marked as simd the register would not
>> be marked because a simd function would save both the upper and lower halves
>> of a callee saved register (whereas a non simd function would only save the
>> lower half).
>>
>> Does this sound like something that could be used in place of your 
>> CLOBBER_HIGH patch?
> 
> One of the advantages of CLOBBER_HIGH is that it can be attached to
> arbitrary instructions, not just calls.  The motivating example was
> tlsdesc_small_, which isn't treated as a call but as a normal
> instruction.  (And I don't think we want to change that, since it's much
> easier for rtl optimisers to deal with normal instructions compared to
> calls.  In general a call is part of a longer sequence of instructions
> that includes setting up arguments, etc.)
Yea.  I don't think we want to change tlsdesc*.  Representing them as
normal insns rather than calls seems reasonable to me.

Now that we're in stage1 I do want to revisit the CLOBBER_HIGH stuff.
When we left things I think we were trying to decide between
CLOBBER_HIGH and clobbering the appropriate subreg.  The problem with
the latter is the dataflow we compute is inaccurate (overly pessimistic)
so that'd have to be fixed.

Jeff


Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-05-26 Thread Segher Boessenkool
On Sat, May 26, 2018 at 11:09:24AM +0100, Richard Sandiford wrote:
> On the wider point about changing the way call clobber information
> is represented: I agree it would be good to generalise what we have
> now.  But if possible I think we should avoid target hooks that take
> a specific call, and instead make it an inherent part of the call insn
> itself, much like CALL_INSN_FUNCTION_USAGE is now.  E.g. we could add
> a field that points to an ABI description, with -fipa-ra effectively
> creating ad-hoc ABIs.  That ABI description could start out with
> whatever we think is relevant now and could grow over time.

Somewhat related: there still is PR68150 open for problems with
HARD_REGNO_CALL_PART_CLOBBERED in postreload-gcse (it ignores it).


Segher


Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-05-26 Thread Richard Sandiford
Steve Ellcey  writes:
> On Wed, 2018-05-16 at 22:11 +0100, Richard Sandiford wrote:
>> 
>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED is the only current way
>> of saying that an rtl instruction preserves the low part of a
>> register but clobbers the high part.  We would need something like
>> Alan H's CLOBBER_HIGH patches to do it using explicit clobbers.
>> 
>> Another approach would be to piggy-back on the -fipa-ra
>> infrastructure
>> and record that vector PCS functions only clobber Q0-Q7.  If -fipa-ra
>> knows that a function doesn't clobber Q8-Q15 then that should
>> override
>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  (I'm not sure whether it does
>> in practice, but it should :-)  And if it doesn't that's a bug that's
>> worth fixing for its own sake.)
>> 
>> Thanks,
>> Richard
>
> Alan,
>
> I have been looking at your CLOBBER_HIGH patches to see if they
> might be helpful in implementing the ARM SIMD Vector ABI in GCC.
> I have also been looking at the -fipa-ra flag and how it works.
>
> I was wondering if you considered using the ipa-ra infrastructure
> for the SVE work that you are currently trying to support with 
> the CLOBBER_HIGH macro?
>
> My current thought for the ABI work is to mark all the floating
> point / vector registers as caller saved (the lower half of V8-V15
> are currently callee saved) and remove
> TARGET_HARD_REGNO_CALL_PART_CLOBBERED.
> This should work but would be inefficient.
>
> The next step would be to split get_call_reg_set_usage up into
> two functions so that I don't have to pass in a default set of
> registers.  One function would return call_used_reg_set by
> default (but could return a smaller set if it had actual used
> register information) and the other would return
> regs_invalidated_by_call by default (but could also return a smaller set).
>
> Next I would add a 'largest mode used' array to call_cgraph_rtl_info
> structure in addition to the current function_used_regs register
> set.
>
> Then I could turn the get_call_reg_set_usage replacement functions
> into target specific functions and with the information in the
> call_cgraph_rtl_info structure and any simd attribute information on
> a function I could modify what registers are really being used/invalidated
> without being saved.
>
> If the called function only uses the bottom half of a register it would not
> be marked as used/invalidated.  If it uses the entire register and the
> function is not marked as simd, then the register would be marked as
> used/invalidated.  If the function was marked as simd the register would not
> be marked because a simd function would save both the upper and lower halves
> of a callee saved register (whereas a non simd function would only save the
> lower half).
>
> Does this sound like something that could be used in place of your 
> CLOBBER_HIGH patch?

One of the advantages of CLOBBER_HIGH is that it can be attached to
arbitrary instructions, not just calls.  The motivating example was
tlsdesc_small_, which isn't treated as a call but as a normal
instruction.  (And I don't think we want to change that, since it's much
easier for rtl optimisers to deal with normal instructions compared to
calls.  In general a call is part of a longer sequence of instructions
that includes setting up arguments, etc.)

The other use case (not implemented in the posted patches) would be
to represent the effect of syscalls, which clobber the "SVE part"
of all vector registers.  In that case the clobber would need to be
attached to an inline asm insn.

On the wider point about changing the way call clobber information
is represented: I agree it would be good to generalise what we have
now.  But if possible I think we should avoid target hooks that take
a specific call, and instead make it an inherent part of the call insn
itself, much like CALL_INSN_FUNCTION_USAGE is now.  E.g. we could add
a field that points to an ABI description, with -fipa-ra effectively
creating ad-hoc ABIs.  That ABI description could start out with
whatever we think is relevant now and could grow over time.
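
As a rough illustration of that idea, the sketch below attaches an ABI description to a toy call insn and answers the clobber question from the insn alone. Everything in it is hypothetical (abi_desc, toy_call_insn, refine_abi and the per-register byte counts are assumptions made for the example, not existing GCC structures), and only the V8-V15 behaviour discussed in this thread is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_VREGS 32

/* Hypothetical ABI description: for each vector register, how many
   low-order bytes survive a call.  */
struct abi_desc
{
  unsigned preserved_bytes[NUM_VREGS];
};

/* Hypothetical call insn that carries a pointer to its ABI, in the same
   spirit as CALL_INSN_FUNCTION_USAGE being part of the insn.  */
struct toy_call_insn
{
  const struct abi_desc *abi;
};

/* Fill ABI so that V8-V15 preserve LOW_BYTES bytes and every other
   vector register preserves nothing.  */
static void
init_abi (struct abi_desc *abi, unsigned low_bytes)
{
  memset (abi, 0, sizeof *abi);
  for (unsigned r = 8; r <= 15; r++)
    abi->preserved_bytes[r] = low_bytes;
}

/* Derive an ad-hoc ABI from BASE for a callee that is known (for example
   from an -fipa-ra style analysis) to leave VREG untouched.  */
static struct abi_desc
refine_abi (const struct abi_desc *base, unsigned vreg, unsigned reg_bytes)
{
  assert (vreg < NUM_VREGS);
  struct abi_desc adhoc = *base;
  adhoc.preserved_bytes[vreg] = reg_bytes;
  return adhoc;
}

/* The query needs only the insn, not a hook that inspects the callee.  */
static bool
call_clobbers (const struct toy_call_insn *call, unsigned vreg,
               unsigned value_bytes)
{
  assert (vreg < NUM_VREGS);
  return value_bytes > call->abi->preserved_bytes[vreg];
}
```

With init_abi (&abi, 8) standing in for the standard convention and init_abi (&abi, 16) for the vector one, the same call_clobbers query gives different answers for a 16-byte value in V8 depending only on which description the call points to.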

Thanks,
Richard


Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-05-24 Thread Steve Ellcey
On Wed, 2018-05-16 at 22:11 +0100, Richard Sandiford wrote:
> 
> TARGET_HARD_REGNO_CALL_PART_CLOBBERED is the only current way
> of saying that an rtl instruction preserves the low part of a
> register but clobbers the high part.  We would need something like
> Alan H's CLOBBER_HIGH patches to do it using explicit clobbers.
> 
> Another approach would be to piggy-back on the -fipa-ra
> infrastructure
> and record that vector PCS functions only clobber Q0-Q7.  If -fipa-ra
> knows that a function doesn't clobber Q8-Q15 then that should
> override
> TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  (I'm not sure whether it does
> in practice, but it should :-)  And if it doesn't that's a bug that's
> worth fixing for its own sake.)
> 
> Thanks,
> Richard

Alan,

I have been looking at your CLOBBER_HIGH patches to see if they
might be helpful in implementing the ARM SIMD Vector ABI in GCC.
I have also been looking at the -fipa-ra flag and how it works.

I was wondering if you considered using the ipa-ra infrastructure
for the SVE work that you are currently trying to support with 
the CLOBBER_HIGH macro?

My current thought for the ABI work is to mark all the floating
point / vector registers as caller saved (the lower half of V8-V15
are currently callee saved) and remove
TARGET_HARD_REGNO_CALL_PART_CLOBBERED.
This should work but would be inefficient.

The next step would be to split get_call_reg_set_usage up into
two functions so that I don't have to pass in a default set of
registers.  One function would return call_used_reg_set by
default (but could return a smaller set if it had actual used
register information) and the other would return
regs_invalidated_by_call by default (but could also return a smaller set).

Next I would add a 'largest mode used' array to call_cgraph_rtl_info
structure in addition to the current function_used_regs register
set.

Then I could turn the get_call_reg_set_usage replacement functions
into target specific functions and with the information in the
call_cgraph_rtl_info structure and any simd attribute information on
a function I could modify what registers are really being used/invalidated
without being saved.

If the called function only uses the bottom half of a register it would not
be marked as used/invalidated.  If it uses the entire register and the
function is not marked as simd, then the register would be marked as
used/invalidated.  If the function was marked as simd the register would not
be marked because a simd function would save both the upper and lower halves
of a callee saved register (whereas a non simd function would only save the
lower half).
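
The rules in the paragraph above could be sketched roughly as follows. This is a toy model, not a patch: callee_summary and its fields are hypothetical (the real data would live in call_cgraph_rtl_info), modes are reduced to byte counts, and the treatment of registers outside V8-V15 as always invalidated when used is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_VREGS 32

/* Hypothetical per-callee summary; the layout is an assumption.  */
struct callee_summary
{
  bool is_simd;                            /* callee has the simd attribute */
  unsigned largest_mode_bytes[NUM_VREGS];  /* widest use per register, 0 if unused */
};

/* V8-V15 are the registers whose low half is callee saved today.  */
static bool
low_half_callee_saved (unsigned vreg)
{
  return vreg >= 8 && vreg <= 15;
}

/* True if the caller must treat VREG as used/invalidated by the call.  */
static bool
invalidated_by_call (const struct callee_summary *callee, unsigned vreg)
{
  assert (callee != 0 && vreg < NUM_VREGS);
  unsigned used = callee->largest_mode_bytes[vreg];

  if (used == 0)
    return false;             /* callee never touches the register */
  if (!low_half_callee_saved (vreg))
    return true;              /* ordinary caller-saved register */
  if (used <= 8)
    return false;             /* only the saved bottom half is used */
  return !callee->is_simd;    /* a simd callee saves both halves */
}
```

For a callee that uses all 16 bytes of V8, the result is true without the simd attribute and false with it, which is the distinction the 'largest mode used' array is meant to enable.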

Does this sound like something that could be used in place of your 
CLOBBER_HIGH patch?

Steve Ellcey
sell...@cavium.com


Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-05-16 Thread Richard Sandiford
Steve Ellcey  writes:
> On Wed, 2018-05-16 at 17:30 +0100, Richard Earnshaw (lists) wrote:
>> On 16/05/18 17:21, Steve Ellcey wrote:
>> > 
>> > It doesn't look like GCC has any existing mechanism for having different
>> > sets of caller saved/callee saved registers depending on the function
>> > attributes of the calling or called function.
>> > 
>> > Changing what registers a callee function saves and restores shouldn't
>> > be too difficult since that can be done when generating the prologue
>> > and epilogue code but changing what registers a caller saves/restores
>> > when doing the call seems trickier.  The macro
>> > TARGET_HARD_REGNO_CALL_PART_CLOBBERED doesn't know anything about the
>> > function being called.  It returns true/false depending on just the
>> > register number and mode.
>> > 
>> > Steve Ellcey
>> > sell...@cavium.com
>> > 
>> 
>> Actually, we can.  See, for example, the attribute((pcs)) for the ARM
>> port.  I think we could probably handle this automagically for the SVE
>> vector calling convention in AArch64.
>> 
>> R.
>
> Interesting, it looks like one could use aarch64_emit_call to emit
> extra use_reg / clobber_reg instructions but in this case we want to
> tell the caller that some registers are not being clobbered by the
> callee.  The ARM port does not
> define TARGET_HARD_REGNO_CALL_PART_CLOBBERED and that seemed like one
> of the most problematic issues with Aarch64.  Maybe we would have to
> undefine this for aarch64 and use explicit clobbers to say what
> floating point registers / vector registers are clobbered for each
> call?  I wonder how that would affect register allocation.

TARGET_HARD_REGNO_CALL_PART_CLOBBERED is the only current way
of saying that an rtl instruction preserves the low part of a
register but clobbers the high part.  We would need something like
Alan H's CLOBBER_HIGH patches to do it using explicit clobbers.

Another approach would be to piggy-back on the -fipa-ra infrastructure
and record that vector PCS functions only clobber Q0-Q7.  If -fipa-ra
knows that a function doesn't clobber Q8-Q15 then that should override
TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  (I'm not sure whether it does
in practice, but it should :-)  And if it doesn't that's a bug that's
worth fixing for its own sake.)

Thanks,
Richard


Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-05-16 Thread Steve Ellcey
On Wed, 2018-05-16 at 17:30 +0100, Richard Earnshaw (lists) wrote:
> On 16/05/18 17:21, Steve Ellcey wrote:
> > 
> > It doesn't look like GCC has any existing mechanism for having different
> > sets of caller saved/callee saved registers depending on the function
> > attributes of the calling or called function.
> > 
> > Changing what registers a callee function saves and restores shouldn't
> > be too difficult since that can be done when generating the prologue
> > and epilogue code but changing what registers a caller saves/restores
> > when doing the call seems trickier.  The macro
> > TARGET_HARD_REGNO_CALL_PART_CLOBBERED doesn't know anything about the
> > function being called.  It returns true/false depending on just the
> > register number and mode.
> > 
> > Steve Ellcey
> > sell...@cavium.com
> > 
> 
> Actually, we can.  See, for example, the attribute((pcs)) for the ARM
> port.  I think we could probably handle this automagically for the SVE
> vector calling convention in AArch64.
> 
> R.

Interesting, it looks like one could use aarch64_emit_call to emit
extra use_reg / clobber_reg instructions but in this case we want to
tell the caller that some registers are not being clobbered by the
callee.  The ARM port does not
define TARGET_HARD_REGNO_CALL_PART_CLOBBERED and that seemed like one
of the most problematic issues with Aarch64.  Maybe we would have to
undefine this for aarch64 and use explicit clobbers to say what
floating point registers / vector registers are clobbered for each
call?  I wonder how that would affect register allocation.

Steve Ellcey
sell...@cavium.com


Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-05-16 Thread Richard Earnshaw (lists)
On 16/05/18 17:21, Steve Ellcey wrote:
> On Tue, 2018-05-15 at 18:29 +, Francesco Petrogalli wrote:
> 
>> Hi Steve,
>>
>> I am happy to let you know that the Vector Function ABI for AArch64
>> is now public and available via the link at [1].
>>
>> Don’t hesitate to contact me in case you have any questions.
>>
>> Kind regards,
>>
>> Francesco
>>
>> [1] https://developer.arm.com/products/software-development-tools/hpc
>> /arm-compiler-for-hpc/vector-function-abi
>>
>>>
>>> Steve Ellcey
>>> sell...@cavium.com
> 
> Thanks for publishing this Francesco, it looks like the main issue for
> GCC is that the Vector Function ABI has different caller saved / callee
> saved register conventions than the standard ARM calling convention.
> 
> If I understand things correctly, in the standard calling convention
> the callee will only save the bottom 64 bits of V8-V15 and so the
> caller needs to save those registers if it is using the top half.  In
> the Vector calling convention the callee will save all 128 bits of
> these registers (and possibly more registers) so the caller does not
> have to save these registers at all, even if it is using all 128 bits
> of them.
> 
> It doesn't look like GCC has any existing mechanism for having different
> sets of caller saved/callee saved registers depending on the function
> attributes of the calling or called function.
> 
> Changing what registers a callee function saves and restores shouldn't
> be too difficult since that can be done when generating the prologue
> and epilogue code but changing what registers a caller saves/restores
> when doing the call seems trickier.  The macro
> TARGET_HARD_REGNO_CALL_PART_CLOBBERED doesn't know anything about the
> function being called.  It returns true/false depending on just the
> register number and mode.
> 
> Steve Ellcey
> sell...@cavium.com
> 


Actually, we can.  See, for example, the attribute((pcs)) for the ARM
port.  I think we could probably handle this automagically for the SVE
vector calling convention in AArch64.

R.


Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-05-16 Thread Steve Ellcey
On Tue, 2018-05-15 at 18:29 +, Francesco Petrogalli wrote:

> Hi Steve,
> 
> I am happy to let you know that the Vector Function ABI for AArch64
> is now public and available via the link at [1].
> 
> Don’t hesitate to contact me in case you have any questions.
> 
> Kind regards,
> 
> Francesco
> 
> [1] https://developer.arm.com/products/software-development-tools/hpc
> /arm-compiler-for-hpc/vector-function-abi
> 
> > 
> > Steve Ellcey
> > sell...@cavium.com

Thanks for publishing this Francesco, it looks like the main issue for
GCC is that the Vector Function ABI has different caller saved / callee
saved register conventions than the standard ARM calling convention.

If I understand things correctly, in the standard calling convention
the callee will only save the bottom 64 bits of V8-V15 and so the
caller needs to save those registers if it is using the top half.  In
the Vector calling convention the callee will save all 128 bits of
these registers (and possibly more registers) so the caller does not
have to save these registers at all, even if it is using all 128 bits
of them.

It doesn't look like GCC has any existing mechanism for having different
sets of caller saved/callee saved registers depending on the function
attributes of the calling or called function.

Changing what registers a callee function saves and restores shouldn't
be too difficult since that can be done when generating the prologue
and epilogue code but changing what registers a caller saves/restores
when doing the call seems trickier.  The macro
TARGET_HARD_REGNO_CALL_PART_CLOBBERED doesn't know anything about the
function being called.  It returns true/false depending on just the
register number and mode.

Steve Ellcey
sell...@cavium.com


Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-05-15 Thread Francesco Petrogalli

> On Feb 9, 2018, at 3:47 PM, Steve Ellcey  wrote:
> 
> […]
> I was wondering if the function vector ABI has been published yet and
> if so, where I could find it.
> 

Hi Steve,

I am happy to let you know that the Vector Function ABI for AArch64 is now 
public and available via the link at [1].

Don’t hesitate to contact me in case you have any questions.

Kind regards,

Francesco

[1] 
https://developer.arm.com/products/software-development-tools/hpc/arm-compiler-for-hpc/vector-function-abi

> Steve Ellcey
> sell...@cavium.com



Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2018-02-09 Thread Steve Ellcey
James,

This is a follow-up to https://gcc.gnu.org/ml/gcc/2017-03/msg00109.html
 where you said:

| Hi Ashwin,
| 
| Thanks for the question. ARM has defined a vector function ABI, based
| on the Vector Function ABI Specification you linked below, which
| is designed to be suitable for both the Advanced SIMD and Scalable
| Vector Extensions. There has not yet been a release of this document
| which I can point you at, nor can I give you an estimate of when the
| document will be published.

I was wondering if the function vector ABI has been published yet and
if so, where I could find it.

Steve Ellcey
sell...@cavium.com


Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2017-03-19 Thread Sekhar, Ashwin
On Friday 17 March 2017 07:31 PM, James Greenhalgh wrote:
> On Wed, Mar 15, 2017 at 09:50:18AM +, Sekhar, Ashwin wrote:
>> Hi GCC Team, Aarch64 Maintainers,
>>
>>
>> The rules in Vector Function Application Binary Interface Specification  for
>> OpenMP
>> (https://sourceware.org/glibc/wiki/libmvec?action=AttachFile=view=VectorABI.txt)
>> is used in x86 for generating the simd clones of a function.
>>
>> Is there a similar one defined for Aarch64?
>>
>> If not, would like to start a discussion on the same for Aarch64. To  kick
>> start the same, a draft proposal for Aarch64 (on the same lines as  x86 ABI)
>> is included below. The only change from x86 ABI is in the  function name
>> mangling. Here the letter 'b' is used for indicating the  ASIMD isa.
>
> Hi Ashwin,
>
> Thanks for the question. ARM has defined a vector function ABI, based
> on the Vector Function ABI Specification you linked below, which
> is designed to be suitable for both the Advanced SIMD and Scalable
> Vector Extensions. There has not yet been a release of this document
> which I can point you at, nor can I give you an estimate of when the
> document will be published.
>
> However, Francesco Petrogalli has recently made a proposal to the
> LLVM mailing list ( https://reviews.llvm.org/D30739 ) which I would
> note conflicts with your proposal in one way. You choose 'b' for name
> mangling for a vector function using Advanced SIMD, while Francesco
> uses 'n', which is the agreed character in the Vector Function ABI
> Specification we have been working on.
>
> I'd encourage you to wait for formal publication of the ARM Vector
> Function ABI to prevent any unexpected divergence between
> implementations.
Thanks for the information. We at Cavium are also working on libraries 
which require this ABI specification. So we would like to see this 
published as early as possible.

>
> Thanks,
> James
>
>
Thanks
Ashwin



Re: [Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2017-03-17 Thread James Greenhalgh
On Wed, Mar 15, 2017 at 09:50:18AM +, Sekhar, Ashwin wrote:
> Hi GCC Team, Aarch64 Maintainers,
> 
> 
> The rules in Vector Function Application Binary Interface Specification  for
> OpenMP
> (https://sourceware.org/glibc/wiki/libmvec?action=AttachFile=view=VectorABI.txt)
> is used in x86 for generating the simd clones of a function.
> 
> Is there a similar one defined for Aarch64?
> 
> If not, would like to start a discussion on the same for Aarch64. To  kick
> start the same, a draft proposal for Aarch64 (on the same lines as  x86 ABI)
> is included below. The only change from x86 ABI is in the  function name
> mangling. Here the letter 'b' is used for indicating the  ASIMD isa.

Hi Ashwin,

Thanks for the question. ARM has defined a vector function ABI, based
on the Vector Function ABI Specification you linked below, which
is designed to be suitable for both the Advanced SIMD and Scalable
Vector Extensions. There has not yet been a release of this document
which I can point you at, nor can I give you an estimate of when the
document will be published.

However, Francesco Petrogalli has recently made a proposal to the
LLVM mailing list ( https://reviews.llvm.org/D30739 ) which I would
note conflicts with your proposal in one way. You choose 'b' for name
mangling for a vector function using Advanced SIMD, while Francesco
uses 'n', which is the agreed character in the Vector Function ABI
Specification we have been working on.

I'd encourage you to wait for formal publication of the ARM Vector
Function ABI to prevent any unexpected divergence between
implementations.

Thanks,
James



[Aarch64] Vector Function Application Binary Interface Specification for OpenMP

2017-03-15 Thread Sekhar, Ashwin
Hi GCC Team, Aarch64 Maintainers,


The rules in Vector Function Application Binary Interface Specification  for 
OpenMP  
(https://sourceware.org/glibc/wiki/libmvec?action=AttachFile=view=VectorABI.txt)
  is used in x86 for generating the simd clones of a function.


Is there a similar one defined for Aarch64?


If not, would like to start a discussion on the same for Aarch64. To  kick 
start the same, a draft proposal for Aarch64 (on the same lines as  x86 ABI) is 
included below. The only change from x86 ABI is in the  function name mangling. 
Here the letter 'b' is used for indicating the  ASIMD isa.


Please review and comment.


Thanks and Regards,

Ashwin Sekhar T K



---- CUT HERE ----




 Aarch64 Vector Function Application Binary Interface Specification for OpenMP


1. Vector Function ABI Overview

The Aarch64 Vector Function ABI provides an ABI for the vector functions
generated by a compiler supporting the SIMD constructs of OpenMP 4.0 [1] on
Aarch64. It is based on the x86 Vector Function Application Binary Interface
Specification for OpenMP [2].



2. Vector Function ABI

The Vector Function ABI defines a set of rules that the caller and the callee
functions must obey.

These rules consist of:
  * Calling convention
  * Vector length (the number of concurrent scalar invocations to be processed
    per invocation of the vector function)
  * Mapping from element data types to vector data types
  * Ordering of vector arguments
  * Vector function masking
  * Vector function name mangling
  * Compiler-generated variants of vector functions



2.1. Calling Convention

The vector functions should use the calling convention described in the
Procedure Call Standard for the ARM 64-bit Architecture (AArch64) [3].



2.2. Vector Length

Every vector variant of a SIMD-enabled function has a vector length (VLEN). If
the OpenMP clause "simdlen" is used, the VLEN is the value of the argument of
that clause. The VLEN value must be a power of 2. If "simdlen" is not given, the
notion of the function's "characteristic data type" (CDT) is used to compute
the vector length.

The CDT is determined by applying the following rules in order:
  a) For a non-void function, the CDT is the return type.
  b) If the function has any non-uniform, non-linear parameters, then the CDT
     is the type of the first such parameter.
  c) If the CDT determined by a) or b) above is a struct, union, or class type
     which is passed by value (except for a type that maps to the built-in
     complex data type), the characteristic data type is int.
  d) If none of the above three cases is applicable, the CDT is int.

VLEN = sizeof(vector_register) / sizeof(CDT)

For example, if the ISA is ASIMD, sizeof(vector_register) = 16, as the vector
registers are 128 bits wide. If the CDT of the function is "int",
sizeof(CDT) = 4, so VLEN = 16 / 4 = 4.
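
The default VLEN computation can be sketched in C as follows. This is only an
illustration of the formula above, not part of the proposal: the helper name
default_vlen and the macro ASIMD_VECTOR_REGISTER_BYTES are hypothetical, and the
sketch assumes 128-bit ASIMD registers and a CDT no wider than one register.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: ASIMD (Advanced SIMD) vector registers are 128 bits = 16 bytes. */
#define ASIMD_VECTOR_REGISTER_BYTES 16u

/* Hypothetical helper, not defined by any ABI document:
   VLEN = sizeof(vector_register) / sizeof(CDT). */
static inline size_t default_vlen(size_t cdt_size_bytes)
{
    /* The CDT must be non-empty and fit in one vector register. */
    assert(cdt_size_bytes > 0 && cdt_size_bytes <= ASIMD_VECTOR_REGISTER_BYTES);
    return ASIMD_VECTOR_REGISTER_BYTES / cdt_size_bytes;
}
```

Under these assumptions, a 4-byte CDT such as "int" gives default_vlen(4) = 4,
matching the example above, and an 8-byte CDT such as "double" gives 2.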



2.3. Element Data Type to Vector Data Type Mapping

The vector data types for parameters are selected depending on ISA, vector
length, data type of original parameter, and parameter specification.

For uniform and linear parameters (a detailed description can be found in [1]),
the original data type is preserved.

For vector parameters, vector data types are selected by the compiler. The
mapping from element data type to vector data type is described below.

  * The bit size of vector data type of parameter is computed as:

    size_of_vector_data_type = VLEN * sizeof(original_parameter_data_type) * 8

    For instance, for the ASIMD version of a vector function with parameter data
    type "int": if VLEN = 4, size_of_vector_data_type = 4 * 4 * 8 = 128 (bits),
    which means one argument of type __m128 is to be passed.

  * If the size_of_vector_data_type is greater than the width of the vector
    register, multiple vector registers are selected and the parameter will be
    passed in multiple vector registers.

    For instance, for the ASIMD version of a vector function with parameter data
    type "int":

    If VLEN = 8, size_of_vector_data_type = 8 * 4 * 8 = 256 (bits), so the
    vector data type is __m256, which means 2 arguments of type __m128 are to
    be passed.
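
The two mapping rules above can be sketched in C as a size computation followed
by a register count. The function and macro names here are hypothetical and
chosen only for illustration. The sketch assumes 128-bit ASIMD registers, and
it rounds up so that a vector data type narrower than one register still
occupies one register; that rounding is an assumption, since the text above
only covers sizes equal to or greater than the register width.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: ASIMD vector registers are 128 bits wide. */
#define ASIMD_VECTOR_REGISTER_BITS 128u

/* size_of_vector_data_type = VLEN * sizeof(original_parameter_data_type) * 8 */
static inline size_t vector_data_type_bits(size_t vlen, size_t elem_size_bytes)
{
    return vlen * elem_size_bytes * 8;
}

/* Hypothetical helper: number of vector registers used to pass one vector
   parameter, rounding up to a whole register (an assumption of this sketch). */
static inline size_t vector_registers_needed(size_t vlen, size_t elem_size_bytes)
{
    size_t bits = vector_data_type_bits(vlen, elem_size_bytes);
    assert(bits > 0);
    return (bits + ASIMD_VECTOR_REGISTER_BITS - 1) / ASIMD_VECTOR_REGISTER_BITS;
}
```

For a 4-byte "int" parameter, VLEN = 4 yields 128 bits in one register, and
VLEN = 8 yields 256 bits split across two registers, as in the examples above.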



2.4. Ordering of Vector Arguments

  * When a parameter in the original data type results in one argument in the
    vector function, the ordering rule is a simple one to one match with the
    original argument order.
    
    For example, when the