On Wed, 2010-06-16 at 08:09 -0700, Andrew Pinski wrote:
> 
> Sent from my iPhone
> 
> On Jun 16, 2010, at 6:04 AM, Richard Guenther <richard.guent...@gmail.com> wrote:
> 
> > On Wed, Jun 16, 2010 at 5:52 PM, Siarhei Siamashka
> > <siarhei.siamas...@gmail.com> wrote:
> >> Hello,
> >>
> >> Currently gcc (at least version 4.5.0) does a very poor job generating
> >> single precision floating point code for ARM Cortex-A8.
> >>
> >> The source of this problem is the use of VFP instructions which are run
> >> on a slow nonpipelined VFP Lite unit in Cortex-A8. Even turning on RunFast
> >> mode (flush denormals to zero, disable exceptions) just provides a
> >> relatively minor performance gain.
> >>
> >> The right solution seems to be the use of NEON instructions for doing
> >> most of the single precision calculations.
> >>
> >> I wonder if it would be difficult to introduce the following changes to
> >> the gcc generated code when optimizing for cortex-a8:
> >> 1. Allocate single precision variables only to evenly or oddly numbered
> >> s-registers.
> >> 2. Instead of using 'fadds s0, s0, s2' or similar instructions, do
> >> 'vadd.f32 d0, d0, d1' instead.
> >>
> >> The number of single precision floating point registers gets effectively
> >> halved this way. Supporting '-mfloat-abi=hard' may be a bit tricky
> >> (packing/unpacking of register pairs may be needed to ensure proper
> >> parameters passing to functions). Also there may be other problems, like
> >> dealing with strict IEEE-754 compliance (maybe a special variable
> >> attribute for relaxing compliance requirements could be useful). But this
> >> looks like the only solution to fix poor performance on ARM Cortex-A8
> >> processor.
> >>
> >> Actually clang 2.7 seems to be working exactly this way. And it is
> >> outperforming gcc 4.5.0 by up to a factor of 2 or 3 on some single
> >> precision floating point tests that I tried on ARM Cortex-A8.
> >
> > On i?86 we have -mfpmath={sse,x87}, I suppose you could add
> > -mfpmath=neon for arm (properly conflicting with -mfloat-abi=hard
> > and requiring neon support).
> 
> Except that, unlike SSE, NEON does not fully support IEEE 754. So this
> should only be done with -ffast-math :). Being slow is not a good enough
> reason to change it into something that is wrong and fast.
> 

We could document -mfpmath=neon as implying fast-math (or at least, no
denormals and default NaNs).  If the user explicitly asks for floating
point to be done via the Neon unit, it would be somewhat churlish to say
"I don't believe you know what you're asking for... so I'll ignore you".

This might be a better approach than just using Neon automatically
whenever -ffast-math is given.  On some cores a mix of double and
single-precision code might run quite slowly if single precision was
done in Neon and double precision on the VFP unit.
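
To make the "no denormals" caveat above concrete, here is a minimal
host-side sketch in C of how results can differ once subnormals are
flushed to zero. It is only a model, not Neon code: the helper names
(flush_subnormal, ftz_addf, ieee_addf) are hypothetical, and it assumes
that both the inputs and the result of an operation are flushed and that
the host itself runs with gradual underflow enabled.

```c
#include <float.h>
#include <math.h>

/* Hypothetical helper: replace a subnormal value with a zero of the
   same sign, leaving every other value untouched. */
static float flush_subnormal(float x)
{
    if (fpclassify(x) == FP_SUBNORMAL)
        return signbit(x) ? -0.0f : 0.0f;
    return x;
}

/* Hypothetical model of a flush-to-zero add (assumption: inputs and
   result are both flushed). */
static float ftz_addf(float a, float b)
{
    /* volatile forces the sum to be rounded to single precision */
    volatile float sum = flush_subnormal(a) + flush_subnormal(b);
    return flush_subnormal(sum);
}

/* Plain IEEE 754 add with gradual underflow, for comparison. */
static float ieee_addf(float a, float b)
{
    volatile float sum = a + b;
    return sum;
}
```

With these helpers, ieee_addf(FLT_MIN * 1.5f, -FLT_MIN) yields a nonzero
subnormal, while ftz_addf on the same operands yields zero, so two
distinct values can have a difference of exactly zero. That is the kind
of behaviour a user would be opting into.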

R.

