[Bug target/79581] VFP4 slower than VFP3 in C-ray on Cortex A5

2017-06-15 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79581

--- Comment #8 from PeteVine  ---
I've just confirmed the result on a newer Linux distribution (Ubuntu 16.04) and
the difference between VFPv3 and v4 is clearly there (2330 vs 2560) using gcc
5.4. 

Unless the CPU itself is affected by an erratum, that probably leaves
suboptimal codegen as the main suspect.

[Bug target/79581] VFP4 slower than VFP3 in C-ray on Cortex A5

2017-06-15 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79581

--- Comment #7 from PeteVine  ---
Thanks, I promise to test any patches without delay :)

[Bug target/79581] VFP4 slower than VFP3 in C-ray on Cortex A5

2017-06-15 Thread ramana at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79581

Ramana Radhakrishnan  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-06-15
                 CC|                            |ramana at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #6 from Ramana Radhakrishnan  ---
(In reply to PeteVine from comment #4)
> > Judging by your -mcpu option is this on a Cortex-A5?
> 
> Yes, if you look at the results on a Cortex A53 running armv7 code, it
> doesn't reproduce either, and A5-codegen is king :) (hopefully due to
> in-order design or sth)
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53659#c12
> 
> A quick question regarding -mcpu=cortex-a5 codegen; is there a similar
> switch to llvm's `-slowfpvmlx` feature? (disable slow vmla/vmls), which the
> nice ARM guy divulged here:
> 
> https://bugs.llvm.org//show_bug.cgi?id=26135#c9  
> 
> or is it a non-issue in gcc?

No, there isn't a similar switch to that in GCC that I'm aware of.

It's also not yet clear what causes the slowdown - the only codegen difference
found so far is the one in kyrill's analysis in comment #3.

I'm going to confirm this but I don't have Cortex-A5 hardware to investigate /
play with this any further.

[Bug target/79581] VFP4 slower than VFP3 in C-ray on Cortex A5

2017-05-01 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79581

--- Comment #5 from PeteVine  ---
Unchanged in gcc version 8.0.0 20170501.

[Bug target/79581] VFP4 slower than VFP3 in C-ray

2017-02-20 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79581

--- Comment #4 from PeteVine  ---
> Judging by your -mcpu option is this on a Cortex-A5?

Yes, if you look at the results on a Cortex A53 running armv7 code, it doesn't
reproduce either, and A5-codegen is king :) (hopefully due to in-order design
or sth)

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53659#c12

A quick question regarding -mcpu=cortex-a5 codegen; is there a similar switch
to llvm's `-slowfpvmlx` feature? (disable slow vmla/vmls), which the nice ARM
guy divulged here:

https://bugs.llvm.org//show_bug.cgi?id=26135#c9  

or is it a non-issue in gcc?

[Bug target/79581] VFP4 slower than VFP3 in C-ray

2017-02-20 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79581

--- Comment #3 from ktkachov at gcc dot gnu.org ---
I can't reproduce the difference on my machine. 
Judging by your -mcpu option is this on a Cortex-A5?

As far as codegen goes the major difference I can see is that the vfpv4 version
generates vfma instructions instead of vmla ones.

Also, there are cases where the vfpv3 version generates multiple vmls
instructions whereas the vfpv4 one generates an explicit vneg followed by
vfma instructions.
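
To make the comparison above easier to repeat, here is a minimal sketch of the
multiply-accumulate shapes in question. It is a hypothetical reduction, not
code taken from C-ray, and the function names (mul_add, mul_sub, chain_sub) are
made up for illustration. Which instructions actually come out depends on the
GCC version and options, so treat the comments as expectations to check, not
as guaranteed output.

```c
/* Hypothetical reduction of the multiply-accumulate patterns discussed
   in this report; none of this is taken from the C-ray sources. */

/* acc + a * b: a candidate for vmla with VFPv3, or vfma with VFPv4. */
double mul_add(double acc, double a, double b)
{
    return acc + a * b;
}

/* acc - a * b: a candidate for vmls with VFPv3; with VFPv4 the compiler
   may pick a fused form, possibly with a separate vneg. */
double mul_sub(double acc, double a, double b)
{
    return acc - a * b;
}

/* Several products subtracted from one accumulator, the kind of
   expression where a run of vmls instructions could appear. */
double chain_sub(double acc, const double *a, const double *b)
{
    return acc - a[0] * b[0] - a[1] * b[1] - a[2] * b[2];
}
```

Compiling this file to assembly twice, once with -mfpu=vfpv3 and once with
-mfpu=vfpv4 (for example with -O2 -mcpu=cortex-a5 -S, plus whatever float ABI
the toolchain needs), and diffing the two outputs should show whether the
vmla/vmls versus vfma/vneg split is reproduced. The exact options used in the
original report are not quoted in this thread, so the ones here are an
assumption.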

[Bug target/79581] VFP4 slower than VFP3 in C-ray

2017-02-18 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79581

--- Comment #2 from PeteVine  ---
Created attachment 40769
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40769&action=edit
sphract

The other file required to run the benchmark straight from bugzilla! :)

[Bug target/79581] VFP4 slower than VFP3 in C-ray

2017-02-17 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79581

PeteVine  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |armv7

--- Comment #1 from PeteVine  ---
Distilled from PR79105 as a separate issue, not related to NEON and
autovectorization.