Re: Performance of floating point instructions

2010-03-10 Thread Sivan Greenberg
Hi Alberto!

On Wed, Mar 10, 2010 at 9:55 AM, Alberto Mardegan 
ma...@users.sourceforge.net wrote:

 Hi all,
  in maemo-mapper I have a lot of code involved in doing transformations
 from latitude/longitude to Mercator coordinates (used in google maps, for
 example), calculation of distances, etc.

 I'm trying to use integer arithmetics as much as possible, but sometimes
 it's a bit impractical, and I wonder if it's really worth the trouble.

 Does one have any figure about how the performance of the FPU is, compared
 to integer operations?

 A practical question: should I use this way of computing the square root:


 http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Binary_numeral_system_.28base_2.29

 (but operating on 32 or even 64 bits), or would I be better using sqrtf()
 or sqrt()?


 Does anyone know any tricks to optimize certain operations on arrays of
 data?



Basically, what we did with ThinX OS is have a full-blown soft-float
toolchain, which then used GCC's already proven and highly optimized
stack floating point operations. However, Maemo is not soft-float, so I'd
recommend experimenting with rebuilding Mapper using such a soft-float
enabled toolchain, statically linked to avoid clashes with the system's libc,
or with a separate LD_LIBRARY_PATH to avoid memory hogging, and see where it
gets you.

IMHO this is the best way to do FP optimization. We have experimented with
it a lot, including sqrtf and friends, with no significant improvement.

Sivan
___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Ove Kaaven
Alberto Mardegan skrev:
 Does anyone know any tricks to optimize certain operations on arrays of
 data?

The answer to that is, obviously, to use the Cortex-A-series SIMD
engine, NEON.

Supposedly you may be able to make gcc generate NEON instructions with
-mfpu=neon -ffast-math -ftree-vectorize (and perhaps -mfloat-abi=softfp,
but that's the default in the Fremantle SDK anyway), but it's still not
very good at it, so writing the asm by hand is still better... and I'm
not sure if it can automatically vectorize library calls like sqrt.
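
For what it's worth, the kind of loop those flags have a chance of
vectorizing is a plain element-wise one, as in the sketch below (a
hypothetical example, not maemo-mapper code; inspecting the generated
assembly is the only way to know whether NEON instructions actually
came out):

/* Element-wise multiply of two float arrays.  Built with
 * -mfpu=neon -ffast-math -ftree-vectorize, gcc *may* turn this loop
 * into NEON vmul.f32 code; check the asm to verify. */
void mul_arrays(float *dst, const float *a, const float *b, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = a[i] * b[i];
}
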
___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


RE: Performance of floating point instructions

2010-03-10 Thread Simon Pickering

   in maemo-mapper I have a lot of code involved in doing 
  transformations from latitude/longitude to Mercator 
  coordinates (used in google maps, for example), calculation 
  of distances, etc.
  
  I'm trying to use integer arithmetics as much as 
  possible, but sometimes it's a bit impractical, and I wonder 
  if it's really worth the trouble.

Is the code slow at the moment and is it specifically the fp stuff that's
slowing it down? If not, I'd say it's probably not worth the effort unless
you're doing this for fun/out of interest.

  Does one have any figure about how the performance of 
  the FPU is, compared to integer operations?
  
  A practical question: should I use this way of 
  computing the square root:
  
  http://en.wikipedia.org/wiki/Methods_of_computing_square_roots
  #Binary_numeral_system_.28base_2.29
  
  (but operating on 32 or even 64 bits), or would I be 
  better using sqrtf() or sqrt()?

I'd suggest writing some benchmark code for the functions you wish to
compare.
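
For instance, a throwaway harness along these lines would do (isqrt32 is
the bit-by-bit method from the Wikipedia page Alberto linked; the
iteration count and names are just for illustration; build with something
like "gcc -O2 bench.c -lm"):

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

/* Bit-by-bit integer square root (the binary method from the Wikipedia page). */
static uint32_t isqrt32(uint32_t x)
{
    uint32_t res = 0;
    uint32_t bit = 1u << 30;   /* highest power of four representable in 32 bits */

    while (bit > x)
        bit >>= 2;
    while (bit != 0) {
        if (x >= res + bit) {
            x -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return res;
}

static double now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main(void)
{
    uint32_t i, n = 2000000;        /* arbitrary iteration count */
    volatile uint32_t sink_i = 0;   /* volatile keeps the loops from being optimized away */
    volatile float sink_f = 0.0f;
    double t;

    t = now_ms();
    for (i = 1; i < n; i++)
        sink_i += isqrt32(i);
    printf("integer sqrt: %.1f ms\n", now_ms() - t);

    t = now_ms();
    for (i = 1; i < n; i++)
        sink_f += sqrtf((float)i);
    printf("sqrtf       : %.1f ms\n", now_ms() - t);

    return 0;
}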

  Does anyone know any tricks to optimize certain 
  operations on arrays of data?

There are SIMD extensions
(http://www.arm.com/products/processors/technologies/dsp-simd.php).

 Basically, what we did with ThinX OS, is have a full blown 
 soft-float toolchain which then used the already proven and 
 highly optimized GCC's stack floating point operations. 
 However , Maemo is not soft float, so I'd recommend to 
 experiment with rebuilding Mapper using such a soft float 
 enabled toolchain, statically linked to avoid glitches to 
 system's libc or have a separate LD_LIBRARY_PATH to avoid 
 memory hogging, and see where it gets you.

Soft-float is significantly slower than using the VFP hard-float (using
the -mfpu etc. flags on GCC, on the N900 and the N8x0 for that matter);
there should be emails containing benchmarks on the list from a long while
back, otherwise I can dig them out again. But Alberto's situation is slightly
different: his integer-only code need not deal with arbitrary fp numbers
(as the soft-float code must), since he knows what his inputs' ranges
will be. He should therefore be able to write more efficient and specialised
fixed-point integer functions that avoid conversion to and from fp form and
trim significant figures to the minimum he requires, as sketched below.
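
For illustration, such a fixed-point routine might have roughly this shape
(a sketch only; the 16.16 format, the names and the squared-distance
shortcut are assumptions made for the example, not code from maemo-mapper):

#include <stdint.h>

typedef int32_t fix16;                  /* 16.16 fixed point */
#define FIX16_ONE (1 << 16)

static inline fix16 fix16_from_int(int x) { return (fix16)x << 16; }

static inline fix16 fix16_mul(fix16 a, fix16 b)
{
    /* widen to 64 bits so the intermediate product cannot overflow */
    return (fix16)(((int64_t)a * b) >> 16);
}

/* Squared distance between two points already expressed in fixed point.
 * Keeping the result squared avoids the square root entirely whenever
 * only distance comparisons are needed. */
static inline fix16 fix16_dist2(fix16 x0, fix16 y0, fix16 x1, fix16 y1)
{
    fix16 dx = x1 - x0, dy = y1 - y0;
    return fix16_mul(dx, dx) + fix16_mul(dy, dy);
}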

Cheers,


Simon

___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Laurent Desnogues
On Wed, Mar 10, 2010 at 10:46 AM, Ove Kaaven o...@arcticnet.no wrote:
 Alberto Mardegan skrev:
 Does anyone know any tricks to optimize certain operations on arrays of
 data?

 The answer to that is, obviously, to use the Cortex-A-series SIMD
 engine, NEON.

 Supposedly you may be able to make gcc generate NEON instructions with
 -mfpu=neon -ffast-math -ftree-vectorize (and perhaps -mfloat-abi=softfp,
 but that's the default in the Fremantle SDK anyway), but it's still not
 very good at it, so writing the asm by hand is still better... and I'm
 not sure if it can automatically vectorize library calls like sqrt.

One has to be careful with that approach: a Cortex-A9 SoC won't
necessarily come with a NEON SIMD unit, as it's optional. So it'd
be better to also include code that doesn't assume one has a
NEON unit.
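
One common way to do that at run time on Linux is to look for "neon" in
the Features line of /proc/cpuinfo before selecting the NEON code path.
A rough sketch (the function name is made up for illustration):

#include <stdio.h>
#include <string.h>

/* Returns 1 if /proc/cpuinfo advertises NEON, 0 otherwise, so a NEON
 * code path can fall back to plain VFP/integer code on CPUs without
 * the unit. */
static int cpu_has_neon(void)
{
    char line[512];
    int found = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");

    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "Features", 8) == 0 && strstr(line, " neon")) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}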


Laurent
___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Marcin Juszkiewicz
On Wednesday, 10 March 2010 at 11:14:14, Laurent Desnogues wrote:

 One has to be careful with that approach:  Cortex-A9 SoC won't
 necessarily come with a NEON SIMD unit, as it's optional.  So it'd
 be better to also include code that doesn't assume one has a
 NEON unit.

Or if someone tries to run a new version of maemo-mapper on an N8x0, for example.

Regards, 
-- 
JID:  h...@jabber.org
Website:  http://marcin.juszkiewicz.com.pl/
LinkedIn: http://www.linkedin.com/in/marcinjuszkiewicz


___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Kimmo Hämäläinen
On Wed, 2010-03-10 at 10:46 +0100, ext Ove Kaaven wrote:
 Alberto Mardegan skrev:
  Does anyone know any tricks to optimize certain operations on arrays of
  data?
 
 The answer to that is, obviously, to use the Cortex-A-series SIMD
 engine, NEON.
 
 Supposedly you may be able to make gcc generate NEON instructions with
 -mfpu=neon -ffast-math -ftree-vectorize (and perhaps -mfloat-abi=softfp,
 but that's the default in the Fremantle SDK anyway), but it's still not
 very good at it, so writing the asm by hand is still better... and I'm
 not sure if it can automatically vectorize library calls like sqrt.

You can also put the CPU to a fast floats mode, see hd_fpu_set_mode()
in
http://maemo.gitorious.org/fremantle-hildon-desktop/hildon-desktop/blobs/master/src/main.c

N900 has support for NEON instructions also.
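
For reference, the mode switch essentially amounts to setting the
flush-to-zero and default-NaN bits in FPSCR (the so-called RunFast mode).
A minimal sketch of that, mirroring the FPSCR-setting asm that appears in
the test program later in this thread rather than the actual
hd_fpu_set_mode() implementation:

/* Enable "fast" (RunFast) floating point mode on Cortex-A8:
 * flush-to-zero and default NaN.  This is an illustration of the idea,
 * not necessarily what hd_fpu_set_mode() does internally. */
static void enable_runfast(void)
{
    int tmp;
    __asm__ volatile(
        "fmrx   %[tmp], fpscr\n"
        "orr    %[tmp], %[tmp], #(1 << 24)\n"  /* FZ: flush-to-zero */
        "orr    %[tmp], %[tmp], #(1 << 25)\n"  /* DN: default NaN */
        "fmxr   fpscr, %[tmp]\n"
        : [tmp] "=&r" (tmp));
}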

-Kimmo


___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Alberto Mardegan

Kimmo Hämäläinen wrote:

You can also put the CPU to a fast floats mode, see hd_fpu_set_mode()
in
http://maemo.gitorious.org/fremantle-hildon-desktop/hildon-desktop/blobs/master/src/main.c

N900 has support for NEON instructions also.


This sounds interesting!

Is there any performance penalty if this switch is done often?

Ciao,
  Alberto

--
http://www.mardy.it -- geek in un lingua international!
___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Kimmo Hämäläinen
On Wed, 2010-03-10 at 12:57 +0100, ext Alberto Mardegan wrote:
 Kimmo Hämäläinen wrote:
  You can also put the CPU to a fast floats mode, see hd_fpu_set_mode()
  in
  http://maemo.gitorious.org/fremantle-hildon-desktop/hildon-desktop/blobs/master/src/main.c
  
  N900 has support for NEON instructions also.
 
 This sounds interesting!
 
 Is there any performance penalty if this switch is done often?

IIRC, there was not. Leonid Moiseichuk was testing this about a year
ago, and he noticed almost 50% speed-up for floats. Notice that this
affects only floats, not doubles, and that there is a small accuracy
penalty.

-Kimmo


___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Eero Tamminen

Hi,

ext Alberto Mardegan wrote:

Kimmo Hämäläinen wrote:

You can also put the CPU to a fast floats mode, see hd_fpu_set_mode()
in
http://maemo.gitorious.org/fremantle-hildon-desktop/hildon-desktop/blobs/master/src/main.c

N900 has support for NEON instructions also.


This sounds interesting!

Is there any performance penalty if this switch is done often?


Why would you switch it off?

Operations on fast floats aren't IEEE compatible, but as far as
I've understood, they should differ only for numbers that are very close
to zero, close enough that repeating your algorithm a few more times would
produce a divide by zero even with IEEE semantics (i.e. if fast float
causes you issues, it indicates that there's most likely some issue
in your algorithm).
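
A small illustration of the kind of value that differs (illustrative
numbers only; run it with and without the fast mode to see the
flush-to-zero effect):

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* 1e-42f is below FLT_MIN, i.e. a denormal.  With flush-to-zero
     * enabled, arithmetic on it yields exactly 0.0f, whereas IEEE mode
     * keeps a (tiny) denormal result. */
    volatile float tiny = 1e-42f;
    printf("FLT_MIN = %g, tiny = %g, tiny*2 = %g\n",
           (double)FLT_MIN, (double)tiny, (double)(tiny * 2.0f));
    return 0;
}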


- Eero
___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Eero Tamminen

Hi,

Hamalainen Kimmo (Nokia-D/Helsinki) wrote:

On Wed, 2010-03-10 at 12:57 +0100, ext Alberto Mardegan wrote:

Kimmo Hämäläinen wrote:

You can also put the CPU to a fast floats mode, see hd_fpu_set_mode()
in
http://maemo.gitorious.org/fremantle-hildon-desktop/hildon-desktop/blobs/master/src/main.c


Not the libosso osso_fpu_set_mode() function?


- Eero
___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Alberto Mardegan

Eero Tamminen wrote:

Hamalainen Kimmo (Nokia-D/Helsinki) wrote:

On Wed, 2010-03-10 at 12:57 +0100, ext Alberto Mardegan wrote:

Kimmo Hämäläinen wrote:

You can also put the CPU to a fast floats mode, see hd_fpu_set_mode()
in
http://maemo.gitorious.org/fremantle-hildon-desktop/hildon-desktop/blobs/master/src/main.c 



Not the libosso osso_fpu_set_mode() function?


I can't find this in libosso.h. :-(
I'll copy Kimmo's code.


--
http://www.mardy.it - geek in un lingua international!
___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Alberto Mardegan

Eero Tamminen wrote:

Is there any performance penalty if this switch is done often?


Why would you switch it off?

Operations on fast floats aren't IEEE compatible, but as far as
I've understood, they should differ only for numbers that are very close
to zero, close enough that repeating your algorithm a few more times would
produce a divide by zero even with IEEE semantics (i.e. if fast float
causes you issues, it indicates that there's most likely some issue
in your algorithm).


Ok, I thought the precision loss would be more noticeable, but as we are 
talking about latitude and longitude (and anyway the GPS accuracy is not 
so great), I guess I don't have any need to turn it off.


Anyway, I'm doing some benchmarks, I'll post the results soon.


--
http://www.mardy.it - geek in un lingua international!
___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Alberto Mardegan

Alberto Mardegan wrote:
Does one have any figure about how the performance of the FPU is, 
compared to integer operations?


I added some profiling to the code, and I measured the time spent by a
function which is operating on an array of points (whose coordinates are
integers), transforming each of them into geographic coordinates
(latitude and longitude, floating point) and calculating the distance
from the previous point.


http://vcs.maemo.org/git?p=maemo-mapper;a=shortlog;h=refs/heads/gps_control
map_path_calculate_distances() is in path.c,
calculate_distance() is in utils.c,
unit2latlon() is a pointer to unit2latlon_google() in tile_source.c


The output (application compiled with -O0):


double:

map_path_calculate_distances: 110 ms for 8250 points
map_path_calculate_distances: 5 ms for 430 points

map_path_calculate_distances: 109 ms for 8250 points
map_path_calculate_distances: 5 ms for 430 points


float:

map_path_calculate_distances: 60 ms for 8250 points
map_path_calculate_distances: 3 ms for 430 points

map_path_calculate_distances: 60 ms for 8250 points
map_path_calculate_distances: 3 ms for 430 points


float with fast FPU mode:

map_path_calculate_distances: 50 ms for 8250 points
map_path_calculate_distances: 2 ms for 430 points

map_path_calculate_distances: 50 ms for 8250 points
map_path_calculate_distances: 2 ms for 430 points


So, it seems that there's a huge improvement when switching from
doubles to floats; although I wonder if it's because of the FPU or just
because the amount of data passed around is smaller.
On the other hand, the improvement obtained by enabling the fast FPU
mode is rather small -- but that might be due to the fact that the FPU
operations are not a major player in this piece of code.


One curious thing is that while making these changes, I forgot to change
the math functions to their float versions, so that instead of using:


float x, y;
x = sinf(y);

I was using:

float x, y;
x = sin(y);

The timings obtained this way are surprisingly (at least to me) bad:

map_path_calculate_distances: 552 ms for 8250 points
map_path_calculate_distances: 92 ms for 430 points

map_path_calculate_distances: 552 ms for 8250 points
map_path_calculate_distances: 91 ms for 430 points

Much worse than the double version. The only reason I can think of is
the conversion from float to double and vice versa, but is it really
that expensive?


Anyway, I'll stick to using 32bit floats. :-)

--
http://www.mardy.it - geek in un lingua international!
___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Bernd Stramm
On Wed, 2010-03-10 at 20:29 +0200, Alberto Mardegan wrote:
 Alberto Mardegan wrote:
  Does one have any figure about how the performance of the FPU is, 
  compared to integer operations?
 
 I added some profiling to the code, and I measured the time spent by a 
 function which is operating on an array of points (whose coordinates are 
 integers) and transforming each of them into geographic coordinates 
 (latitude and longitude, floating point) and calculating the distance 
 from the previous point.
 
 http://vcs.maemo.org/git?p=maemo-mapper;a=shortlog;h=refs/heads/gps_control
 map_path_calculate_distances() is in path.c,
 calculate_distance() is in utils.c,
 unit2latlon() is a pointer to unit2latlon_google() in tile_source.c
 
 
 The output (application compiled with -O0):
 
 
 double:
 
 map_path_calculate_distances: 110 ms for 8250 points
 map_path_calculate_distances: 5 ms for 430 points
 
 map_path_calculate_distances: 109 ms for 8250 points
 map_path_calculate_distances: 5 ms for 430 points
 
 
 float:
 
 map_path_calculate_distances: 60 ms for 8250 points
 map_path_calculate_distances: 3 ms for 430 points
 
 map_path_calculate_distances: 60 ms for 8250 points
 map_path_calculate_distances: 3 ms for 430 points
 
 
 float with fast FPU mode:
 
 map_path_calculate_distances: 50 ms for 8250 points
 map_path_calculate_distances: 2 ms for 430 points
 
 map_path_calculate_distances: 50 ms for 8250 points
 map_path_calculate_distances: 2 ms for 430 points
 
 
 So, it seems that there's a huge improvements when switching from 
 doubles to floats; although I wonder if it's because of the FPU or just 
 because the amount of data passed around is smaller.

Right, is your experiment actually measuring floating point performance,
or is that swamped out by memory accesses, or some bus transfers or
something like that?

 On the other hand, the improvements obtained by enabling the fast FPU 
 mode is rather small -- but that might be due to the fact that the FPU 
 operations are not a major player in this piece of code.
 
 One curious thing is that while making these changes, I forgot to change 
 the math functions to their float versions, so that instead of using:
 
 float x, y;
 x = sinf(y);
 
 I was using:
 
 float x, y;
 x = sin(y);
 
 The timings obtained this way are surprisingly (at least to me) bad:
 
 map_path_calculate_distances: 552 ms for 8250 points
 map_path_calculate_distances: 92 ms for 430 points
 
 map_path_calculate_distances: 552 ms for 8250 points
 map_path_calculate_distances: 91 ms for 430 points
 
 Much worse than the double version. The only reason I can think of, is 
 the conversion from float to double and vice versa, but is it really 
 that expensive?
 
 Anyway, I'll stick to using 32bit floats. :-)
 

It is often hard to tell how much difference optimizing a particular
operation makes. If the setup is cheaper for the slower operation, do
you gain anything by using faster ops? Hard to measure sometimes.

Like racing, it's not how fast you go, it's when you get there.

Bernd



___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Laurent Desnogues
On Wed, Mar 10, 2010 at 7:29 PM, Alberto Mardegan
ma...@users.sourceforge.net wrote:
 Alberto Mardegan wrote:

 Does one have any figure about how the performance of the FPU is, compared
 to integer operations?

 I added some profiling to the code, and I measured the time spent by a
 function which is operating on an array of points (whose coordinates are
 integers) and transforming each of them into geographic coordinates
 (latitude and longitude, floating point) and calculating the distance from
 the previous point.

 http://vcs.maemo.org/git?p=maemo-mapper;a=shortlog;h=refs/heads/gps_control
 map_path_calculate_distances() is in path.c,
 calculate_distance() is in utils.c,
 unit2latlon() is a pointer to unit2latlon_google() in tile_source.c


 The output (application compiled with -O0):


 double:

 map_path_calculate_distances: 110 ms for 8250 points
 map_path_calculate_distances: 5 ms for 430 points

 map_path_calculate_distances: 109 ms for 8250 points
 map_path_calculate_distances: 5 ms for 430 points


 float:

 map_path_calculate_distances: 60 ms for 8250 points
 map_path_calculate_distances: 3 ms for 430 points

 map_path_calculate_distances: 60 ms for 8250 points
 map_path_calculate_distances: 3 ms for 430 points


 float with fast FPU mode:

 map_path_calculate_distances: 50 ms for 8250 points
 map_path_calculate_distances: 2 ms for 430 points

 map_path_calculate_distances: 50 ms for 8250 points
 map_path_calculate_distances: 2 ms for 430 points


 So, it seems that there's a huge improvements when switching from doubles to
 floats; although I wonder if it's because of the FPU or just because the
 amount of data passed around is smaller.
 On the other hand, the improvements obtained by enabling the fast FPU mode
 is rather small -- but that might be due to the fact that the FPU operations
 are not a major player in this piece of code.

The fast mode only gains 1 or 2 cycles per FP instruction.
The FPU on Cortex-A8 is not pipelined and the fast mode
can't change that :-)

 One curious thing is that while making these changes, I forgot to change the
 math functions to their float versions, so that instead of using:

 float x, y;
 x = sinf(y);

 I was using:

 float x, y;
 x = sin(y);

 The timings obtained this way are surprisingly (at least to me) bad:

 map_path_calculate_distances: 552 ms for 8250 points
 map_path_calculate_distances: 92 ms for 430 points

 map_path_calculate_distances: 552 ms for 8250 points
 map_path_calculate_distances: 91 ms for 430 points

 Much worse than the double version. The only reason I can think of, is the
 conversion from float to double and vice versa, but is it really that
 expensive?

This looks odd given that the 2 additional instructions
take 5 and 7 cycles.

 Anyway, I'll stick to using 32bit floats. :-)

As long as it fits your needs that seems wise :)


Laurent
___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Siarhei Siamashka
On Wednesday 10 March 2010, Alberto Mardegan wrote:
 Alberto Mardegan wrote:
  Does one have any figure about how the performance of the FPU is,
  compared to integer operations?

 I added some profiling to the code, and I measured the time spent by a
 function which is operating on an array of points (whose coordinates are
 integers) and transforming each of them into geographic coordinates
 (latitude and longitude, floating point) and calculating the distance
 from the previous point.

 http://vcs.maemo.org/git?p=maemo-mapper;a=shortlog;h=refs/heads/gps_control
 map_path_calculate_distances() is in path.c,
 calculate_distance() is in utils.c,
 unit2latlon() is a pointer to unit2latlon_google() in tile_source.c


 The output (application compiled with -O0):

Using an optimized build (-O2 or -O3) may sometimes change the overall picture
quite dramatically. It makes almost no sense to benchmark -O0 code, because in
that case all the local variables are kept in memory and are read/written
before/after each operation. It's substantially different from normal code.

-- 
Best regards,
Siarhei Siamashka
___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Siarhei Siamashka
On Wednesday 10 March 2010, Laurent Desnogues wrote:
 On Wed, Mar 10, 2010 at 7:29 PM, Alberto Mardegan
  So, it seems that there's a huge improvements when switching from doubles
  to floats; although I wonder if it's because of the FPU or just because
  the amount of data passed around is smaller.
  On the other hand, the improvements obtained by enabling the fast FPU
  mode is rather small -- but that might be due to the fact that the FPU
  operations are not a major player in this piece of code.

 The fast mode only gains 1 or 2 cycles per FP instruction.
 The FPU on Cortex-A8 is not pipelined and the fast mode
 can't change that :-)

It's probably
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344j/ch16s07s01.html
vs.
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344j/BCGEIHDJ.html

I wonder why the compiler does not use real NEON instructions with -ffast-math 
option, it should be quite useful even for scalar code.

something like:

vld1.32  {d0[0]}, [r0]
vadd.f32 d0, d0, d0
vst1.32  {d0[0]}, [r0]

instead of:

flds     s0, [r0]
fadds    s0, s0, s0
fsts     s0, [r0]

for:

*float_ptr = *float_ptr + *float_ptr;

At least NEON is pipelined and should be a lot faster on more complex code
examples where it can actually benefit from pipelining. On x86, SSE2 is used
quite nicely for floating point math.

-- 
Best regards,
Siarhei Siamashka
___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Laurent Desnogues
On Wed, Mar 10, 2010 at 8:54 PM, Siarhei Siamashka
siarhei.siamas...@gmail.com wrote:
[...]
 I wonder why the compiler does not use real NEON instructions with -ffast-math
 option, it should be quite useful even for scalar code.

 something like:

 vld1.32  {d0[0]}, [r0]
 vadd.f32 d0, d0, d0
 vst1.32  {d0[0]}, [r0]

 instead of:

 flds     s0, [r0]
 fadds    s0, s0, s0
 fsts     s0, [r0]

 for:

 *float_ptr = *float_ptr + *float_ptr;

 At least NEON is pipelined and should be a lot faster on more complex code
 examples where it can actually benefit from pipelining. On x86, SSE2 is used
 quite nicely for floating point math.

Even if fast-math is known to break some rules, it only
breaks C rules IIRC.  OTOH, NEON FP has no support
for NaN and other nice things from IEEE754.

Anyway you're perhaps looking for -mfpu=neon, no?


Laurent
___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Laurent GUERBY
On Wed, 2010-03-10 at 21:54 +0200, Siarhei Siamashka wrote:
 I wonder why the compiler does not use real NEON instructions with 
 -ffast-math 
 option, it should be quite useful even for scalar code.
 
 something like:
 
 vld1.32  {d0[0]}, [r0]
 vadd.f32 d0, d0, d0
 vst1.32  {d0[0]}, [r0]
 
 instead of:
 
 flds     s0, [r0]
 fadds    s0, s0, s0
 fsts     s0, [r0]
 
 for:
 
 *float_ptr = *float_ptr + *float_ptr;
 
 At least NEON is pipelined and should be a lot faster on more complex code
 examples where it can actually benefit from pipelining. On x86, SSE2 is used
 quite nicely for floating point math.

Hi,

Please open a report on http://gcc.gnu.org/bugzilla with your test
sources and command line; at least GCC developers will notice there's
interest :).

GCC comes with some builtins for NEON; they're defined in arm_neon.h,
see below.

Sincerely,

Laurent


typedef struct float32x2x2_t
{
  float32x2_t val[2];
} float32x2x2_t;

...

__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
vpadd_f32 (float32x2_t __a, float32x2_t __b)
{
  return (float32x2_t)__builtin_neon_vpaddv2sf (__a, __b, 3);
}




___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Siarhei Siamashka
On Wednesday 10 March 2010, Laurent Desnogues wrote:
 On Wed, Mar 10, 2010 at 8:54 PM, Siarhei Siamashka
 siarhei.siamas...@gmail.com wrote:
 [...]

  I wonder why the compiler does not use real NEON instructions with
  -ffast-math option, it should be quite useful even for scalar code.
 
  something like:
 
  vld1.32  {d0[0]}, [r0]
  vadd.f32 d0, d0, d0
  vst1.32  {d0[0]}, [r0]
 
  instead of:
 
  flds     s0, [r0]
  fadds    s0, s0, s0
  fsts     s0, [r0]
 
  for:
 
  *float_ptr = *float_ptr + *float_ptr;
 
  At least NEON is pipelined and should be a lot faster on more complex
  code examples where it can actually benefit from pipelining. On x86, SSE2
  is used quite nicely for floating point math.

 Even if fast-math is known to break some rules, it only
 breaks C rules IIRC. 

If that's the case, some other option would be handy. Or even a new custom
data type like float_neon (or any other name). It is probably even possible
with C++ and operator overloading.

 OTOH, NEON FP has no support 
 for NaN and other nice things from IEEE754.

 Anyway you're perhaps looking for -mfpu=neon, no?

I lost my faith in gcc long ago :) So I'm not really looking for anything.

-- 
Best regards,
Siarhei Siamashka
___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Siarhei Siamashka
On Wednesday 10 March 2010, Laurent Desnogues wrote:
 Even if fast-math is known to break some rules, it only
 breaks C rules IIRC.  OTOH, NEON FP has no support
 for NaN and other nice things from IEEE754.

And just checked gcc man page to verify this stuff. 

-ffast-math
  Sets -fno-math-errno, -funsafe-math-optimizations,
  -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans
  and -fcx-limited-range.

-ffinite-math-only
  Allow optimizations for floating-point arithmetic that assume that arguments
  and results are not NaNs or +-Infs.

  This option is not turned on by any -O option since it can result in
  incorrect output for programs which depend on an exact implementation of
  IEEE or ISO rules/specifications for math functions. It may, however, yield
  faster code for programs that do not require the guarantees of these
  specifications.

So it looks like -ffast-math already assumes no support for NaNs. Even if
there are other nice IEEE754 things preventing NEON from being used
with -ffast-math, it would make sense to invent an appropriate new option
relaxing this requirement.

-- 
Best regards,
Siarhei Siamashka
___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Siarhei Siamashka
On Wednesday 10 March 2010, Laurent GUERBY wrote:
 On Wed, 2010-03-10 at 21:54 +0200, Siarhei Siamashka wrote:
  I wonder why the compiler does not use real NEON instructions with
  -ffast-math option, it should be quite useful even for scalar code.
 
  something like:
 
  vld1.32  {d0[0]}, [r0]
  vadd.f32 d0, d0, d0
  vst1.32  {d0[0]}, [r0]
 
  instead of:
 
  flds     s0, [r0]
  fadds    s0, s0, s0
  fsts     s0, [r0]
 
  for:
 
  *float_ptr = *float_ptr + *float_ptr;
 
  At least NEON is pipelined and should be a lot faster on more complex
  code examples where it can actually benefit from pipelining. On x86, SSE2
  is used quite nicely for floating point math.

 Hi,

 Please open a report on http://gcc.gnu.org/bugzilla with your test
 sources and command line, at least GCC developpers will notice there's
 interest :).

This sounds reasonable :)

 GCC comes with some builtins for neon, they're defined in arm_neon.h
 see below.

This does not sound like a good idea. If the code has to be modified and
changed into something nonportable, there are way better options than
intrinsics.

Regarding the use of NEON instructions via C++ operator overloading, a test
program is attached.

# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -ffast-math
  -o neon_float neon_float.cpp

=== ieee754 floats ===

real0m3.396s
user0m3.391s
sys 0m0.000s

=== runfast floats ===

real0m2.285s
user0m2.273s
sys 0m0.008s

=== NEON C++ wrapper ===

real0m1.312s
user0m1.313s
sys 0m0.000s

But the quality of generated code is quite bad. That's also something to be
reported to gcc bugzilla :)

-- 
Best regards,
Siarhei Siamashka
#include <stdio.h>
#include <arm_neon.h>

#if 1
class fast_float
{
    float32x2_t data;
public:
    fast_float(float x) { data = vset_lane_f32(x, data, 0); }
    fast_float(const fast_float &x) { data = x.data; }
    fast_float(const float32x2_t &x) { data = x; }
    operator float () { return vget_lane_f32(data, 0); }

    friend fast_float operator+(const fast_float &a, const fast_float &b);
    friend fast_float operator*(const fast_float &a, const fast_float &b);

    const fast_float &operator+=(fast_float a)
    {
        data = vadd_f32(data, a.data);
        return *this;
    }
};
fast_float operator+(const fast_float &a, const fast_float &b)
{
    return vadd_f32(a.data, b.data);
}
fast_float operator*(const fast_float &a, const fast_float &b)
{
    return vmul_f32(a.data, b.data);
}
#else
typedef float fast_float;
#endif

float f(float *a, float *b)
{
    int i;
    fast_float accumulator = 0;
    for (i = 0; i < 1024; i += 16)
    {
        accumulator += (fast_float)a[i + 0] * (fast_float)b[i + 0];
        accumulator += (fast_float)a[i + 1] * (fast_float)b[i + 1];
        accumulator += (fast_float)a[i + 2] * (fast_float)b[i + 2];
        accumulator += (fast_float)a[i + 3] * (fast_float)b[i + 3];
        accumulator += (fast_float)a[i + 4] * (fast_float)b[i + 4];
        accumulator += (fast_float)a[i + 5] * (fast_float)b[i + 5];
        accumulator += (fast_float)a[i + 6] * (fast_float)b[i + 6];
        accumulator += (fast_float)a[i + 7] * (fast_float)b[i + 7];
        accumulator += (fast_float)a[i + 8] * (fast_float)b[i + 8];
        accumulator += (fast_float)a[i + 9] * (fast_float)b[i + 9];
        accumulator += (fast_float)a[i + 10] * (fast_float)b[i + 10];
        accumulator += (fast_float)a[i + 11] * (fast_float)b[i + 11];
        accumulator += (fast_float)a[i + 12] * (fast_float)b[i + 12];
        accumulator += (fast_float)a[i + 13] * (fast_float)b[i + 13];
        accumulator += (fast_float)a[i + 14] * (fast_float)b[i + 14];
        accumulator += (fast_float)a[i + 15] * (fast_float)b[i + 15];
    }
    return accumulator;
}

volatile float dummy;
float buf1[1024];
float buf2[1024];

int main()
{
    int i;
    int tmp;
    __asm__ volatile(
        "fmrx   %[tmp], fpscr\n"
        "orr    %[tmp], %[tmp], #(1 << 24)\n" /* flush-to-zero */
        "orr    %[tmp], %[tmp], #(1 << 25)\n" /* default NaN */
        "bic    %[tmp], %[tmp], #((1 << 15) | (1 << 12) | (1 << 11) | (1 << 10) | (1 << 9) | (1 << 8))\n" /* clear exception bits */
        "fmxr   fpscr, %[tmp]\n"
        : [tmp] "=r" (tmp)
    );
    for (i = 0; i < 1024; i++)
    {
        buf1[i] = buf2[i] = i % 16;
    }
    for (i = 0; i < 10; i++)
    {
        dummy = f(buf1, buf2);
    }
    printf("%f\n", (double)dummy);
    return 0;
}
___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


Re: Performance of floating point instructions

2010-03-10 Thread Laurent GUERBY
On Thu, 2010-03-11 at 00:32 +0200, Siarhei Siamashka wrote:
 On Wednesday 10 March 2010, Laurent GUERBY wrote:
  GCC comes with some builtins for neon, they're defined in arm_neon.h
  see below.
 
 This does not sound like a good idea. If the code has to be modified and
 changed into something nonportable, there are way better options than
 intrinsics.

I've no idea if this comes from a standard but ARM seems to imply
arm_neon.h is supposed to be supported by various toolchains:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/ch01s04s02.html

GCC and RVCT support the same NEON intrinsic syntax, making C or C++
code portable between the toolchains. To add support for NEON
intrinsics, include the header file arm_neon.h. Example 1.3 implements
the same functionality as the assembler examples, using intrinsics in C
code instead of assembler instructions.
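
A trivial example of that style, roughly the intrinsics counterpart of the
vadd.f32 snippet earlier in the thread (a sketch; n is assumed to be a
multiple of 4):

#include <arm_neon.h>

/* c[i] = a[i] + b[i] for n floats, four lanes at a time.  Stays plain C,
 * so it ports between GCC and RVCT as the ARM documentation describes. */
void add_f32(float *c, const float *a, const float *b, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(c + i, vaddq_f32(va, vb));
    }
}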


(nice test :)

 But the quality of generated code is quite bad. That's also something to be
 reported to gcc bugzilla :)

Seems that in some limited cases GCC is making progress on neon:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43001

I'm building current SVN g++ for ARM to see what it does on your code
(GCC 4.4.1 gets it to run in 1.5 s on an 800 MHz Efika MX box).

Sincerely,

Laurent



___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers


N900: Mapper, device deadlocks and clutter/GL errors

2010-03-10 Thread Alberto Mardegan

Hi all,
  I have some issues with Maemo Mapper which I cannot solve by myself and
which are unfortunately quite severe:


- Sometimes, a "HWRecoveryResetSGX: SGX Hardware Recovery triggered" line
appears in the syslog. Most of the time, without any visible effect.


- Rarely, the device freezes for several seconds. It seems to me that this is 
solved by pressing the power key and waiting a few seconds -- but it might be 
just a coincidence.


- Always: when I'm drawing on a texture (either loading a map tile, or using 
cairo on a texture) and a Hildon banner/notification appears (either from Mapper 
itself, or even an incoming chat notification), the texture is corrupted, and it 
will contain a small rectangle with pseudorandom pixels.


I'm using clutter 1.0, from extras-devel.

When running the application in Scratchbox i486 with valgrind, it is damn slow 
but I don't see any errors reported while rendering the tiles.


Can you help me to debug this? I suspect it is all due to some bugs in clutter 
(or maybe in the SGX driver), but I have no idea where to start from.


TIA,
  Alberto

--
http://www.mardy.it -- geek in un lingua international!
___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers