N900: Mapper, device deadlocks and clutter/GL errors
Hi all,
I have some issues with Maemo Mapper which I cannot solve by myself and which are unfortunately quite severe:

- Sometimes, a "HWRecoveryResetSGX: SGX Hardware Recovery triggered" line appears in the syslog, most of the time without any visible effect.
- Rarely, the device freezes for several seconds. It seems to me that this is solved by pressing the power key and waiting a few seconds -- but it might be just a coincidence.
- Always: when I'm drawing on a texture (either loading a map tile, or using cairo on a texture) and a Hildon banner/notification appears (either from Mapper itself, or even an incoming chat notification), the texture is corrupted: it will contain a small rectangle with pseudorandom pixels.

I'm using clutter 1.0, from extras-devel. When running the application in Scratchbox i486 with valgrind, it is damn slow but I don't see any errors reported while rendering the tiles.

Can you help me to debug this? I suspect it is all due to some bugs in clutter (or maybe in the SGX driver), but I have no idea where to start from.

TIA,
  Alberto

-- 
http://www.mardy.it <-- geek in un lingua international!

___
maemo-developers mailing list
maemo-developers@maemo.org
https://lists.maemo.org/mailman/listinfo/maemo-developers
Re: Performance of floating point instructions
On Thu, 2010-03-11 at 00:32 +0200, Siarhei Siamashka wrote:
> On Wednesday 10 March 2010, Laurent GUERBY wrote:
> > GCC comes with some builtins for neon, they're defined in arm_neon.h
> > see below.
>
> This does not sound like a good idea. If the code has to be modified and
> changed into something nonportable, there are way better options than
> intrinsics.

I've no idea if this comes from a standard, but ARM seems to imply that
arm_neon.h is supposed to be supported by various toolchains:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/ch01s04s02.html

<< GCC and RVCT support the same NEON intrinsic syntax, making C or C++
code portable between the toolchains. To add support for NEON intrinsics,
include the header file arm_neon.h. Example 1.3 implements the same
functionality as the assembler examples, using intrinsics in C code
instead of assembler instructions. >>

(nice test :)

> But the quality of generated code is quite bad. That's also something to
> be reported to gcc bugzilla :)

It seems that in some limited cases GCC is making progress on neon:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43001

I'm building current SVN g++ for arm to see what it does on your code
(GCC 4.4.1 gets it to run in 1.5s on an 800 MHz Efika MX box).

Sincerely,
Laurent
Re: Performance of floating point instructions
On Wednesday 10 March 2010, Laurent GUERBY wrote:
> On Wed, 2010-03-10 at 21:54 +0200, Siarhei Siamashka wrote:
> > I wonder why the compiler does not use real NEON instructions with
> > -ffast-math option, it should be quite useful even for scalar code.
> >
> > something like:
> >
> > vld1.32 {d0[0]}, [r0]
> > vadd.f32 d0, d0, d0
> > vst1.32 {d0[0]}, [r0]
> >
> > instead of:
> >
> > flds s0, [r0]
> > fadds s0, s0, s0
> > fsts s0, [r0]
> >
> > for:
> >
> > *float_ptr = *float_ptr + *float_ptr;
> >
> > At least NEON is pipelined and should be a lot faster on more complex
> > code examples where it can actually benefit from pipelining. On x86,
> > SSE2 is used quite nicely for floating point math.
>
> Hi,
>
> Please open a report on http://gcc.gnu.org/bugzilla with your test
> sources and command line, at least GCC developers will notice there's
> interest :).

This sounds reasonable :)

> GCC comes with some builtins for neon, they're defined in arm_neon.h
> see below.

This does not sound like a good idea. If the code has to be modified and
changed into something nonportable, there are way better options than
intrinsics.

Regarding the use of NEON instructions via C++ operator overloading, a
test program is attached.

# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -ffast-math -o neon_float neon_float.cpp

=== ieee754 floats ===
real    0m3.396s
user    0m3.391s
sys     0m0.000s

=== runfast floats ===
real    0m2.285s
user    0m2.273s
sys     0m0.008s

=== NEON C++ wrapper ===
real    0m1.312s
user    0m1.313s
sys     0m0.000s

But the quality of generated code is quite bad.
That's also something to be reported to gcc bugzilla :)

-- 
Best regards,
Siarhei Siamashka

#include <stdio.h>
#include <arm_neon.h>

#if 1
class fast_float
{
    float32x2_t data;
public:
    fast_float(float x) { data = vset_lane_f32(x, data, 0); }
    fast_float(const fast_float &x) { data = x.data; }
    fast_float(const float32x2_t &x) { data = x; }
    operator float () { return vget_lane_f32(data, 0); }
    friend fast_float operator+(const fast_float &a, const fast_float &b);
    friend fast_float operator*(const fast_float &a, const fast_float &b);
    const fast_float &operator+=(fast_float a)
    {
        data = vadd_f32(data, a.data);
        return *this;
    }
};

fast_float operator+(const fast_float &a, const fast_float &b)
{
    return vadd_f32(a.data, b.data);
}

fast_float operator*(const fast_float &a, const fast_float &b)
{
    return vmul_f32(a.data, b.data);
}
#else
typedef float fast_float;
#endif

float f(float *a, float *b)
{
    int i;
    fast_float accumulator = 0;
    for (i = 0; i < 1024; i += 16) {
        accumulator += (fast_float)a[i + 0] * (fast_float)b[i + 0];
        accumulator += (fast_float)a[i + 1] * (fast_float)b[i + 1];
        accumulator += (fast_float)a[i + 2] * (fast_float)b[i + 2];
        accumulator += (fast_float)a[i + 3] * (fast_float)b[i + 3];
        accumulator += (fast_float)a[i + 4] * (fast_float)b[i + 4];
        accumulator += (fast_float)a[i + 5] * (fast_float)b[i + 5];
        accumulator += (fast_float)a[i + 6] * (fast_float)b[i + 6];
        accumulator += (fast_float)a[i + 7] * (fast_float)b[i + 7];
        accumulator += (fast_float)a[i + 8] * (fast_float)b[i + 8];
        accumulator += (fast_float)a[i + 9] * (fast_float)b[i + 9];
        accumulator += (fast_float)a[i + 10] * (fast_float)b[i + 10];
        accumulator += (fast_float)a[i + 11] * (fast_float)b[i + 11];
        accumulator += (fast_float)a[i + 12] * (fast_float)b[i + 12];
        accumulator += (fast_float)a[i + 13] * (fast_float)b[i + 13];
        accumulator += (fast_float)a[i + 14] * (fast_float)b[i + 14];
        accumulator += (fast_float)a[i + 15] * (fast_float)b[i + 15];
    }
    return accumulator;
}

volatile float dummy;
float buf1[1024];
float buf2[1024];
int main()
{
    int i;
    int tmp;
    __asm__ volatile(
        "fmrx %[tmp], fpscr\n"
        "orr  %[tmp], %[tmp], #(1 << 24)\n" /* flush-to-zero */
        "orr  %[tmp], %[tmp], #(1 << 25)\n" /* default NaN */
        "bic  %[tmp], %[tmp], #((1 << 15) | (1 << 12) | (1 << 11) | (1 << 10) | (1 << 9) | (1 << 8))\n" /* clear exception bits */
        "fmxr fpscr, %[tmp]\n"
        : [tmp] "=r" (tmp)
    );

    for (i = 0; i < 1024; i++) {
        buf1[i] = buf2[i] = i % 16;
    }

    for (i = 0; i < 10; i++) {
        dummy = f(buf1, buf2);
    }

    printf("%f\n", (double)dummy);
    return 0;
}
Re: Performance of floating point instructions
On Wednesday 10 March 2010, Laurent Desnogues wrote:
> Even if fast-math is known to break some rules, it only
> breaks C rules IIRC. OTOH, NEON FP has no support
> for NaN and other nice things from IEEE754.

I just checked the gcc man page to verify this:

-ffast-math
    Sets -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only,
    -fno-rounding-math, -fno-signaling-nans and -fcx-limited-range.

-ffinite-math-only
    Allow optimizations for floating-point arithmetic that assume that
    arguments and results are not NaNs or +-Infs. This option is not turned
    on by any -O option since it can result in incorrect output for programs
    which depend on an exact implementation of IEEE or ISO
    rules/specifications for math functions. It may, however, yield faster
    code for programs that do not require the guarantees of these
    specifications.

So it looks like -ffast-math already assumes no support for NaNs. Even if
there are other nice IEEE754 things preventing NEON from being used with
-ffast-math, it would make sense to invent an appropriate new option
relaxing those requirements.

-- 
Best regards,
Siarhei Siamashka
Re: Performance of floating point instructions
On Wednesday 10 March 2010, Laurent Desnogues wrote: > On Wed, Mar 10, 2010 at 8:54 PM, Siarhei Siamashka > wrote: > [...] > > > I wonder why the compiler does not use real NEON instructions with > > -ffast-math option, it should be quite useful even for scalar code. > > > > something like: > > > > vld1.32 {d0[0]}, [r0] > > vadd.f32 d0, d0, d0 > > vst1.32 {d0[0]}, [r0] > > > > instead of: > > > > flds s0, [r0] > > fadds s0, s0, s0 > > fsts s0, [r0] > > > > for: > > > > *float_ptr = *float_ptr + *float_ptr; > > > > At least NEON is pipelined and should be a lot faster on more complex > > code examples where it can actually benefit from pipelining. On x86, SSE2 > > is used quite nicely for floating point math. > > Even if fast-math is known to break some rules, it only > breaks C rules IIRC. If that's the case, some other option would be handy. Or even a new custom data type like float_neon (or any other name). Probably it is even possible with C++ and operators overloading. > OTOH, NEON FP has no support > for NaN and other nice things from IEEE754. > > Anyway you're perhaps looking for -mfpu=neon, no? I lost my faith in gcc long ago :) So I'm not really looking for anything. -- Best regards, Siarhei Siamashka ___ maemo-developers mailing list maemo-developers@maemo.org https://lists.maemo.org/mailman/listinfo/maemo-developers
Re: Performance of floating point instructions
On Wed, 2010-03-10 at 21:54 +0200, Siarhei Siamashka wrote:
> I wonder why the compiler does not use real NEON instructions with
> -ffast-math option, it should be quite useful even for scalar code.
>
> something like:
>
> vld1.32 {d0[0]}, [r0]
> vadd.f32 d0, d0, d0
> vst1.32 {d0[0]}, [r0]
>
> instead of:
>
> flds s0, [r0]
> fadds s0, s0, s0
> fsts s0, [r0]
>
> for:
>
> *float_ptr = *float_ptr + *float_ptr;
>
> At least NEON is pipelined and should be a lot faster on more complex code
> examples where it can actually benefit from pipelining. On x86, SSE2 is
> used quite nicely for floating point math.

Hi,

Please open a report on http://gcc.gnu.org/bugzilla with your test
sources and command line; at least GCC developers will notice there's
interest :).

GCC comes with some builtins for neon, they're defined in arm_neon.h,
see below.

Sincerely,
Laurent

typedef struct float32x2x2_t
{
    float32x2_t val[2];
} float32x2x2_t;

...

__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
vpadd_f32 (float32x2_t __a, float32x2_t __b)
{
    return (float32x2_t)__builtin_neon_vpaddv2sf (__a, __b, 3);
}
Re: Performance of floating point instructions
On Wed, Mar 10, 2010 at 8:54 PM, Siarhei Siamashka wrote: [...] > I wonder why the compiler does not use real NEON instructions with -ffast-math > option, it should be quite useful even for scalar code. > > something like: > > vld1.32 {d0[0]}, [r0] > vadd.f32 d0, d0, d0 > vst1.32 {d0[0]}, [r0] > > instead of: > > flds s0, [r0] > fadds s0, s0, s0 > fsts s0, [r0] > > for: > > *float_ptr = *float_ptr + *float_ptr; > > At least NEON is pipelined and should be a lot faster on more complex code > examples where it can actually benefit from pipelining. On x86, SSE2 is used > quite nicely for floating point math. Even if fast-math is known to break some rules, it only breaks C rules IIRC. OTOH, NEON FP has no support for NaN and other nice things from IEEE754. Anyway you're perhaps looking for -mfpu=neon, no? Laurent ___ maemo-developers mailing list maemo-developers@maemo.org https://lists.maemo.org/mailman/listinfo/maemo-developers
Re: Performance of floating point instructions
On Wednesday 10 March 2010, Laurent Desnogues wrote:
> On Wed, Mar 10, 2010 at 7:29 PM, Alberto Mardegan
> > So, it seems that there's a huge improvement when switching from doubles
> > to floats; although I wonder if it's because of the FPU or just because
> > the amount of data passed around is smaller.
> > On the other hand, the improvement obtained by enabling the fast FPU
> > mode is rather small -- but that might be due to the fact that the FPU
> > operations are not a major player in this piece of code.
>
> The "fast" mode only gains 1 or 2 cycles per FP instruction.
> The FPU on Cortex-A8 is not pipelined and the fast mode
> can't change that :-)

It's probably
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344j/ch16s07s01.html
vs.
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344j/BCGEIHDJ.html

I wonder why the compiler does not use real NEON instructions with the
-ffast-math option; it should be quite useful even for scalar code.

Something like:

vld1.32 {d0[0]}, [r0]
vadd.f32 d0, d0, d0
vst1.32 {d0[0]}, [r0]

instead of:

flds s0, [r0]
fadds s0, s0, s0
fsts s0, [r0]

for:

*float_ptr = *float_ptr + *float_ptr;

At least NEON is pipelined and should be a lot faster on more complex code
examples where it can actually benefit from pipelining. On x86, SSE2 is
used quite nicely for floating point math.

-- 
Best regards,
Siarhei Siamashka
Re: Performance of floating point instructions
On Wednesday 10 March 2010, Alberto Mardegan wrote: > Alberto Mardegan wrote: > > Does one have any figure about how the performance of the FPU is, > > compared to integer operations? > > I added some profiling to the code, and I measured the time spent by a > function which is operating on an array of points (whose coordinates are > integers) and trasforming each of them into a geographic coordinates > (latitude and longitude, floating point) and calculating the distance > from the previous point. > > http://vcs.maemo.org/git?p=maemo-mapper;a=shortlog;h=refs/heads/gps_control > map_path_calculate_distances() is in path.c, > calculate_distance() is in utils.c, > unit2latlon() is a pointer to unit2latlon_google() in tile_source.c > > > The output (application compiled with -O0): Using an optimized build (-O2 or -O3) may sometimes change the overall picture quite dramatically. It makes almost no sense benchmarking -O0 code, because in this case all the local variables are kept in memory and are read/written before/after each operation. It's substantially different from normal code. -- Best regards, Siarhei Siamashka ___ maemo-developers mailing list maemo-developers@maemo.org https://lists.maemo.org/mailman/listinfo/maemo-developers
Re: Performance of floating point instructions
On Wed, Mar 10, 2010 at 7:29 PM, Alberto Mardegan wrote: > Alberto Mardegan wrote: >> >> Does one have any figure about how the performance of the FPU is, compared >> to integer operations? > > I added some profiling to the code, and I measured the time spent by a > function which is operating on an array of points (whose coordinates are > integers) and trasforming each of them into a geographic coordinates > (latitude and longitude, floating point) and calculating the distance from > the previous point. > > http://vcs.maemo.org/git?p=maemo-mapper;a=shortlog;h=refs/heads/gps_control > map_path_calculate_distances() is in path.c, > calculate_distance() is in utils.c, > unit2latlon() is a pointer to unit2latlon_google() in tile_source.c > > > The output (application compiled with -O0): > > > double: > > map_path_calculate_distances: 110 ms for 8250 points > map_path_calculate_distances: 5 ms for 430 points > > map_path_calculate_distances: 109 ms for 8250 points > map_path_calculate_distances: 5 ms for 430 points > > > float: > > map_path_calculate_distances: 60 ms for 8250 points > map_path_calculate_distances: 3 ms for 430 points > > map_path_calculate_distances: 60 ms for 8250 points > map_path_calculate_distances: 3 ms for 430 points > > > float with fast FPU mode: > > map_path_calculate_distances: 50 ms for 8250 points > map_path_calculate_distances: 2 ms for 430 points > > map_path_calculate_distances: 50 ms for 8250 points > map_path_calculate_distances: 2 ms for 430 points > > > So, it seems that there's a huge improvements when switching from doubles to > floats; although I wonder if it's because of the FPU or just because the > amount of data passed around is smaller. > On the other hand, the improvements obtained by enabling the fast FPU mode > is rather small -- but that might be due to the fact that the FPU operations > are not a major player in this piece of code. The "fast" mode only gains 1 or 2 cycles per FP instruction. 
The FPU on Cortex-A8 is not pipelined and the fast mode can't change that :-) > One curious thing is that while making these changes, I forgot to change the > math functions to there float version, so that instead of using: > > float x, y; > x = sinf(y); > > I was using: > > float x, y; > x = sin(y); > > The timings obtained this way are surprisingly (at least to me) bad: > > map_path_calculate_distances: 552 ms for 8250 points > map_path_calculate_distances: 92 ms for 430 points > > map_path_calculate_distances: 552 ms for 8250 points > map_path_calculate_distances: 91 ms for 430 points > > Much worse than the double version. The only reason I can think of, is the > conversion from float to double and vice versa, but is it really that > expensive? This looks odd given that the 2 additional instructions take 5 and 7 cycles. > Anyway, I'll stick to using 32bit floats. :-) As long as it fits your needs that seems wise :) Laurent ___ maemo-developers mailing list maemo-developers@maemo.org https://lists.maemo.org/mailman/listinfo/maemo-developers
Re: Performance of floating point instructions
On Wed, 2010-03-10 at 20:29 +0200, Alberto Mardegan wrote: > Alberto Mardegan wrote: > > Does one have any figure about how the performance of the FPU is, > > compared to integer operations? > > I added some profiling to the code, and I measured the time spent by a > function which is operating on an array of points (whose coordinates are > integers) and trasforming each of them into a geographic coordinates > (latitude and longitude, floating point) and calculating the distance > from the previous point. > > http://vcs.maemo.org/git?p=maemo-mapper;a=shortlog;h=refs/heads/gps_control > map_path_calculate_distances() is in path.c, > calculate_distance() is in utils.c, > unit2latlon() is a pointer to unit2latlon_google() in tile_source.c > > > The output (application compiled with -O0): > > > double: > > map_path_calculate_distances: 110 ms for 8250 points > map_path_calculate_distances: 5 ms for 430 points > > map_path_calculate_distances: 109 ms for 8250 points > map_path_calculate_distances: 5 ms for 430 points > > > float: > > map_path_calculate_distances: 60 ms for 8250 points > map_path_calculate_distances: 3 ms for 430 points > > map_path_calculate_distances: 60 ms for 8250 points > map_path_calculate_distances: 3 ms for 430 points > > > float with fast FPU mode: > > map_path_calculate_distances: 50 ms for 8250 points > map_path_calculate_distances: 2 ms for 430 points > > map_path_calculate_distances: 50 ms for 8250 points > map_path_calculate_distances: 2 ms for 430 points > > > So, it seems that there's a huge improvements when switching from > doubles to floats; although I wonder if it's because of the FPU or just > because the amount of data passed around is smaller. Right, is your experiment actually measuring floating point performance, or is that swamped out by memory accesses, or some bus transfers or something like that? 
> On the other hand, the improvements obtained by enabling the fast FPU > mode is rather small -- but that might be due to the fact that the FPU > operations are not a major player in this piece of code. > > One curious thing is that while making these changes, I forgot to change > the math functions to there float version, so that instead of using: > > float x, y; > x = sinf(y); > > I was using: > > float x, y; > x = sin(y); > > The timings obtained this way are surprisingly (at least to me) bad: > > map_path_calculate_distances: 552 ms for 8250 points > map_path_calculate_distances: 92 ms for 430 points > > map_path_calculate_distances: 552 ms for 8250 points > map_path_calculate_distances: 91 ms for 430 points > > Much worse than the double version. The only reason I can think of, is > the conversion from float to double and vice versa, but is it really > that expensive? > > Anyway, I'll stick to using 32bit floats. :-) > It is often hard to tell how much difference optimizing a particular operation makes. If the setup is cheaper for the slower operation, do you gain anything by using faster ops? Hard to measure sometimes. Like racing, it's not how fast you go, it's when you get there. Bernd ___ maemo-developers mailing list maemo-developers@maemo.org https://lists.maemo.org/mailman/listinfo/maemo-developers
Re: Performance of floating point instructions
Alberto Mardegan wrote:
> Does anyone have any figures about how the performance of the FPU
> compares to integer operations?

I added some profiling to the code, and I measured the time spent by a
function which operates on an array of points (whose coordinates are
integers), transforming each of them into geographic coordinates
(latitude and longitude, floating point) and calculating the distance
from the previous point.

http://vcs.maemo.org/git?p=maemo-mapper;a=shortlog;h=refs/heads/gps_control
map_path_calculate_distances() is in path.c,
calculate_distance() is in utils.c,
unit2latlon() is a pointer to unit2latlon_google() in tile_source.c

The output (application compiled with -O0):

double:

map_path_calculate_distances: 110 ms for 8250 points
map_path_calculate_distances: 5 ms for 430 points

map_path_calculate_distances: 109 ms for 8250 points
map_path_calculate_distances: 5 ms for 430 points

float:

map_path_calculate_distances: 60 ms for 8250 points
map_path_calculate_distances: 3 ms for 430 points

map_path_calculate_distances: 60 ms for 8250 points
map_path_calculate_distances: 3 ms for 430 points

float with fast FPU mode:

map_path_calculate_distances: 50 ms for 8250 points
map_path_calculate_distances: 2 ms for 430 points

map_path_calculate_distances: 50 ms for 8250 points
map_path_calculate_distances: 2 ms for 430 points

So, it seems that there's a huge improvement when switching from doubles
to floats; although I wonder if it's because of the FPU or just because
the amount of data passed around is smaller.
On the other hand, the improvement obtained by enabling the fast FPU
mode is rather small -- but that might be due to the fact that the FPU
operations are not a major player in this piece of code.
One curious thing: while making these changes, I forgot to change the
math functions to their float versions, so that instead of using:

float x, y;
x = sinf(y);

I was using:

float x, y;
x = sin(y);

The timings obtained this way are surprisingly (at least to me) bad:

map_path_calculate_distances: 552 ms for 8250 points
map_path_calculate_distances: 92 ms for 430 points

map_path_calculate_distances: 552 ms for 8250 points
map_path_calculate_distances: 91 ms for 430 points

Much worse than the double version. The only reason I can think of is
the conversion from float to double and vice versa, but is it really
that expensive?

Anyway, I'll stick to using 32-bit floats. :-)

-- 
http://www.mardy.it <- geek in un lingua international!
Re: Performance of floating point instructions
Eero Tamminen wrote: Is there any performance penalty if this switch is done often? Why you would switch it off? Operations on "fast floats" aren't IEEE compatible, but as far as I've understood, they should differ only for numbers that are very close to zero, close enough that repeating your algorithm few more times would produce divide by zero even with IEEE semantics (i.e. if "fast float" causes you issues, it's indicating that there's most likely some issue in your algorithm). Ok, I thought the precision loss would be more noticeable, but as we are talking about latitude and longitude (and anyway the GPS accuracy is not so great), I guess I don't have any need to turn it off. Anyway, I'm doing some benchmarks, I'll post the results soon. -- http://www.mardy.it <- geek in un lingua international! ___ maemo-developers mailing list maemo-developers@maemo.org https://lists.maemo.org/mailman/listinfo/maemo-developers
Re: Performance of floating point instructions
Eero Tamminen wrote: Hamalainen Kimmo (Nokia-D/Helsinki) wrote: On Wed, 2010-03-10 at 12:57 +0100, ext Alberto Mardegan wrote: Kimmo Hämäläinen wrote: You can also put the CPU to a "fast floats" mode, see hd_fpu_set_mode() in http://maemo.gitorious.org/fremantle-hildon-desktop/hildon-desktop/blobs/master/src/main.c Not the libosso osso_fpu_set_mode() function? I can't find this in libosso.h. :-( I'll copy Kimmo's code. -- http://www.mardy.it <- geek in un lingua international! ___ maemo-developers mailing list maemo-developers@maemo.org https://lists.maemo.org/mailman/listinfo/maemo-developers
Re: Performance of floating point instructions
Hi, Hamalainen Kimmo (Nokia-D/Helsinki) wrote: On Wed, 2010-03-10 at 12:57 +0100, ext Alberto Mardegan wrote: Kimmo Hämäläinen wrote: You can also put the CPU to a "fast floats" mode, see hd_fpu_set_mode() in http://maemo.gitorious.org/fremantle-hildon-desktop/hildon-desktop/blobs/master/src/main.c Not the libosso osso_fpu_set_mode() function? - Eero ___ maemo-developers mailing list maemo-developers@maemo.org https://lists.maemo.org/mailman/listinfo/maemo-developers
Re: Performance of floating point instructions
Hi,

ext Alberto Mardegan wrote:
> Kimmo Hämäläinen wrote:
>> You can also put the CPU to a "fast floats" mode, see hd_fpu_set_mode()
>> in
>> http://maemo.gitorious.org/fremantle-hildon-desktop/hildon-desktop/blobs/master/src/main.c
>> N900 has support for NEON instructions also.
>
> This sounds interesting!
> Is there any performance penalty if this switch is done often?

Why would you switch it off? Operations on "fast floats" aren't IEEE
compatible, but as far as I've understood, they should differ only for
numbers that are very close to zero -- close enough that repeating your
algorithm a few more times would produce a divide by zero even with IEEE
semantics (i.e. if "fast float" causes you issues, it's an indication
that there's most likely some issue in your algorithm).

- Eero
Re: Performance of floating point instructions
On Wed, 2010-03-10 at 12:57 +0100, ext Alberto Mardegan wrote: > Kimmo Hämäläinen wrote: > > You can also put the CPU to a "fast floats" mode, see hd_fpu_set_mode() > > in > > http://maemo.gitorious.org/fremantle-hildon-desktop/hildon-desktop/blobs/master/src/main.c > > > > N900 has support for NEON instructions also. > > This sounds interesting! > > Is there any performance penalty if this switch is done often? IIRC, there was not. Leonid Moiseichuk was testing this about a year ago, and he noticed almost 50% speed-up for floats. Notice that this affects only floats, not doubles, and that there is a small accuracy penalty. -Kimmo ___ maemo-developers mailing list maemo-developers@maemo.org https://lists.maemo.org/mailman/listinfo/maemo-developers
Re: Performance of floating point instructions
Kimmo Hämäläinen wrote: You can also put the CPU to a "fast floats" mode, see hd_fpu_set_mode() in http://maemo.gitorious.org/fremantle-hildon-desktop/hildon-desktop/blobs/master/src/main.c N900 has support for NEON instructions also. This sounds interesting! Is there any performance penalty if this switch is done often? Ciao, Alberto -- http://www.mardy.it <-- geek in un lingua international! ___ maemo-developers mailing list maemo-developers@maemo.org https://lists.maemo.org/mailman/listinfo/maemo-developers
Re: Performance of floating point instructions
On Wed, 2010-03-10 at 10:46 +0100, ext Ove Kaaven wrote: > Alberto Mardegan skrev: > > Does anyone know any tricks to optimize certain operations on arrays of > > data? > > The answer to that is, obviously, to use the Cortex-A-series SIMD > engine, NEON. > > Supposedly you may be able to make gcc generate NEON instructions with > -mfpu=neon -ffast-math -ftree-vectorize (and perhaps -mfloat-abi=softfp, > but that's the default in the Fremantle SDK anyway), but it's still not > very good at it, so writing the asm by hand is still better... and I'm > not sure if it can automatically vectorize library calls like sqrt. You can also put the CPU to a "fast floats" mode, see hd_fpu_set_mode() in http://maemo.gitorious.org/fremantle-hildon-desktop/hildon-desktop/blobs/master/src/main.c N900 has support for NEON instructions also. -Kimmo ___ maemo-developers mailing list maemo-developers@maemo.org https://lists.maemo.org/mailman/listinfo/maemo-developers
Re: Performance of floating point instructions
On Wednesday, 10 March 2010 at 11:14:14, Laurent Desnogues wrote:
> One has to be careful with that approach: Cortex-A9 SoCs won't
> necessarily come with a NEON SIMD unit, as it's optional. So it'd
> be better to also include code that doesn't assume one has a
> NEON unit.

Or if someone tries to run a new version of maemo-mapper on an N8x0,
for example.

Regards,
-- 
JID: h...@jabber.org
Website: http://marcin.juszkiewicz.com.pl/
LinkedIn: http://www.linkedin.com/in/marcinjuszkiewicz
Re: Performance of floating point instructions
On Wed, Mar 10, 2010 at 10:46 AM, Ove Kaaven wrote: > Alberto Mardegan skrev: >> Does anyone know any tricks to optimize certain operations on arrays of >> data? > > The answer to that is, obviously, to use the Cortex-A-series SIMD > engine, NEON. > > Supposedly you may be able to make gcc generate NEON instructions with > -mfpu=neon -ffast-math -ftree-vectorize (and perhaps -mfloat-abi=softfp, > but that's the default in the Fremantle SDK anyway), but it's still not > very good at it, so writing the asm by hand is still better... and I'm > not sure if it can automatically vectorize library calls like sqrt. One has to be careful with that approach: Cortex-A9 SoC won't necessarily come with a NEON SIMD unit, as it's optional. So it'd be better to also include code that doesn't assume one has a NEON unit. Laurent ___ maemo-developers mailing list maemo-developers@maemo.org https://lists.maemo.org/mailman/listinfo/maemo-developers
RE: Performance of floating point instructions
> > in maemo-mapper I have a lot of code involved in doing > > transformations from latitude/longitude to Mercator > > coordinates (used in google maps, for example), calculation > > of distances, etc. > > > > I'm trying to use integer arithmetics as much as > > possible, but sometimes it's a bit impractical, and I wonder > > if it's really worth the trouble. Is the code slow at the moment and is it specifically the fp stuff that's slowing it down? If not, I'd say it's probably not worth the effort unless you're doing this for fun/out of interest. > > Does one have any figure about how the performance of > > the FPU is, compared to integer operations? > > > > A practical question: should I use this way of > > computing the square root: > > > > http://en.wikipedia.org/wiki/Methods_of_computing_square_roots > > #Binary_numeral_system_.28base_2.29 > > > > (but operating on 32 or even 64 bits), or would I be > > better using sqrtf() or sqrt()? I'd suggest writing some benchmark code for the functions you wish to compare. > > Does anyone know any tricks to optimize certain > > operations on arrays of data? There are SIMD extensions (http://www.arm.com/products/processors/technologies/dsp-simd.php). > Basically, what we did with ThinX OS, is have a full blown > soft-float toolchain which then used the already proven and > highly optimized GCC's stack floating point operations. > However , Maemo is not soft float, so I'd recommend to > experiment with rebuilding Mapper using such a soft float > enabled toolchain, statically linked to avoid glitches to > system's libc or have a seperat LD_LIBRARY_PATH to avoid > memory hogging, and see where it gets you. Soft-float is significantly slower than using the VFP hard-float (using mfpu, etc., flags on GCC on the N900 and the N8x0 for that matter), there should be emails containing benchmarks on the list from a long while back otherwise I can dig them out again. 
But Alberto's situation is slightly different as his integer-only code need not deal with arbitrary fp numbers (as is the case for the soft-float code) as he knows what his inputs' ranges will be, therefore he should be able to write more efficient and specialised fixed point integer functions that avoid conversion to and from fp form and that trim significant figures to the minimum he requires. Cheers, Simon ___ maemo-developers mailing list maemo-developers@maemo.org https://lists.maemo.org/mailman/listinfo/maemo-developers
Re: Performance of floating point instructions
Alberto Mardegan skrev: > Does anyone know any tricks to optimize certain operations on arrays of > data? The answer to that is, obviously, to use the Cortex-A-series SIMD engine, NEON. Supposedly you may be able to make gcc generate NEON instructions with -mfpu=neon -ffast-math -ftree-vectorize (and perhaps -mfloat-abi=softfp, but that's the default in the Fremantle SDK anyway), but it's still not very good at it, so writing the asm by hand is still better... and I'm not sure if it can automatically vectorize library calls like sqrt. ___ maemo-developers mailing list maemo-developers@maemo.org https://lists.maemo.org/mailman/listinfo/maemo-developers
Re: Performance of floating point instructions
Hi Alberto!

On Wed, Mar 10, 2010 at 9:55 AM, Alberto Mardegan
<ma...@users.sourceforge.net> wrote:
> Hi all,
> in maemo-mapper I have a lot of code involved in doing transformations
> from latitude/longitude to Mercator coordinates (used in google maps, for
> example), calculation of distances, etc.
>
> I'm trying to use integer arithmetics as much as possible, but sometimes
> it's a bit impractical, and I wonder if it's really worth the trouble.
>
> Does anyone have any figures about how the performance of the FPU
> compares to integer operations?
>
> A practical question: should I use this way of computing the square root:
>
> http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Binary_numeral_system_.28base_2.29
>
> (but operating on 32 or even 64 bits), or would I be better off using
> sqrtf() or sqrt()?
>
> Does anyone know any tricks to optimize certain operations on arrays of
> data?

Basically, what we did with ThinX OS is have a full-blown soft-float
toolchain which then used GCC's already proven and highly optimized
stack floating point operations. However, Maemo is not soft-float, so
I'd recommend experimenting with rebuilding Mapper using such a
soft-float enabled toolchain, statically linked to avoid glitches with
the system's libc (or with a separate LD_LIBRARY_PATH to avoid memory
hogging), and seeing where it gets you. IMHO this is the best way to do
FP optimization. We have experimented with it a lot, including sqrtf and
friends, with no significant improvement.

Sivan