Re: [GRASS-user] r.neighbors velocity

2013-07-21 Thread Sören Gebbert
Hi Ivan,

2013/7/17 Ivan Marchesini ivan.marches...@gmail.com:
 Dear Soeren, Hamish, Markus M., Markus N,.
 sorry for the delay in answering, but I didn't expect that my question
 would prompt so many answers, and I was on holiday for 10 days
 with no way to test your code.
 First of all a quick answer to Markus N.
 My colleague is working on high resolution images and DEMs. In this case,
 in particular, he is working on a 1 meter resolution raster map
 concerning landslides. He told me that, due to the landslide
 dimensions, he needs to use a kernel of 501*501 in order to catch the
 signatures of the phenomena (I don't know exactly the details... I'm
 sorry).

I think it will be good if you could ask your colleague for more details,
since we are all very curious about his computational approach.

 I'm not a C developer, but reading your e-mails it seems that the
 performance of C code (and r.neighbors is written in C) strongly
 depends on the compiler.
 Does it mean that by compiling the r.neighbors module in a different
 way (I don't know how) we can obtain better results?

 We have tested Soeren's latest code on the same machine where the
 proprietary software (it is ENVI 5.x) showed those good performances.

The performance is good indeed. Unfortunately you need to raise the
moving window size from 23 to 501 to obtain a result comparable to
your colleague's computation.

./neighbor 5000 5000 501

This will call the neighbor computation with 25,000,000 cells and a
moving window of 501x501 pixels.

Best regards
Soeren

 These are the results:

 gcc -Wall -fopenmp -lgomp -Ofast main.c -o neighbor
 export OMP_NUM_THREADS=1
 time ./neighbor 5000 5000 23
 real    0m16.598s
 user    0m16.477s
 sys     0m0.080s

 export OMP_NUM_THREADS=2
 time ./neighbor 5000 5000 23
 real    0m8.977s
 user    0m17.573s
 sys     0m0.080s

 export OMP_NUM_THREADS=4
 time ./neighbor 5000 5000 23
 real    0m5.993s
 user    0m20.277s
 sys     0m0.088s

 export OMP_NUM_THREADS=6
 time ./neighbor 5000 5000 23
 real    0m4.784s
 user    0m25.770s
 sys     0m0.096s


 Many thanks for your answers





___
grass-user mailing list
grass-user@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/grass-user


Re: [GRASS-user] r.neighbors velocity

2013-07-17 Thread Ivan Marchesini
Dear Soeren, Hamish, Markus M., Markus N,.
sorry for the delay in answering, but I didn't expect that my question
would prompt so many answers, and I was on holiday for 10 days
with no way to test your code.
First of all a quick answer to Markus N.
My colleague is working on high resolution images and DEMs. In this case,
in particular, he is working on a 1 meter resolution raster map
concerning landslides. He told me that, due to the landslide
dimensions, he needs to use a kernel of 501*501 in order to catch the
signatures of the phenomena (I don't know exactly the details... I'm
sorry).

I'm not a C developer, but reading your e-mails it seems that the
performance of C code (and r.neighbors is written in C) strongly
depends on the compiler.
Does it mean that by compiling the r.neighbors module in a different
way (I don't know how) we can obtain better results?

We have tested Soeren's latest code on the same machine where the
proprietary software (it is ENVI 5.x) showed those good performances.

These are the results:

gcc -Wall -fopenmp -lgomp -Ofast main.c -o neighbor
export OMP_NUM_THREADS=1
time ./neighbor 5000 5000 23
real    0m16.598s
user    0m16.477s
sys     0m0.080s

export OMP_NUM_THREADS=2
time ./neighbor 5000 5000 23
real    0m8.977s
user    0m17.573s
sys     0m0.080s

export OMP_NUM_THREADS=4
time ./neighbor 5000 5000 23
real    0m5.993s
user    0m20.277s
sys     0m0.088s

export OMP_NUM_THREADS=6
time ./neighbor 5000 5000 23
real    0m4.784s
user    0m25.770s
sys     0m0.096s


Many thanks for your answers
 
 
 




Re: [GRASS-user] r.neighbors velocity

2013-06-29 Thread Hamish
Hi,

here are the same results for Soeren's test program, with the Open64
compiler from AMD:

 - Same AMD X6 CPU as below.
 - Open64 compiler 4.5.2.1 from AMD  (GPLv2, LGPL)

I just downloaded the pre-built RHEL5 binary tarball and it worked
on Debian/squeeze; I just made an alias to the executable in the
untarred bin/ dir to get it to work.
 see also http://wiki.open64.net/index.php/Installation_on_Ubuntu
Source is available of course, but according to the Debian ITP ticket
it's a bit of a pain to build there.


straight opencc:

real  0m59.015s | 0m58.972s | 0m58.963s
user  0m58.760s | 0m58.812s | 0m58.624s
sys   0m0.248s  | 0m0.136s  | 0m0.300s
--

opencc -O3:

real    0m35.203s | 0m35.173s | 0m35.204s
user    0m35.206s | 0m35.174s | 0m35.206s
sys     0m0.000s  | 0m0.000s  | 0m0.000s
--

opencc -Ofast (with or without -march=auto for native bytecode)

real  0m13.389s | 0m13.402s | 0m13.435s
user  0m13.389s | 0m13.405s | 0m13.437s
sys   0m0.000s  | 0m0.000s  | 0m0.000s
--

opencc -Ofast -march=auto -apo on a 6-(real)-core CPU
v is 2.09131e+13

real  0m2.552s  | 0m2.595s  | 0m2.591s
user  0m14.857s | 0m14.725s | 0m14.725s
sys   0m0.008s  | 0m0.024s  | 0m0.016s


'-apo' is autoparallelization, poorly documented, but it works!
it adds OpenMP pragmas where it thinks it can & where it will
cause a gain; I'm glad to see it's not just for the Fortran
compiler anymore.


So the Open64 compiler is not quite as fast as Intel's one for this
test case, but it's pretty close versus the more versatile gcc in the
far distance. Executable file size for all of the above was less than
12kb, since it can link to local OS shared libs.

I haven't tried it with llvm/clang.

Now I wonder which flags to use to recreate -Ofast in gcc to make it
a fairer comparison..


Hamish


 I also ran it on an AMD Phenom II X6 1090T  (icc -xHost -- -xSSSE3 ?)
 All times real; all output was "v is 2.09131e+13".
 
 gcc 4.4.5 with standard-opts: 7kb binary
  == near parity single-threaded performance with the new i7 chip from
 the 2 year old AMD Phenom and older copy of gcc! (stock debian/squeeze)
   1m16.175s | 1m15.634s | 1m16.029s
 
 icc 12.1 with standard-opts:
   0m32.975s | 0m33.079s | 0m33.249s
 
 icc with -fast opt: (700kb binary)
   0m9.577s | 0m9.572s | 0m9.583s
 
 icc with -parallel auto-MP: (31kb binary)
  == again near parity with the new i7 chip! even with the Intel-biased
 compiler.  user cpu-time was actually less. the advantage of 6 real
 cores vs 4 real+4virtual ones.*
   0m6.406s  | 0m6.404s  | 0m6.404s
   0m37.106s | 0m37.170s | 0m37.106s
   0m0.044s  | 0m0.040s  | 0m0.028s
 
 icc with -fast and -parallel: (2mb binary)
   0m2.002s  | 0m2.002s  | 0m2.002s
   0m10.765s | 0m10.769s | 0m10.769s
   0m0.016s  | 0m0.012s  | 0m0.008s


Re: [GRASS-user] r.neighbors velocity

2013-06-29 Thread Markus Metz
Some more results with Sören's test program on an Intel(R) Core(TM) i5
CPU M450 @ 2.40GHz (2 real cores, 4 fake cores) with gcc 4.7.2 and
clang 3.3

gcc -O3
v is 2.09131e+13

real    2m0.393s
user    1m57.610s
sys     0m0.003s

gcc -Ofast
v is 2.09131e+13

real    0m7.218s
user    0m7.018s
sys     0m0.017s

gcc -Ofast -floop-parallelize-all is as fast as gcc -Ofast

clang -Ofast
v is 2.09131e+13

real    0m18.701s
user    0m18.285s
sys     0m0.000s

Markus M


On Sat, Jun 29, 2013 at 8:35 AM, Hamish hamis...@yahoo.com wrote:
 Hi,

 here are the same results for Soeren's test program, with the Open64
 compiler from AMD:

  - Same AMD X6 CPU as below.
  - Open64 compiler 4.5.2.1 from AMD  (GPLv2, LGPL)

 I just downloaded the pre-built RHEL5 binary tarball and it worked
 on Debian/squeeze; I just made an alias to the executable in the
 untarred bin/ dir to get it to work.
  see also http://wiki.open64.net/index.php/Installation_on_Ubuntu
 Source is available of course, but according to the Debian ITP ticket
 it's a bit of a pain to build there.


 straight opencc:

 real  0m59.015s | 0m58.972s | 0m58.963s
 user  0m58.760s | 0m58.812s | 0m58.624s
 sys   0m0.248s  | 0m0.136s  | 0m0.300s
 --

 opencc -O3:

 real    0m35.203s | 0m35.173s | 0m35.204s
 user    0m35.206s | 0m35.174s | 0m35.206s
 sys     0m0.000s  | 0m0.000s  | 0m0.000s
 --

 opencc -Ofast (with or without -march=auto for native bytecode)

 real  0m13.389s | 0m13.402s | 0m13.435s
 user  0m13.389s | 0m13.405s | 0m13.437s
 sys   0m0.000s  | 0m0.000s  | 0m0.000s
 --

 opencc -Ofast -march=auto -apo on a 6-(real)-core CPU
 v is 2.09131e+13

 real  0m2.552s  | 0m2.595s  | 0m2.591s
 user  0m14.857s | 0m14.725s | 0m14.725s
 sys   0m0.008s  | 0m0.024s  | 0m0.016s


 '-apo' is autoparallelization, poorly documented, but it works!
 it adds OpenMP pragmas where it thinks it can & where it will
 cause a gain; I'm glad to see it's not just for the Fortran
 compiler anymore.


 So the Open64 compiler is not quite as fast as Intel's one for this
 test case, but it's pretty close versus the more versatile gcc in the
 far distance. Executable file size for all of the above was less than
 12kb, since it can link to local OS shared libs.

 I haven't tried it with llvm/clang.

 Now I wonder which flags to use to recreate -Ofast in gcc to make it
 a fairer comparison..


 Hamish


 I also ran it on an AMD Phenom II X6 1090T  (icc -xHost -- -xSSSE3 ?)
 All times real; all output was "v is 2.09131e+13".

 gcc 4.4.5 with standard-opts: 7kb binary
  == near parity single-threaded performance with the new i7 chip from
 the 2 year old AMD Phenom and older copy of gcc! (stock debian/squeeze)
   1m16.175s | 1m15.634s | 1m16.029s

 icc 12.1 with standard-opts:
   0m32.975s | 0m33.079s | 0m33.249s

 icc with -fast opt: (700kb binary)
   0m9.577s | 0m9.572s | 0m9.583s

 icc with -parallel auto-MP: (31kb binary)
  == again near parity with the new i7 chip! even with the Intel-biased
 compiler.  user cpu-time was actually less. the advantage of 6 real
 cores vs 4 real+4virtual ones.*
   0m6.406s  | 0m6.404s  | 0m6.404s
   0m37.106s | 0m37.170s | 0m37.106s
   0m0.044s  | 0m0.040s  | 0m0.028s

 icc with -fast and -parallel: (2mb binary)
   0m2.002s  | 0m2.002s  | 0m2.002s
   0m10.765s | 0m10.769s | 0m10.769s
   0m0.016s  | 0m0.012s  | 0m0.008s


Re: [GRASS-user] r.neighbors velocity

2013-06-29 Thread Hamish
Markus Metz wrote:

 Some more results with Sören's test program on an Intel(R) Core(TM) i5
 CPU M450 @ 2.40GHz (2 real cores, 4 fake cores) with gcc 4.7.2 and
 clang 3.3
 
 gcc -O3
 v is 2.09131e+13
 
 real    2m0.393s
 user    1m57.610s
 sys    0m0.003s
 
 gcc -Ofast
 v is 2.09131e+13
 
 real    0m7.218s
 user    0m7.018s
 sys    0m0.017s


nice. one thing we need to remember though is that it's not entirely
free: among other things, -Ofast turns on -ffast-math,

 This option is not turned on by any -O option besides -Ofast since it can
 result in incorrect output for programs that depend on an exact
 implementation of IEEE or ISO rules/specifications for math functions. It
 may, however, yield faster code for programs that do not require the
 guarantees of these specifications.


which may not be fit for our purposes.


With the ifort compiler there is '-fp-model precise' which allows only
optimizations which don't harm the results. Maybe gcc has something
similar.

Glad to see -floop-parallelize-all in gcc 4.7, it will help us identify
places to focus OpenMP work on.


Hamish



Re: [GRASS-user] r.neighbors velocity

2013-06-29 Thread Markus Metz
On Sat, Jun 29, 2013 at 1:26 PM, Hamish hamis...@yahoo.com wrote:
 Markus Metz wrote:

 Some more results with Sören's test program on an Intel(R) Core(TM) i5
 CPU M450 @ 2.40GHz (2 real cores, 4 fake cores) with gcc 4.7.2 and
 clang 3.3

 gcc -O3
 v is 2.09131e+13

 real    2m0.393s
 user    1m57.610s
 sys     0m0.003s

 gcc -Ofast
 v is 2.09131e+13

 real    0m7.218s
 user    0m7.018s
 sys     0m0.017s


 nice. one thing we need to remember though is that it's not entirely
 free: among other things, -Ofast turns on -ffast-math,
 
  This option is not turned on by any -O option besides -Ofast since it can
  result in incorrect output for programs that depend on an exact
  implementation of IEEE or ISO rules/specifications for math functions. It
  may, however, yield faster code for programs that do not require the
  guarantees of these specifications.
 

 which may not be fit for our purposes.


 With the ifort compiler there is '-fp-model precise' which allows only
 optimizations which don't harm the results. Maybe gcc has something
 similar.

In gcc, you can turn off -ffoo with -fno-foo; maybe this way you can
use -Ofast -fno-fast-math to preserve IEEE specifications.

 Glad to see -floop-parallelize-all in gcc 4.7, it will help us identify
 places to focus OpenMP work on.


 Hamish



Re: [GRASS-user] r.neighbors velocity

2013-06-29 Thread Sören Gebbert
Hey Folks,
many thanks for pointing out the important influence of different compilers
and compiler options. But please be aware that my tiny little program is
not representative of a neighborhood analysis implementation; it was simply
a demonstration of 12 billion ops:

1. I use fixed loop sizes, it is really easy for a compiler to optimize that
2. It is pretty simple to parallelize since only a simple reduction is done
in the inner loop

3. Most important: Ivan stated a window size of 501 ... as
MarkusN IMHO correctly interpreted, this leads to a moving window of 501x501
pixels if this is an option for r.neighbors. It is not the total number of
cells of a rectangular moving window, since it must be an even number
in this case. Shapes other than rectangular are more complex to implement.

To be diplomatic I decided to use 501 pixels, which might represent a
23x21 pixel moving window, to show that even this small number of operations
needs a considerable amount of time on modern CPUs.

If you use a 501x501 pixel moving window, the computational effort is
roughly 501 times 12 billion ops. IMHO in this case a GPU or a
neighbor-algorithm-specific FPGA/ASIC may be able to perform this
operation in 2/3 seconds.

Best regards
Soeren


Re: [GRASS-user] r.neighbors velocity

2013-06-29 Thread Sören Gebbert
Hi,
I have implemented a real average neighborhood algorithm that runs in
parallel using OpenMP. The source code and the benchmark shell script are
attached.

The neighbor program computes the moving window average for windows of
arbitrary size. The size of the map (rows x cols) and the size of the
moving window (odd number, cols == rows) can be specified.

./neighbor rows cols mw_size

IMHO the new program is better for compiler comparison and neighborhood
operation performance.

This is the benchmark on my five-year-old AMD Phenom 4-core computer using
1, 2 and 4 threads:

gcc -Wall -fopenmp -lgomp -Ofast main.c -o neighbor
export OMP_NUM_THREADS=1
time ./neighbor 5000 5000 23
real 0m37.211s
user 0m36.998s
sys 0m0.196s

export OMP_NUM_THREADS=2
time ./neighbor 5000 5000 23
real 0m19.907s
user 0m38.890s
sys 0m0.248s

export OMP_NUM_THREADS=4
time ./neighbor 5000 5000 23
real 0m10.170s
user 0m38.466s
sys 0m0.192s

Happy hacking, compiling and testing. :)

Best regards
Soeren




2013/6/29 Markus Metz markus.metz.gisw...@gmail.com

 On Sat, Jun 29, 2013 at 1:26 PM, Hamish hamis...@yahoo.com wrote:
  Markus Metz wrote:
 
  Some more results with Sören's test program on a Intel(R) Core(TM) i5
  CPU M450 @ 2.40GHz (2 real cores, 4 fake cores) with gcc 4.7.2 and
  clang 3.3
 
  gcc -O3
  v is 2.09131e+13
 
  real    2m0.393s
  user    1m57.610s
  sys     0m0.003s
 
  gcc -Ofast
  v is 2.09131e+13
 
  real    0m7.218s
  user    0m7.018s
  sys     0m0.017s
 
 
  nice. one thing we need to remember though is that it's not entirely
  free, one thing -Ofast turns on is -ffast-math,
  
   This option is not turned on by any -O option besides -Ofast since it
 can
   result in incorrect output for programs that depend on an exact
   implementation of IEEE or ISO rules/specifications for math functions.
 It
   may, however, yield faster code for programs that do not require the
   guarantees of these specifications.
  
 
  which may not be fit for our purposes.
 
 
  With the ifort compiler there is '-fp-model precise' which allows only
  optimizations which don't harm the results. Maybe gcc has something
  similar.

 In gcc, you can turn off -ffoo with -fno-foo; maybe this way you can
 use -Ofast -fno-fast-math to preserve IEEE specifications.
 
  Glad to see -floop-parallelize-all in gcc 4.7, it will help us identify
  places to focus OpenMP work on.
 
 
  Hamish
 



benchmark.sh
Description: Bourne shell script
#include <stdio.h>
#include <stdlib.h>

/* #define DEBUG 1 */

/* Prototypes for gathering and average computation */
static int gather_values(double **input, double *buff, int nrows,
                         int ncols, int mw_size, int row, int col, int dist);

static double average(double *values, int size);

int main(int argc, char **argv)
{
    int nrows, ncols, mw_size, size, dist;
    double **input = NULL, **output = NULL;
    int i, j;

    /* Check and parse the input parameters */
    if (argc != 4) {
        fprintf(stderr, "Warning!\n");
        fprintf(stderr, "Please specify the number of rows and columns and the "
                        "\nsize of the moving window (must be an odd number)\n");
        fprintf(stderr, "\nUsage: neighbor 5000 5000 51\n");
        fprintf(stderr, "\nUsing default values: rows = 5000, cols = 5000, moving window = 51\n");
        nrows = 5000;
        ncols = 5000;
        mw_size = 51;
    }
    else {
        sscanf(argv[1], "%d", &nrows);
        sscanf(argv[2], "%d", &ncols);
        sscanf(argv[3], "%d", &mw_size);

        if (mw_size % 2 == 0) {
            fprintf(stderr, "The size of the moving window must be odd\n");
            return -1;
        }
    }

    size = mw_size * mw_size;
    dist = mw_size / 2;

    /* Allocate input and output */
    input = (double **)calloc(nrows, sizeof(double *));
    output = (double **)calloc(nrows, sizeof(double *));

    if (input == NULL || output == NULL) {
        fprintf(stderr, "Unable to allocate arrays\n");
        return -1;
    }

    for (i = 0; i < nrows; i++) {
        input[i] = (double *)calloc(ncols, sizeof(double));
        output[i] = (double *)calloc(ncols, sizeof(double));

        if (input[i] == NULL || output[i] == NULL) {
            fprintf(stderr, "Unable to allocate arrays\n");
            return -1;
        }

#ifdef DEBUG
        for (j = 0; j < ncols; j++)
            input[i][j] = i + j;
#endif
    }

#pragma omp parallel for private(i, j)
    for (i = 0; i < nrows; i++) {
        for (j = 0; j < ncols; j++) {
            /* Value buffer with maximum size */
            double *buff = (double *)calloc(size, sizeof(double));

            /* Gather values in the moving window */
            int num = gather_values(input, buff, nrows, ncols, mw_size, i, j, dist);

            output[i][j] = average(buff, num);

            free(buff);
        }
    }

#ifdef DEBUG
    printf("\nInput\n");
    for (i = 0; i < nrows; i++) {
        for (j = 0; j < ncols; j++) {
            printf("%.2f ", input[i][j]);
        }
        printf("\n");
    }


Re: [GRASS-user] r.neighbors velocity

2013-06-29 Thread Sören Gebbert
More benchmark results on core i5 2410M 2 cores 4 threads, 8GB RAM:

gcc -Wall -fopenmp -lgomp -O3 main.c -o neighbor
time ./neighbor 5000 5000 23

export OMP_NUM_THREADS=1
real 0m27.052s
user 0m26.882s
sys 0m0.128s

export OMP_NUM_THREADS=2
real 0m15.579s
user 0m30.466s
sys 0m0.124s

export OMP_NUM_THREADS=4
real 0m10.454s
user 0m40.711s
sys 0m0.120s

gcc -Wall -fopenmp -lgomp -Ofast -march=core-avx-i main.c -o neighbor
time ./neighbor 5000 5000 23

export OMP_NUM_THREADS=1
real 0m17.090s
user 0m16.953s
sys 0m0.108s

export OMP_NUM_THREADS=2
real 0m9.957s
user 0m19.437s
sys 0m0.136s

export OMP_NUM_THREADS=4
real 0m7.476s
user 0m28.698s
sys 0m0.124s

opencc -Wall -mp -Ofast -march=auto main.c -o neighbor
time ./neighbor 5000 5000 23

export OMP_NUM_THREADS=1
real 0m19.095s
user 0m18.909s
sys 0m0.152s

export OMP_NUM_THREADS=2
real 0m11.203s
user 0m22.097s
sys 0m0.136s

export OMP_NUM_THREADS=4
real 0m8.648s
user 0m33.670s
sys 0m0.160s

Best regards
Soeren



2013/6/29 Sören Gebbert soerengebb...@googlemail.com

 Hi,
 I have implemented a real average neighborhood algorithm that runs in
 parallel using OpenMP. The source code and the benchmark shell script are
 attached.

 The neighbor program computes the average moving window of arbitrary size.
 The size of the map rows x cols and the size of the moving window  (odd
 number cols==rows) can be specified.

 ./neighbor rows cols mw_size

 IMHO the new program is better for compiler comparison and neighborhood
 operation performance.

 This is the benchmark on my 5 year old AMD phenom 4 core computer using 1,
 2 and 4 threads:

 gcc -Wall -fopenmp -lgomp -Ofast main.c -o neighbor
 export OMP_NUM_THREADS=1
 time ./neighbor 5000 5000 23
 real 0m37.211s
 user 0m36.998s
 sys 0m0.196s

 export OMP_NUM_THREADS=2
 time ./neighbor 5000 5000 23
 real 0m19.907s
 user 0m38.890s
 sys 0m0.248s

 export OMP_NUM_THREADS=4
 time ./neighbor 5000 5000 23
 real 0m10.170s
 user 0m38.466s
 sys 0m0.192s

 Happy hacking, compiling and testing. :)

 Best regards
 Soeren




 2013/6/29 Markus Metz markus.metz.gisw...@gmail.com

 On Sat, Jun 29, 2013 at 1:26 PM, Hamish hamis...@yahoo.com wrote:
  Markus Metz wrote:
 
  Some more results with Sören's test program on a Intel(R) Core(TM) i5
  CPU M450 @ 2.40GHz (2 real cores, 4 fake cores) with gcc 4.7.2 and
  clang 3.3
 
  gcc -O3
  v is 2.09131e+13
 
  real    2m0.393s
  user    1m57.610s
  sys     0m0.003s
 
  gcc -Ofast
  v is 2.09131e+13
 
  real    0m7.218s
  user    0m7.018s
  sys     0m0.017s
 
 
  nice. one thing we need to remember though is that it's not entirely
  free, one thing -Ofast turns on is -ffast-math,
  
   This option is not turned on by any -O option besides -Ofast since it
 can
   result in incorrect output for programs that depend on an exact
   implementation of IEEE or ISO rules/specifications for math functions.
 It
   may, however, yield faster code for programs that do not require the
   guarantees of these specifications.
  
 
  which may not be fit for our purposes.
 
 
  With the ifort compiler there is '-fp-model precise' which allows only
  optimizations which don't harm the results. Maybe gcc has something
  similar.

 In gcc, you can turn off -ffoo with -fno-foo; maybe this way you can
 use -Ofast -fno-fast-math to preserve IEEE specifications.
 
  Glad to see -floop-parallelize-all in gcc 4.7, it will help us identify
  places to focus OpenMP work on.
 
 
  Hamish
 





benchmark.sh
Description: Bourne shell script


Re: [GRASS-user] r.neighbors velocity

2013-06-28 Thread Ivan Marchesini
Hi Markus
you are perfectly right

the region is 4312*5576
the moving window 501
GRASS is the stable version on a machine with 8 cores and 32 GB RAM,
Ubuntu 12.04.

it seems that the proprietary software is able to perform the analysis
in 2/3 seconds

:-|

ciao


On Fri, 2013-06-28 at 00:02 +0200, Markus Neteler wrote:
 On Thu, Jun 27, 2013 at 9:01 AM, Ivan Marchesini
 ivan.marches...@gmail.com wrote:
  Hi all,
  A friend of mine (having skills also in some proprietary remote sensing
  software) is testing GRASS for some tasks.
  He is quite happy about the results, but he was really surprised by the
  time that r.neighbors takes for the analysis. The time is
  really large in his opinion
 
 Please post some indications: computational region size and
 moving window size, also which hardware/operating system.
 Otherwise it is hard to say anything...
 
 Markus




Re: [GRASS-user] r.neighbors velocity

2013-06-28 Thread Markus Neteler
On Fri, Jun 28, 2013 at 5:05 PM, Ivan Marchesini
ivan.marches...@gmail.com wrote:
 Hi Markus
 you are perfectly right

 the region is 4312*5576
 the moving window 501

So, you are running a 501x501 moving window over that map?
For what purpose?

 GRASS is the stable version on a machine with 8 core and 32 gb RAM.
 Ubuntu 12.04

 it seems that the proprietary software is able to perform the analysis
 in 2/3 seconds

Unlikely with a 501x501 moving window...

Markus


Re: [GRASS-user] r.neighbors velocity

2013-06-28 Thread Sören Gebbert
Hi Ivan,
this sounds very interesting.

Your map has a size of 4312*5576 pixels? That's about 100MB in case of a
type integer or type float map, or about 200MB in case of a type double map.
You must have a very fast HD or SSD to read and write such a map in under
2/3 seconds?

In case your moving window has a size of 501 pixel (not 501x501 pixel!),
the amount of operations that must be performed is at least 4312*5576*501.
That's about 12 billion ops. Amazing to do this in 2/3 seconds.
I have written a little program to see how my Intel core i5 performs
processing this amount of operations. Well it needs about 100 seconds.

Here is the code, compiled with optimization:

#include <stdio.h>

int main()
{
    unsigned int i, j, k;
    register double v = 0.0;

    for (i = 0; i < 4321; i++) {
        for (j = 0; j < 5576; j++) {
            for (k = 0; k < 501; k++) {
                v = v + (double)(i + j + k) / 3.0;
            }
        }
    }
    printf("v is %g\n", v);
}

soeren@vostro:~/src$ gcc -O3 numtest.c -o numtest
soeren@vostro:~/src$ time ./numtest
v is 2.09131e+13

real 1m49.292s
user 1m49.223s
sys 0m0.000s


Your proprietary software must run highly parallel using a fast GPU or an
ASIC to keep the processing time under 2/3 seconds?

Unfortunately r.neighbors is not able to compete with such a powerful
software, since it is not reading the entire map into RAM and does not run
on GPUs or ASICs. But r.neighbors is able to process maps that are too
large to fit into the RAM. :)

Can you please tell us what software is so incredibly fast?

Best regards
Soeren


2013/6/28 Ivan Marchesini ivan.marches...@gmail.com

 Hi Markus
 you are perfectly right

 the region is 4312*5576
 the moving window 501
 GRASS is the stable version on a machine with 8 core and 32 gb RAM.
 Ubuntu 12.04

 it seems that the proprietary software is able to perform the analysis
 in 2/3 seconds

 :-|

 ciao


 On Fri, 2013-06-28 at 00:02 +0200, Markus Neteler wrote:
  On Thu, Jun 27, 2013 at 9:01 AM, Ivan Marchesini
  ivan.marches...@gmail.com wrote:
   Hi all,
   A friend of mine (having skills also in some proprietary remote sensing
   software) is testing GRASS for some tasks.
   He is quite happy about the results, but he was really surprised by the
   time that r.neighbors takes for the analysis. The time is
   really large in his opinion
 
  Please post some indications: computational region size and
  moving window size, also which hardware/operating system.
  Otherwise it is hard to say anything...
 
  Markus




Re: [GRASS-user] r.neighbors velocity

2013-06-28 Thread Hamish


Ivan wrote:
 the region is 4312*5576
 the moving window 501
 GRASS is the stable version on a machine with 8 core and 32 gb RAM.
 Ubuntu 12.04

 it seems that the proprietary software is able to perform the analysis
 in 2/3 seconds

I expect he's probably correct in that statement, but it's the *compiler*
used not the code behind it, and GRASS compiled in the same way would
be/is just as fast.


Sören wrote:
 this sounds very interesting.

 Your map has a size of 4312*5576 pixel? That's about 100MB in case of
 a type integer or type float map or about 200MB in case of a type
 double map. You must have a very fast HD or SSD to read and write such
 a map in under 2/3 seconds?

500 MB/s I/O for an SSD is not unusual; 300 MB/s for spinning-platter RAID
is pretty common. It's good to run a few replicates of the benchmark,
so from the 2nd run on the data is already cached in RAM (as long as the
region is not too huge to hold it there).


 In case your moving window has a size of 501 pixel (not 501x501 pixel!),
 the amount of operations that must be performed is at least 4312*5576*501.
 That's about 12 billion ops. Amazing to do this in 2/3 seconds.
 I have written a little program to see how my Intel core i5 performs
 processing this amount of operations. Well it needs about 100 seconds.

I was able to get the same down to just over 1 second wall-time on a plain
consumer desktop chip. (!)


 Here is the code, compiled with optimization:

#include <stdio.h>

int main()
{
    unsigned int i, j, k;
    register double v = 0.0;

    for (i = 0; i < 4321; i++) {
        for (j = 0; j < 5576; j++) {
            for (k = 0; k < 501; k++) {
                v = v + (double)(i + j + k) / 3.0;
            }
        }
    }
    printf("v is %g\n", v);
}

 soeren@vostro:~/src$ gcc -O3 numtest.c -o numtest
 soeren@vostro:~/src$ time ./numtest 
 v is 2.09131e+13

 real    1m49.292s
 user    1m49.223s
 sys     0m0.000s

 Your proprietary software must run highly parallel using a fast
 GPU or an ASIC to keep the processing time under 2/3 seconds?

 Unfortunately r.neighbors is not able to compete with such a
 powerful software,

sure it is! :)


 since it is not reading the entire map into RAM and does not run
 on GPUs or ASICs. But r.neighbors is able to process maps that
 are too large to fit into the RAM. :)

 Can you please tell us what software is so incredibly fast?

I ran some quick trials with your sample program with both gcc 4.6
(ubuntu 12.04) and Intel's icc 12.1 on the same computer.

Diplomatically speaking, I see gcc 4.8 has just arrived in Debian/sid,
and I look forward to exploring how its new auto-vectorization features
are coming along.

The results however, are not so diplomatic and speak for themselves..
and for this simple test case* it isn't pretty.
(* so atypically easy for the compiler to optimize)


test system: i7 3770, lots of RAM
replicates are presented in horizontal columns.


standard gcc, with & without -O3 and -march=native: (all ~same)
real  1m14.507s | 1m14.559s | 1m14.513s | 1m14.514s
user  1m14.289s | 1m14.305s | 1m14.297s | 1m14.297s
sys   0m0.000s  | 0m0.028s  | 0m0.000s  | 0m0.000s
--

standard Intel icc with & without -O3:
v is 2.09131e+13

real  0m21.979s | 0m21.967s | 0m21.958s | 0m21.994s
user  0m21.909s | 0m21.901s | 0m21.897s | 0m21.929s
sys   0m0.000s  | 0m0.000s  | 0m0.000s  | 0m0.000s
--

icc with the -fast compiler switch:
$ icc -fast soeren_speed_test.c -o soeren_speed_test_icc_fast
   # note 900kb for the executable vs gcc's 8kb.
$ time ./soeren_speed_test_icc_fast
v is 2.09131e+13

real  0m3.273s | 0m3.274s | 0m3.275s
user  0m3.260s | 0m3.260s | 0m3.264s
sys   0m0.000s | 0m0.000s | 0m0.000s

(there's your 3 seconds)
--

icc -funroll-loops:
real  0m22.008s | 0m21.998s
user  0m21.941s | 0m21.929s
sys   0m0.000s  | 0m0.000s

(no extra gain in this case)
--

icc -parallel:  (running on 8 hyperthread (i.e. 4 real) cores)
   # binary size: 30kb
real  0m6.034s  | 0m6.005s  | 0m6.005s
user  0m46.531s | 0m46.603s | 0m46.519s
sys   0m0.024s  | 0m0.028s  | 0m0.044s
--

icc -parallel -fast:
   # binary size 2.2 megabytes
$ time ./soeren_speed_test_icc_parallel+fast 
v is 2.09131e+13

real  0m1.002s | 0m1.002s | 0m1.002s
user  0m6.768s | 0m6.796s | 0m6.780s
sys   0m0.004s | 0m0.004s | 0m0.008s

I tried a number of times but couldn't break the 1 second barrier. :)


-
I also ran it on an AMD Phenom II X6 1090T  (icc -xHost -- -xSSSE3 ?)
All times real; all output was "v is 2.09131e+13".

gcc 4.4.5 with standard-opts: 7kb binary
 == near parity single-threaded performance with the new i7 chip from
    the 2 year old AMD Phenom and older copy of gcc! (stock debian/squeeze)
  1m16.175s | 1m15.634s | 1m16.029s

icc 12.1 with standard-opts:
  0m32.975s | 0m33.079s | 0m33.249s

icc with -fast opt: (700kb binary)
  0m9.577s | 0m9.572s | 0m9.583s

icc with -parallel auto-MP: (31kb binary)
 == again near parity with the new i7 chip! even with the Intel-biased 
   

[GRASS-user] r.neighbors velocity

2013-06-27 Thread Ivan Marchesini
Hi all,
A friend of mine (having skills also in some proprietary remote sensing
software) is testing GRASS for some tasks.
He is quite happy about the results, but he was really surprised by the
time that r.neighbors takes for the analysis. The time is
really large in his opinion. Do you feel the same in your experience? Is
r.mfilter faster? Do you have other ideas for performing this kind of
kernel-based analysis using GRASS or other OS software?

many thanks 

Ivan   





Re: [GRASS-user] r.neighbors velocity

2013-06-27 Thread Markus Neteler
On Thu, Jun 27, 2013 at 9:01 AM, Ivan Marchesini
ivan.marches...@gmail.com wrote:
 Hi all,
 A friend of mine (having skills also in some proprietary remote sensing
 software) is testing GRASS for some tasks.
 He is quite happy about the results, but he was really surprised by the
 time that r.neighbors takes for the analysis. The time is
 really large in his opinion

Please post some indications: computational region size and
moving window size, also which hardware/operating system.
Otherwise it is hard to say anything...

Markus