Re: [GRASS-user] r.neighbors velocity
Hi Ivan,

2013/7/17 Ivan Marchesini <ivan.marches...@gmail.com>:
> Dear Soeren, Hamish, Markus M., Markus N.,
> sorry for the delay in answering, but I didn't expect that my question
> would prompt so many replies, and I was on holiday for 10 days with no
> way to test your code.
>
> First of all, a quick answer to Markus N.: my colleague is working on
> high-resolution images and DEMs. In this particular case he is working
> on a 1-meter-resolution raster map concerning landslides. He told me
> that, due to the dimensions of the landslides, he needs to use a
> 501x501 kernel in order to catch the signatures of the phenomena
> (I don't know the details exactly, I'm sorry).

I think it would be good if you could ask your colleague for more details, since we are all very curious about his computational approach.

> I'm not a C developer, but reading your e-mails it seems that the
> performance of C code (and r.neighbors is written in C) strongly
> depends on the compiler. Does that mean that by compiling the
> r.neighbors module in a different way (I don't know how) we can
> obtain better results?
>
> We have tested the latest code from Soeren on the same machine where
> the proprietary software (it is ENVI 5.x) showed those good
> performances.

The performance is good indeed. Unfortunately, you need to increase the moving window size from 23 to 501 to get a result comparable to your colleague's computation:

./neighbor 5000 5000 501

This will run the neighbor computation on 25,000,000 cells with a 501x501-pixel moving window.

Best regards
Soeren

> These are the results:
>
> gcc -Wall -fopenmp -lgomp -Ofast main.c -o neighbor
>
> export OMP_NUM_THREADS=1
> time ./neighbor 5000 5000 23
> real    0m16.598s
> user    0m16.477s
> sys     0m0.080s
>
> export OMP_NUM_THREADS=2
> time ./neighbor 5000 5000 23
> real    0m8.977s
> user    0m17.573s
> sys     0m0.080s
>
> export OMP_NUM_THREADS=4
> time ./neighbor 5000 5000 23
> real    0m5.993s
> user    0m20.277s
> sys     0m0.088s
>
> export OMP_NUM_THREADS=6
> time ./neighbor 5000 5000 23
> real    0m4.784s
> user    0m25.770s
> sys     0m0.096s
>
> Many thanks for your answers

___
grass-user mailing list
grass-user@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/grass-user
Re: [GRASS-user] r.neighbors velocity
Dear Soeren, Hamish, Markus M., Markus N.,

sorry for the delay in answering, but I didn't expect that my question would prompt so many replies, and I was on holiday for 10 days with no way to test your code.

First of all, a quick answer to Markus N.: my colleague is working on high-resolution images and DEMs. In this particular case he is working on a 1-meter-resolution raster map concerning landslides. He told me that, due to the dimensions of the landslides, he needs to use a 501x501 kernel in order to catch the signatures of the phenomena (I don't know the details exactly, I'm sorry).

I'm not a C developer, but reading your e-mails it seems that the performance of C code (and r.neighbors is written in C) strongly depends on the compiler. Does that mean that by compiling the r.neighbors module in a different way (I don't know how) we can obtain better results?

We have tested the latest code from Soeren on the same machine where the proprietary software (it is ENVI 5.x) showed those good performances. These are the results:

gcc -Wall -fopenmp -lgomp -Ofast main.c -o neighbor

export OMP_NUM_THREADS=1
time ./neighbor 5000 5000 23
real    0m16.598s
user    0m16.477s
sys     0m0.080s

export OMP_NUM_THREADS=2
time ./neighbor 5000 5000 23
real    0m8.977s
user    0m17.573s
sys     0m0.080s

export OMP_NUM_THREADS=4
time ./neighbor 5000 5000 23
real    0m5.993s
user    0m20.277s
sys     0m0.088s

export OMP_NUM_THREADS=6
time ./neighbor 5000 5000 23
real    0m4.784s
user    0m25.770s
sys     0m0.096s

Many thanks for your answers
Re: [GRASS-user] r.neighbors velocity
Hi,

here are the same results for Soeren's test program, with the Open64 compiler from AMD:
- Same AMD X6 CPU as below.
- Open64 compiler 4.5.2.1 from AMD (GPLv2, LGPL)

I just downloaded the pre-built RHEL5 binary tarball and it worked on Debian/squeeze; I just made an alias to the executable in the un-tarred bin/ dir to get it to work. See also
http://wiki.open64.net/index.php/Installation_on_Ubuntu
Source is available of course, but according to the Debian ITP ticket it's a bit of a pain to build there.

straight opencc:
real  0m59.015s | 0m58.972s | 0m58.963s
user  0m58.760s | 0m58.812s | 0m58.624s
sys   0m0.248s  | 0m0.136s  | 0m0.300s
--
opencc -O3:
real  0m35.203s | 0m35.173s | 0m35.204s
user  0m35.206s | 0m35.174s | 0m35.206s
sys   0m0.000s  | 0m0.000s  | 0m0.000s
--
opencc -Ofast (with or without -march=auto for native bytecode):
real  0m13.389s | 0m13.402s | 0m13.435s
user  0m13.389s | 0m13.405s | 0m13.437s
sys   0m0.000s  | 0m0.000s  | 0m0.000s
--
opencc -Ofast -march=auto -apo, on a 6-(real)-core CPU:
v is 2.09131e+13
real  0m2.552s  | 0m2.595s  | 0m2.591s
user  0m14.857s | 0m14.725s | 0m14.725s
sys   0m0.008s  | 0m0.024s  | 0m0.016s

'-apo' is autoparallelization: poorly documented, but it works! It adds OpenMP pragmas where it thinks they will cause a gain; I'm glad to see it's not just for the Fortran compiler anymore.

So the Open64 compiler is not quite as fast as Intel's for this test case, but it's pretty close, versus the more versatile gcc in the far distance. Executable file size for all of the above was less than 12 kb, since it can link to local OS shared libs. I haven't tried it with llvm/clang.

Now I wonder which flags to use to recreate -Ofast in gcc to make it a fairer comparison..

Hamish

> I also ran it on an AMD Phenom II X6 1090T (icc -xHost -- -xSSSE3 ?)
> All times real; all output was "v is 2.09131e+13".
>
> gcc 4.4.5 with standard opts: 7 kb binary
> == near parity in single-threaded performance with the new i7 chip,
> from the 2-year-old AMD Phenom and an older copy of gcc! (stock
> debian/squeeze)
> 1m16.175s | 1m15.634s | 1m16.029s
>
> icc 12.1 with standard opts:
> 0m32.975s | 0m33.079s | 0m33.249s
>
> icc with -fast opt: (700 kb binary)
> 0m9.577s | 0m9.572s | 0m9.583s
>
> icc with -parallel auto-MP: (31 kb binary)
> == again near parity with the new i7 chip! even with the Intel-biased
> compiler. User cpu-time was actually less: the advantage of 6 real
> cores vs 4 real + 4 virtual ones.
> real  0m6.406s  | 0m6.404s  | 0m6.404s
> user  0m37.106s | 0m37.170s | 0m37.106s
> sys   0m0.044s  | 0m0.040s  | 0m0.028s
>
> icc with -fast and -parallel: (2 mb binary)
> real  0m2.002s  | 0m2.002s  | 0m2.002s
> user  0m10.765s | 0m10.769s | 0m10.769s
> sys   0m0.016s  | 0m0.012s  | 0m0.008s
Re: [GRASS-user] r.neighbors velocity
Some more results with Sören's test program on an Intel(R) Core(TM) i5 CPU M450 @ 2.40GHz (2 real cores, 4 fake cores) with gcc 4.7.2 and clang 3.3:

gcc -O3
v is 2.09131e+13
real  2m0.393s
user  1m57.610s
sys   0m0.003s

gcc -Ofast
v is 2.09131e+13
real  0m7.218s
user  0m7.018s
sys   0m0.017s

gcc -Ofast -floop-parallelize-all is as fast as gcc -Ofast

clang -Ofast
v is 2.09131e+13
real  0m18.701s
user  0m18.285s
sys   0m0.000s

Markus M

On Sat, Jun 29, 2013 at 8:35 AM, Hamish hamis...@yahoo.com wrote:
> [Hamish's Open64 and icc benchmark results, quoted in full in his
> earlier message, snipped]
Re: [GRASS-user] r.neighbors velocity
Markus Metz wrote:
> Some more results with Sören's test program on an Intel(R) Core(TM)
> i5 CPU M450 @ 2.40GHz (2 real cores, 4 fake cores) with gcc 4.7.2
> and clang 3.3:
>
> gcc -O3
> v is 2.09131e+13
> real  2m0.393s
> user  1m57.610s
> sys   0m0.003s
>
> gcc -Ofast
> v is 2.09131e+13
> real  0m7.218s
> user  0m7.018s
> sys   0m0.017s

Nice. One thing we need to remember, though, is that it's not entirely free; one thing -Ofast turns on is -ffast-math:

  "This option is not turned on by any -O option besides -Ofast since
  it can result in incorrect output for programs that depend on an
  exact implementation of IEEE or ISO rules/specifications for math
  functions. It may, however, yield faster code for programs that do
  not require the guarantees of these specifications."

which may not be fit for our purposes. With the ifort compiler there is '-fp-model precise', which allows only optimizations that don't harm the results. Maybe gcc has something similar.

Glad to see -floop-parallelize-all in gcc 4.7; it will help us identify places to focus OpenMP work on.

Hamish
Re: [GRASS-user] r.neighbors velocity
On Sat, Jun 29, 2013 at 1:26 PM, Hamish hamis...@yahoo.com wrote:
> Markus Metz wrote:
> > [gcc -O3 vs -Ofast timings on the Core i5 M450 snipped]
>
> Nice. One thing we need to remember, though, is that it's not
> entirely free; one thing -Ofast turns on is -ffast-math, [...]
> which may not be fit for our purposes. With the ifort compiler
> there is '-fp-model precise', which allows only optimizations that
> don't harm the results. Maybe gcc has something similar.

In gcc you can turn off -ffoo with -fno-foo, so maybe this way you can use -Ofast -fno-fast-math to preserve IEEE specifications.

> Glad to see -floop-parallelize-all in gcc 4.7; it will help us
> identify places to focus OpenMP work on.
>
> Hamish
Re: [GRASS-user] r.neighbors velocity
Hey Folks,

many thanks for pointing out the important influence of different compilers and compiler options. But please be aware that my tiny little program is not representative of a neighborhood-analysis implementation; it was simply a demonstration of 12 billion ops:

1. I use fixed loop sizes; it is really easy for a compiler to optimize that.
2. It is pretty simple to parallelize, since only a simple reduction is done in the inner loop.
3. Most important: the statement of Ivan was a window size of 501 ... as Markus N. IMHO correctly interpreted, this leads to a moving window of 501x501 pixels, if this is an option for r.neighbors. It is not the total number of cells of a rectangular moving window, since that must be an even number in this case. Shapes other than rectangular are more complex to implement.

To be diplomatic I decided to use 501 pixels, which might represent a 23x21-pixel moving window, to show that even this small number of operations needs a considerable amount of time on modern CPUs. If you use a 501x501-pixel moving window, the computational effort is roughly 501 times 12 billion ops. IMHO in this case a GPU or a neighbor-algorithm-specific FPGA/ASIC may be able to perform this operation in 2-3 seconds.

Best regards
Soeren
Re: [GRASS-user] r.neighbors velocity
Hi,

I have implemented a real average-neighborhood algorithm that runs in parallel using OpenMP. The source code and the benchmark shell script are attached.

The neighbor program computes the moving-window average for a window of arbitrary size. The size of the map (rows x cols) and the size of the moving window (odd number, cols == rows) can be specified:

./neighbor rows cols mw_size

IMHO the new program is better for compiler comparison and neighborhood-operation performance. This is the benchmark on my 5-year-old AMD Phenom 4-core computer using 1, 2 and 4 threads:

gcc -Wall -fopenmp -lgomp -Ofast main.c -o neighbor

export OMP_NUM_THREADS=1
time ./neighbor 5000 5000 23
real  0m37.211s
user  0m36.998s
sys   0m0.196s

export OMP_NUM_THREADS=2
time ./neighbor 5000 5000 23
real  0m19.907s
user  0m38.890s
sys   0m0.248s

export OMP_NUM_THREADS=4
time ./neighbor 5000 5000 23
real  0m10.170s
user  0m38.466s
sys   0m0.192s

Happy hacking, compiling and testing. :)

Best regards
Soeren

2013/6/29 Markus Metz markus.metz.gisw...@gmail.com:
> [discussion of -Ofast, -ffast-math, -fp-model precise and
> -fno-fast-math snipped; quoted in full in the previous messages]

benchmark.sh
Description: Bourne shell script

#include <stdio.h>
#include <stdlib.h>

/* #define DEBUG 1 */

/* Prototypes for gathering and average computation */
static int gather_values(double **input, double *buff, int nrows, int ncols,
                         int mw_size, int col, int row, int dist);
static double average(double *values, int size);

int main(int argc, char **argv)
{
    int nrows, ncols, mw_size, size, dist;
    double **input = NULL, **output = NULL;
    int i, j;

    /* Check and parse the input parameters */
    if (argc != 4) {
        fprintf(stderr, "Warning!\n");
        fprintf(stderr, "Please specify the number of rows and columns and the"
                        "\nsize of the moving window (must be an odd number)\n");
        fprintf(stderr, "\nUsage: neighbor 5000 5000 51\n");
        fprintf(stderr, "\nUsing default values: rows = 5000, cols = 5000, "
                        "moving window = 51\n");
        nrows = 5000;
        ncols = 5000;
        mw_size = 51;
    } else {
        sscanf(argv[1], "%d", &nrows);
        sscanf(argv[2], "%d", &ncols);
        sscanf(argv[3], "%d", &mw_size);
        if (mw_size % 2 == 0) {
            fprintf(stderr, "The size of the moving window must be odd\n");
            return -1;
        }
    }

    size = mw_size * mw_size;
    dist = mw_size / 2;

    /* Allocate input and output */
    input  = (double **)calloc(nrows, sizeof(double *));
    output = (double **)calloc(nrows, sizeof(double *));
    if (input == NULL || output == NULL) {
        fprintf(stderr, "Unable to allocate arrays");
        return -1;
    }
    for (i = 0; i < nrows; i++) {
        input[i]  = (double *)calloc(ncols, sizeof(double));
        output[i] = (double *)calloc(ncols, sizeof(double));
        if (input[i] == NULL || output[i] == NULL) {
            fprintf(stderr, "Unable to allocate arrays");
            return -1;
        }
#ifdef DEBUG
        for (j = 0; j < ncols; j++)
            input[i][j] = i + j;
#endif
    }

#pragma omp parallel for private(i, j)
    for (i = 0; i < nrows; i++) {
        for (j = 0; j < ncols; j++) {
            /* Value buffer with maximum size */
            double *buff = (double *)calloc(size, sizeof(double));
            /* Gather values in the moving window */
            int num = gather_values(input, buff, nrows, ncols, mw_size,
                                    i, j, dist);
            output[i][j] = average(buff, num);
            free(buff);
        }
    }

#ifdef DEBUG
    printf("\nInput\n");
    for (i = 0; i < nrows; i++) {
        for (j = 0; j < ncols; j++) {
            printf("%.2f ", input[i][j]);
        }
        printf("\n");
    }

[attachment truncated in the archive]
Re: [GRASS-user] r.neighbors velocity
More benchmark results, on a Core i5 2410M (2 cores, 4 threads), 8GB RAM:

gcc -Wall -fopenmp -lgomp -O3 main.c -o neighbor
time ./neighbor 5000 5000 23

export OMP_NUM_THREADS=1
real  0m27.052s
user  0m26.882s
sys   0m0.128s

export OMP_NUM_THREADS=2
real  0m15.579s
user  0m30.466s
sys   0m0.124s

export OMP_NUM_THREADS=4
real  0m10.454s
user  0m40.711s
sys   0m0.120s

gcc -Wall -fopenmp -lgomp -Ofast -march=core-avx-i main.c -o neighbor
time ./neighbor 5000 5000 23

export OMP_NUM_THREADS=1
real  0m17.090s
user  0m16.953s
sys   0m0.108s

export OMP_NUM_THREADS=2
real  0m9.957s
user  0m19.437s
sys   0m0.136s

export OMP_NUM_THREADS=4
real  0m7.476s
user  0m28.698s
sys   0m0.124s

opencc -Wall -mp -Ofast -march=auto main.c -o neighbor
time ./neighbor 5000 5000 23

export OMP_NUM_THREADS=1
real  0m19.095s
user  0m18.909s
sys   0m0.152s

export OMP_NUM_THREADS=2
real  0m11.203s
user  0m22.097s
sys   0m0.136s

export OMP_NUM_THREADS=4
real  0m8.648s
user  0m33.670s
sys   0m0.160s

Best regards
Soeren

2013/6/29 Sören Gebbert soerengebb...@googlemail.com:
> [announcement of the parallel average-neighborhood program and the
> AMD Phenom benchmarks, quoted in full in the previous message]

benchmark.sh
Description: Bourne shell script
Re: [GRASS-user] r.neighbors velocity
Hi Markus,

you are perfectly right: the region is 4312*5576, the moving window 501. GRASS is the stable version, on a machine with 8 cores and 32 GB RAM, Ubuntu 12.04. It seems that the proprietary software is able to perform the analysis in 2-3 seconds :-|

ciao

On Fri, 2013-06-28 at 00:02 +0200, Markus Neteler wrote:
> On Thu, Jun 27, 2013 at 9:01 AM, Ivan Marchesini
> ivan.marches...@gmail.com wrote:
> > Hi all,
> > A friend of mine (who also has skills in some proprietary remote
> > sensing software) is testing GRASS for some tasks. He is quite
> > happy about the results, but he was really surprised by the time
> > that r.neighbors takes for the analysis. The time is really large
> > in his opinion.
>
> Please post some indications: computational region size and moving
> window size, also which hardware/operating system. Otherwise it is
> hard to say anything...
>
> Markus
Re: [GRASS-user] r.neighbors velocity
On Fri, Jun 28, 2013 at 5:05 PM, Ivan Marchesini ivan.marches...@gmail.com wrote:
> Hi Markus,
> you are perfectly right: the region is 4312*5576, the moving window
> 501.

So, you are running a 501x501 moving window over that map? For what purpose?

> GRASS is the stable version, on a machine with 8 cores and 32 GB
> RAM, Ubuntu 12.04. It seems that the proprietary software is able to
> perform the analysis in 2-3 seconds

Unlikely with a 501x501 moving window...

Markus
Re: [GRASS-user] r.neighbors velocity
Hi Ivan,

this sounds very interesting. Your map has a size of 4312*5576 pixels? That's about 100MB in case of an integer or float map, or about 200MB in case of a double map. You must have a very fast HD or SSD to read and write such a map in under 2-3 seconds?

In case your moving window has a size of 501 pixels (not 501x501 pixels!), the number of operations that must be performed is at least 4312*5576*501. That's about 12 billion ops. Amazing to do this in 2-3 seconds.

I have written a little program to see how my Intel Core i5 performs processing this amount of operations. Well, it needs about 100 seconds. Here is the code, compiled with optimization:

#include <stdio.h>

int main()
{
    unsigned int i, j, k;
    register double v = 0.0;

    for (i = 0; i < 4321; i++) {
        for (j = 0; j < 5576; j++) {
            for (k = 0; k < 501; k++) {
                v = v + (double)(i + j + k) / 3.0;
            }
        }
    }
    printf("v is %g\n", v);
}

soeren@vostro:~/src$ gcc -O3 numtest.c -o numtest
soeren@vostro:~/src$ time ./numtest
v is 2.09131e+13

real  1m49.292s
user  1m49.223s
sys   0m0.000s

Your proprietary software must run highly parallel, using a fast GPU or an ASIC, to keep the processing time under 2-3 seconds? Unfortunately r.neighbors is not able to compete with such powerful software, since it does not read the entire map into RAM and does not run on GPUs or ASICs. But r.neighbors is able to process maps that are too large to fit into RAM. :)

Can you please tell us what software is so incredibly fast?

Best regards
Soeren

2013/6/28 Ivan Marchesini ivan.marches...@gmail.com:
> Hi Markus,
> you are perfectly right: the region is 4312*5576, the moving window
> 501. GRASS is the stable version, on a machine with 8 cores and
> 32 GB RAM, Ubuntu 12.04. It seems that the proprietary software is
> able to perform the analysis in 2-3 seconds :-|
>
> ciao
>
> [Markus Neteler's request for region size, window size and hardware
> details snipped; quoted in the previous messages]
Re: [GRASS-user] r.neighbors velocity
Ivan wrote:
> the region is 4312*5576, the moving window 501. GRASS is the stable
> version on a machine with 8 cores and 32 GB RAM, Ubuntu 12.04. It
> seems that the proprietary software is able to perform the analysis
> in 2-3 seconds

I expect he's probably correct in that statement, but it's the *compiler* used, not the code behind it, and GRASS compiled in the same way would be/is just as fast.

Sören wrote:
> this sounds very interesting. Your map has a size of 4312*5576
> pixels? That's about 100MB in case of an integer or float map, or
> about 200MB in case of a double map. You must have a very fast HD or
> SSD to read and write such a map in under 2-3 seconds?

500 MB/s I/O for an SSD is not unusual; 300 MB/s for a spinning-platter RAID is pretty common. It's good to run a few replicates of the benchmark, so that on the 2nd+ runs the data is already cached in RAM (as long as the region is not too huge to hold it there).

> In case your moving window has a size of 501 pixels (not 501x501
> pixels!), the number of operations that must be performed is at
> least 4312*5576*501. That's about 12 billion ops. Amazing to do this
> in 2-3 seconds. I have written a little program to see how my Intel
> Core i5 performs processing this amount of operations. Well, it
> needs about 100 seconds.

I was able to get the same down to just over 1 second wall-time on a plain consumer desktop chip. (!)

> Here is the code, compiled with optimization:
> [Soeren's numtest.c source and gcc -O3 timing snipped; quoted in
> full in the previous message]
>
> Your proprietary software must run highly parallel, using a fast GPU
> or an ASIC, to keep the processing time under 2-3 seconds?
> Unfortunately r.neighbors is not able to compete with such powerful
> software,

sure it is! :)

> since it does not read the entire map into RAM and does not run on
> GPUs or ASICs. But r.neighbors is able to process maps that are too
> large to fit into RAM. :)
>
> Can you please tell us what software is so incredibly fast?

I ran some quick trials with your sample program, with both gcc 4.6 (Ubuntu 12.04) and Intel's icc 12.1 on the same computer. Diplomatically speaking, I see gcc 4.8 has just arrived in Debian/sid, and I look forward to exploring how its new auto-vectorization features are coming along. The results, however, are not so diplomatic and speak for themselves.. and for this simple test case* it isn't pretty. (* so atypically easy for the compiler to optimize)

test system: i7 3770, lots of RAM
replicates are presented in horizontal columns.

standard gcc, with/without -O3 and -march=native: (all ~same)
real  1m14.507s | 1m14.559s | 1m14.513s | 1m14.514s
user  1m14.289s | 1m14.305s | 1m14.297s | 1m14.297s
sys   0m0.000s  | 0m0.028s  | 0m0.000s  | 0m0.000s
--
standard Intel icc, with/without -O3:
v is 2.09131e+13
real  0m21.979s | 0m21.967s | 0m21.958s | 0m21.994s
user  0m21.909s | 0m21.901s | 0m21.897s | 0m21.929s
sys   0m0.000s  | 0m0.000s  | 0m0.000s  | 0m0.000s
--
icc with the -fast compiler switch:
$ icc -fast soeren_speed_test.c -o soeren_speed_test_icc_fast
# note: 900kb executable vs gcc's 8kb
$ time ./soeren_speed_test_icc_fast
v is 2.09131e+13
real  0m3.273s | 0m3.274s | 0m3.275s
user  0m3.260s | 0m3.260s | 0m3.264s
sys   0m0.000s | 0m0.000s | 0m0.000s
(there's your 3 seconds)
--
icc -funroll-loops:
real  0m22.008s | 0m21.998s
user  0m21.941s | 0m21.929s
sys   0m0.000s  | 0m0.000s
(no extra gain in this case)
--
icc -parallel: (running on 8 hyperthread (i.e. 4 real) cores)
# binary size: 30kb
real  0m6.034s  | 0m6.005s  | 0m6.005s
user  0m46.531s | 0m46.603s | 0m46.519s
sys   0m0.024s  | 0m0.028s  | 0m0.044s
--
icc -parallel -fast: # binary size 2.2 megabytes
$ time ./soeren_speed_test_icc_parallel+fast
v is 2.09131e+13
real  0m1.002s | 0m1.002s | 0m1.002s
user  0m6.768s | 0m6.796s | 0m6.780s
sys   0m0.004s | 0m0.004s | 0m0.008s

I tried a number of times but couldn't break the 1-second barrier. :)
-
I also ran it on an AMD Phenom II X6 1090T (icc -xHost -- -xSSSE3 ?)
All times real; all output was "v is 2.09131e+13".

gcc 4.4.5 with standard opts: 7kb binary
== near parity in single-threaded performance with the new i7 chip,
from the 2-year-old AMD Phenom and an older copy of gcc! (stock
debian/squeeze)
1m16.175s | 1m15.634s | 1m16.029s

icc 12.1 with standard opts:
0m32.975s | 0m33.079s | 0m33.249s

icc with -fast opt: (700kb binary)
0m9.577s | 0m9.572s | 0m9.583s

icc with -parallel auto-MP: (31kb binary)
== again near parity with the new i7 chip! even with the Intel-biased

[message truncated in the archive]
[GRASS-user] r.neighbors velocity
Hi all,

A friend of mine (who also has skills in some proprietary remote sensing software) is testing GRASS for some tasks. He is quite happy about the results, but he was really surprised by the time that r.neighbors takes for the analysis. The time is really large in his opinion. Do you feel the same in your experience? Is r.mfilter faster? Do you have other ideas for performing this kind of kernel-based analysis using GRASS or other open source software?

many thanks
Ivan
Re: [GRASS-user] r.neighbors velocity
On Thu, Jun 27, 2013 at 9:01 AM, Ivan Marchesini ivan.marches...@gmail.com wrote:
> Hi all,
> A friend of mine (who also has skills in some proprietary remote
> sensing software) is testing GRASS for some tasks. He is quite happy
> about the results, but he was really surprised by the time that
> r.neighbors takes for the analysis. The time is really large in his
> opinion.

Please post some indications: computational region size and moving window size, also which hardware/operating system. Otherwise it is hard to say anything...

Markus