Hi, all!
Currently I'm experimenting with OpenMP:
https://bisqwit.iki.fi/story/howto/openmp/
--quote---
Support in different compilers
GCC (GNU Compiler Collection) supports OpenMP 4.5 since version 6.1, OpenMP 4.0
since version 4.9, OpenMP 3.1 since version 4.7, OpenMP 3.0 since version 4.4,
and OpenMP 2.5 since version 4.2. Add the commandline option -fopenmp to enable
it. OpenMP offloading is supported for Intel MIC targets only (Intel Xeon Phi
KNL + emulation) since version 5.1, and to NVidia (NVPTX) targets since version
7 or so.
[...]
The syntax
All OpenMP constructs in C and C++ are indicated with a #pragma omp followed
by parameters, ending in a newline. The pragma usually applies only into the
statement immediately following it, except for the barrier and flush commands,
which do not have associated statements.
The parallel construct
The parallel construct starts a parallel block. It creates a team of N threads
(where N is determined at runtime, usually from the number of CPU cores, but
may be affected by a few things), all of which execute the next statement (or
the next block, if the statement is a {…} -enclosure). After the statement, the
threads join back into one.
  #pragma omp parallel
  {
    // Code inside this region runs in parallel.
    printf("Hello!\n");
  }
This code creates a team of threads, and each thread executes the same code.
It prints the text "Hello!" followed by a newline, as many times as there are
threads in the team created. For a dual-core system, it will output the text
twice. (Note: It may also output something like "HeHlellolo", depending on
system, because the printing happens in parallel.) At the }, the threads are
joined back into one, as if in non-threaded program.
Internally, GCC implements this by creating a magic function and moving the
associated code into that function, so that all the variables declared within
that block become local variables of that function (and thus, locals to each
thread).
ICC, on the other hand, uses a mechanism resembling fork(), and does not
create a magic function. Both implementations are, of course, valid, and
semantically identical.
Variables shared from the context are handled transparently, sometimes by
passing a reference and sometimes by using register variables which are flushed
at the end of the parallel block (or whenever a flush is executed).
--quote end---
http://gregslabaugh.net/publications/OpenMP_SPM.pdf
Multicore Image Processing with OpenMP
Greg Slabaugh, Richard Boyes, Xiaoyun Yang
https://nccastaff.bournemouth.ac.uk/jmacey/OpenMP/
--quote--
OpenMP by Rob Bateman
Introduction
OpenMP is an open standard that lets you easily make use of multi-threaded
processors. It's currently supported by the following compilers: Visual C++,
gcc (though not the Win32 version that comes with cygwin), XCode, and the Intel
compiler; and it's supported on the following platforms: Win32, Linux, MacOS,
XBox360*, and PS3*.
* Not amazingly well on those platforms
--quote end--
I used the bcast2000 code as an example, namely bcast/overlayframe.C,
with these CFLAGS:
CFLAGS = -O3 -fpermissive -fomit-frame-pointer -march=pentium3 -ffast-math \
         -mfpmath=both -fopenmp -I/usr/local/include
I also enabled linking with libgomp (gcc 5.5.0) by adding -lgomp to
bcast-2000c/bcast/Makefile.
So far it makes the code slower :}
but it does use all the processors :} unlike the original code.
diff --git a/bcast/overlayframe.C b/bcast/overlayframe.C
index 9347687..d941d6e 100644
--- a/bcast/overlayframe.C
+++ b/bcast/overlayframe.C
@@ -256,7 +256,7 @@ int OverlayFrame::transfer_scale_f(VFrame *output, VFrame *input_v, unsigned cha
int *yinput_pixel2;
float *yinput_fraction1;
float *yinput_fraction2;
- register int y_out, x_out, h_out, w_out;
+ int y_out, x_out, h_out, w_out;
xinput_pixel1 = new int[output->get_w() + 1];
xinput_pixel2 = new int[output->get_w() + 1];
@@ -368,7 +368,7 @@ int OverlayFrame::transfer_scale_f(VFrame *output, VFrame *input_v, unsigned cha
float *yinput_fraction1;
float *yinput_fraction2;
float *yinput_fraction3;
- register int y_out, x_out, h_out, w_out, i;
+ int y_out, x_out, h_out, w_out, i;
xinput_pixel1 = new int[output->get_w() + 1];
xinput_pixel2 = new int[output->get_w() + 1];
@@ -466,7 +466,7 @@ int OverlayFrame::get_scale_array(int *column_table, int *row_table,
int out_x1, int out_y1, int out_x2, int out_y2)
{
int y_out;
- register int i;
+ int i;
float w_in = in_x2 - in_x1;
float h_in = in_y2 - in_y1;
int w_out = out_x2 - out_x1;
@@ -630,8 +630,9 @@ int OverlayFrame::transfer_row_direct(VPixel *output, VPixel *input, int out_col
{
float a_float;
a_float = (float)alpha / VMAX;
-
- for(register int i = 0; i < out_columns; i++)
+#pragma omp parallel num_threads(3)
+#pragma omp parallel for
+ for(int i = 0; i < out_columns; i++)
{
pixel_overlay->overlay_pixel_f(output[i], input[i], a_float);
}
@@ -642,7 +643,9 @@ int