Re: [Pixman] [PATCH/RFC] Use OpenMP for bilinear scaled fast paths

2012-06-26 Thread Søren Sandmann
Søren Sandmann sandm...@cs.au.dk writes:

 The main concern from me is making sure that it doesn't cause issues in
 the X server, which is known to do wacky things with signals and
 possibly threads. But the answer to that is to just put it in and get it
 tested.

In some limited testing of this patch, I found that:

- It did indeed cause crashes in the input system with the X server that
  was in Fedora 14. I think these are known bugs that have been fixed in
  newer X servers. (Should we care whether we trigger bugs in older X
  servers?)

- With the X server in Fedora 17 it does not cause crashes.

- When I go to 

http://ie.microsoft.com/testdrive/Performance/FishIETank/

  the X server will max out 3.5 cores and firefox will use the remaining
  half core, but judging from looking at the fish and the page's FPS
  meter, the performance isn't actually better.

  Profiling shows that 50% to 75% of the time is spent in a function in
  libgomp.so called something like gomp_wait_for_barrier().


Søren
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [PATCH/RFC] Use OpenMP for bilinear scaled fast paths

2012-06-26 Thread Siarhei Siamashka
On Wed, Jun 27, 2012 at 4:53 AM, Søren Sandmann sandm...@cs.au.dk wrote:
 Søren Sandmann sandm...@cs.au.dk writes:

 The main concern from me is making sure that it doesn't cause issues in
 the X server, which is known to do wacky things with signals and
 possibly threads. But the answer to that is to just put it in and get it
 tested.

 In some limited testing of this patch, I found that:

 - It did indeed cause crashes in the input system with the X server that
  was in Fedora 14. I think these are known bugs that have been fixed in
  newer X servers. (Should we care whether we trigger bugs in older X
  servers?)

 - With the X server in Fedora 17 it does not cause crashes.

 - When I go to

    http://ie.microsoft.com/testdrive/Performance/FishIETank/

  the X server will max out 3.5 cores and firefox will use the remaining
  half core, but judging from looking at the fish and the page's FPS
  meter, the performance isn't actually better.

  Profiling shows that 50% to 75% of the time is spent in a function in
  libgomp.so called something like gomp_wait_for_barrier().

A quick search for gomp_wait_for_barrier references on the Internet
suggests that OMP_WAIT_POLICY [1] might not be set to PASSIVE, so the
threads which finish their job before the others just spin. I'm also
forcing static scheduling via the schedule clause, which may contribute
to this problem (I thought that dynamic scheduling might be a bad idea
and cause higher overhead for smaller images). There is also the if
clause in the omp pragma, which can be used to avoid multi-threaded
processing in the cases where it performs poorly (very small images).
This stuff may need a lot of tuning to ensure that OpenMP is always a
gain and never a loss.

[1] http://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html
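
For readers unfamiliar with the two clauses, here is a minimal sketch of
how they look on a row loop. It is not the pixman code: scale_rows and
MIN_PARALLEL_ROWS are made-up names, and the threshold value is only a
placeholder that would have to be found by benchmarking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical threshold below which threading is assumed not to pay off. */
#define MIN_PARALLEL_ROWS 16

/* Doubles every pixel of a width x height image, one row per iteration. */
void
scale_rows (uint32_t *dst, const uint32_t *src,
            int width, int height, int stride)
{
    int i;

    assert (dst != NULL && src != NULL && stride >= width);

    /* schedule(static): rows are split into contiguous chunks up front.
     * if(...): run on a single thread when the image is small. */
    #pragma omp parallel for schedule(static) if(height >= MIN_PARALLEL_ROWS)
    for (i = 0; i < height; i++)
    {
        const uint32_t *s = src + (size_t) stride * i;
        uint32_t *d = dst + (size_t) stride * i;
        int x;

        for (x = 0; x < width; x++)
            d[x] = s[x] * 2;
    }
}
```

The wait policy is not part of the code. It is read from the environment
at startup, e.g. "export OMP_WAIT_POLICY=PASSIVE" before starting the
process. Without -fopenmp the pragma is ignored and the loop runs serially.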

-- 
Best regards,
Siarhei Siamashka


Re: [Pixman] [PATCH/RFC] Use OpenMP for bilinear scaled fast paths

2012-06-25 Thread Chris Wilson
On Mon, 25 Jun 2012 02:00:27 +0300, Siarhei Siamashka 
siarhei.siamas...@gmail.com wrote:
 Does it actually make sense? I remember somebody was strongly opposing
 the idea of spawning threads in pixman in the past, but can't find
 this e-mail right now.

The only caveat from my point of view is that pixman_image_composite()
must be atomic as the current cairo_image_surface_t is meant to be
synchronous. Or at least an API should be added so that I can serialise
the operations within cairo_image_surface_t. In the past, I believe we've
suggested grander schemes that would require us to expose the
asynchronous nature to the user. However, simply using OpenMP to
parallelise the kernels should not leak across the interface and so it is
acceptable. So it just boils down to whether this makes maintenance
harder and interferes with future plans...

Is there a way to hint to OpenMP how many threads to use? As we know the
memory characteristics for most of the routines, do we not want to hint to
OMP not to use more threads than required to saturate memory bw? If it
was able to automatically fine tune itself, could we then not open up
more kernels for parallelisation? (Granted the scaling loops have the
worst performance characteristics, not even rivalling the
single-threaded performance of skia.)
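
As a point of reference for the question above, OpenMP does have a
per-loop hint: the num_threads clause on the parallel construct (the
OMP_NUM_THREADS environment variable sets the process-wide default). The
sketch below assumes a memory-bound fill loop; fill_rows and
MAX_FILL_THREADS are hypothetical names, and the cap of 2 is an arbitrary
placeholder rather than a measured value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cap: the thread count beyond which this memory-bound
 * loop is assumed to stop scaling on the target machine. */
#define MAX_FILL_THREADS 2

/* Fills a width x height rectangle with a solid color. */
void
fill_rows (uint32_t *dst, int width, int height, int stride, uint32_t color)
{
    int i;

    assert (dst != NULL && stride >= width);

    /* num_threads(): request at most this many threads for this loop only. */
    #pragma omp parallel for num_threads(MAX_FILL_THREADS) schedule(static)
    for (i = 0; i < height; i++)
    {
        uint32_t *d = dst + (size_t) stride * i;
        int x;

        for (x = 0; x < width; x++)
            d[x] = color;
    }
}
```

The clause only sets the requested team size for that one region; the
runtime may still give the loop fewer threads.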

Otherwise it's a big win for such a tiny patch! We just need to cross-check
that we don't introduce a regression on the older single-core no-cache
chips. :(

Siarhei, just one more thing to consider: tiling. :)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


Re: [Pixman] [PATCH/RFC] Use OpenMP for bilinear scaled fast paths

2012-06-25 Thread Søren Sandmann
Chris Wilson ch...@chris-wilson.co.uk writes:

 On Mon, 25 Jun 2012 02:00:27 +0300, Siarhei Siamashka 
 siarhei.siamas...@gmail.com wrote:
 Does it actually make sense? I remember somebody was strongly opposing
 the idea of spawning threads in pixman in the past, but can't find
 this e-mail right now.

You may be remembering an IRC discussion about it, where Joonas was
opposed to libraries spawning threads:

http://people.freedesktop.org/~sandmann/joonas-threads

 The only caveat from my point of view is that pixman_image_composite()
 must be atomic as the current cairo_image_surface_t is meant to be
 synchronous. Or at least API added so that I can serialise the

The main concern from me is making sure that it doesn't cause issues in
the X server, which is known to do wacky things with signals and
possibly threads. But the answer to that is to just put it in and get it
tested.

 operations within cairo_image_surface_t. In the past, I believe we've
 suggested grander schemes that would require us to expose the
 asynchronous nature to the user. However, simply using OpenMP to
 parallelise the kernels should not leak across the interface and so it is
 acceptable. So it just boils down to whether this makes maintenance
 harder and interferes with future plans...

At some point, I think grander schemes will be useful, where grander
scheme might mean rolling our own thread pool and/or adding an
asynchronous API to pixman.

One case is radial gradients. These are generated through iterators, and
I am not sure that OpenMP is up to the task of parallelizing those. That
is, it doesn't seem likely that OpenMP can deal with code like this:

   iter_init (src_iter, height);
   iter_init (dest_iter, height);
   for (i = 0; i < height; ++i)
   {
       iter_fetch (src_iter);
       iter_fetch (dest_iter);
       combine ();
       iter_write (dest_iter);
   }

But that doesn't mean that OpenMP can't be used for the things that it
will deal with.
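
To illustrate why the loop above is awkward for OpenMP: each iter_fetch
call depends on the state left behind by the previous one. One possible
restructuring, sketched below under the assumption that an iterator can
be positioned at an arbitrary row, gives every iteration its own
iterator. The type row_iter_t and the function iter_init_at are
hypothetical and do not exist in pixman, and the "gradient" is a
stand-in whose pixels depend only on (x, y).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WIDTH 64   /* hypothetical scanline buffer size */

typedef struct
{
    int      y;                   /* row the next fetch will produce */
    int      width;
    uint32_t buffer[MAX_WIDTH];   /* scanline scratch space */
} row_iter_t;

/* Hypothetical: position an iterator at an arbitrary row. */
static void
iter_init_at (row_iter_t *iter, int width, int y)
{
    assert (width > 0 && width <= MAX_WIDTH);
    iter->width = width;
    iter->y = y;
}

/* Stand-in for a gradient fetch: fills the buffer for the current row. */
static void
iter_fetch (row_iter_t *iter)
{
    int x;

    for (x = 0; x < iter->width; x++)
        iter->buffer[x] = (uint32_t) (iter->y * 1000 + x);
    iter->y++;
}

void
render (uint32_t *dst, int width, int height)
{
    int i;

    assert (dst != NULL);

    #pragma omp parallel for schedule(static)
    for (i = 0; i < height; i++)
    {
        row_iter_t src_iter;   /* declared in the loop body, so private */
        int x;

        iter_init_at (&src_iter, width, i);
        iter_fetch (&src_iter);
        for (x = 0; x < width; x++)
            dst[(size_t) i * width + x] = src_iter.buffer[x];
    }
}
```

Re-initializing per row is only reasonable if seeking is cheap; whether
that holds for the real iterators is exactly the open question.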

 Is there a way to hint to OpenMP how many threads to use? As we know the
 memory characteristics for most of the routines, do we not want to hint to
 OMP not to use more threads than required to saturate memory bw?

We know the memory characteristics, but the arithmetic characteristics
are less predictable. If some operation is doing a lot of arithmetic, we
want more threads for it.

What would be the performance impact of just parallelizing as much as
possible? I suppose if one thread can saturate the memory bandwidth,
having more threads would just pointlessly occupy more cores that could
be used for other purposes. I don't know how much of a concern that
actually is though.

I suppose a JIT compiler might be able to make an estimate of the number
of cycles per cache line accessed for the code it generated.

 Otherwise it's a big win for such a tiny patch! Just need to cross-check
 that we don't introduce regression on the older single-core no-cache
 chips. :(

Even if it is a small performance regression on single-core chips, I
still think it's worth it. Single-core chips are quickly becoming a
thing of the past, and we could offer a --disable-omp configure argument
for embedded systems where the CPU is known to be single-core ahead of
time.
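
A compile-time switch of that kind could look roughly like the sketch
below. USE_OPENMP is a hypothetical macro that a configure script would
define unless --disable-omp was given, and OMP_PARALLEL_ROWS and
invert_rows are made-up names. _OPENMP is the standard macro that the
compiler defines when OpenMP is enabled (e.g. gcc -fopenmp).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(USE_OPENMP) && defined(_OPENMP)
/* OpenMP requested and supported: annotate the loop. */
#define OMP_PARALLEL_ROWS _Pragma("omp parallel for schedule(static)")
#else
/* Disabled: the macro expands to nothing and the loop stays serial. */
#define OMP_PARALLEL_ROWS
#endif

/* Inverts every pixel of a width x height image in place. */
void
invert_rows (uint32_t *bits, int width, int height, int stride)
{
    int i;

    assert (bits != NULL && stride >= width);

    OMP_PARALLEL_ROWS
    for (i = 0; i < height; i++)
    {
        uint32_t *row = bits + (size_t) stride * i;
        int x;

        for (x = 0; x < width; x++)
            row[x] = ~row[x];
    }
}
```

With the macro empty, the compiled code should be identical to the
plain serial loop, so single-core builds would not pay for the feature.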


Soren


[Pixman] [PATCH/RFC] Use OpenMP for bilinear scaled fast paths

2012-06-24 Thread Siarhei Siamashka
Does it actually make sense? I remember somebody was strongly opposing
the idea of spawning threads in pixman in the past, but can't find
this e-mail right now.

Even if using multithreaded rendering is acceptable, the next question is
whether to rely on OpenMP for it. Currently OpenMP is disabled in the
Android toolchain by default:
https://groups.google.com/forum/#!topic/android-ndk/pUfqxURgNbQ
Clang/LLVM does not support OpenMP either.

Some benchmarks with cairo-perf-trace (gcc 4.7.1, CFLAGS=-O2 -fopenmp):

=== Core i7 860 @2.8GHz ===

before patch:
[  0]    image    firefox-fishtank   66.912   66.931   0.13%    3/3

export OMP_NUM_THREADS=1
[  0]    image    firefox-fishtank   67.285   67.393   0.12%    3/3

export OMP_NUM_THREADS=2
[  0]    image    firefox-fishtank   40.156   40.192   0.07%    3/3

export OMP_NUM_THREADS=3
[  0]    image    firefox-fishtank   31.152   31.241   0.21%    3/3

export OMP_NUM_THREADS=4
[  0]    image    firefox-fishtank   26.507   26.540   0.15%    3/3

=== Radeon HD 6770 (xf86-video-ati-6.14.4, Mesa 8.1-devel (git-6e7756d)) ===

[  0]     xlib    firefox-fishtank   34.135   34.156   0.23%    3/3
[  0]       gl    firefox-fishtank    5.671    5.755   0.89%    3/3

---
 pixman/pixman-inlines.h |   24 +++-
 1 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/pixman/pixman-inlines.h b/pixman/pixman-inlines.h
index 3532867..7ba0d09 100644
--- a/pixman/pixman-inlines.h
+++ b/pixman/pixman-inlines.h
@@ -765,6 +765,14 @@ bilinear_pad_repeat_get_scanline_bounds (int32_t source_image_width,
  *   range and can fit into unsigned byte or be used with 8-bit SIMD
  *   multiplication instructions.
  */
+
+#define OMP_BILINEAR_PARALLEL_FOR _Pragma("omp parallel for default(none)      \
+ firstprivate(height,dst_line,dst_stride,unit_y,unit_x,src_first_line,         \
+ src_stride,max_vx,right_pad,left_pad,left_tz,right_tz,src_width,              \
+ src_width_fixed,src_image,need_src_extension,mask_line,                       \
+ mask_stride,v,vy,width)                                                       \
+ private(vx,y1,y2,mask) schedule(static) if(height > 1)")
+
 #define FAST_BILINEAR_MAINLOOP_INT(scale_func_name, scanline_func, src_type_t, mask_type_t, \
                                    dst_type_t, repeat_mode, flags)              \
 static void                                                                     \
@@ -782,7 +790,7 @@ fast_composite_scaled_bilinear ## scale_func_name (pixman_implementation_t *imp,
     pixman_fixed_t unit_x, unit_y;                                              \
     int32_t left_pad, left_tz, right_tz, right_pad;                             \
                                                                                 \
-    dst_type_t *dst;                                                            \
+    int i;                                                                      \
     mask_type_t solid_mask;                                                     \
     const mask_type_t *mask = &solid_mask;                                      \
     int src_stride, mask_stride, dst_stride;                                    \
@@ -864,20 +872,19 @@ fast_composite_scaled_bilinear ## scale_func_name (pixman_implementation_t *imp,
         src_width_fixed = pixman_int_to_fixed (src_width);                      \
     }                                                                           \
                                                                                 \
-    while (--height >= 0)                                                       \
+    OMP_BILINEAR_PARALLEL_FOR                                                   \
+    for (i = 0; i < height; i++)                                                \
     {                                                                           \
         int weight1, weight2;                                                   \
-        dst = dst_line;                                                         \
-        dst_line += dst_stride;                                                 \
+        dst_type_t *dst = dst_line + (uintptr_t)dst_stride * i;                 \
         vx = v.vector[0];                                                       \
         if (flags & FLAG_HAVE_NON_SOLID_MASK)                                   \
         {                                                                       \
-            mask = mask_line;                                                   \
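
The core of the change above is replacing a pointer that is advanced on
every iteration with a row address computed from the loop index, so
that iterations no longer depend on each other. The following
standalone sketch shows the two forms side by side; fill_serial and
fill_indexed are illustrative names, not pixman functions, and the
sketch uses ptrdiff_t for the stride arithmetic rather than the cast
used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Before: each iteration needs the pointer left by the previous one. */
void
fill_serial (uint32_t *dst_line, int dst_stride,
             int width, int height, uint32_t value)
{
    assert (dst_line != NULL);

    while (--height >= 0)
    {
        uint32_t *dst = dst_line;
        int x;

        dst_line += dst_stride;
        for (x = 0; x < width; x++)
            dst[x] = value;
    }
}

/* After: the row address depends only on i, so rows are independent. */
void
fill_indexed (uint32_t *dst_line, int dst_stride,
              int width, int height, uint32_t value)
{
    int i;

    assert (dst_line != NULL);

    #pragma omp parallel for schedule(static) if(height > 1)
    for (i = 0; i < height; i++)
    {
        uint32_t *dst = dst_line + (ptrdiff_t) dst_stride * i;
        int x;

        for (x = 0; x < width; x++)
            dst[x] = value;
    }
}
```

Both functions write the same pixels; only the second can be handed to
"omp parallel for" as is.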

Re: [Pixman] [PATCH/RFC] Use OpenMP for bilinear scaled fast paths

2012-06-24 Thread Siarhei Siamashka
On Mon, Jun 25, 2012 at 2:00 AM, Siarhei Siamashka
siarhei.siamas...@gmail.com wrote:
 Some benchmarks with cairo-perf-trace (gcc 4.7.1, CFLAGS=-O2 -fopenmp):

 === Core i7 860 @2.8GHz ===

 before patch:
 [  0]    image             firefox-fishtank   66.912   66.931   0.13%    3/3

 export OMP_NUM_THREADS=1
 [  0]    image             firefox-fishtank   67.285   67.393   0.12%    3/3

 export OMP_NUM_THREADS=2
 [  0]    image             firefox-fishtank   40.156 40.192   0.07%    3/3

 export OMP_NUM_THREADS=3
 [  0]    image             firefox-fishtank   31.152   31.241   0.21%    3/3

 export OMP_NUM_THREADS=4
 [  0]    image             firefox-fishtank   26.507   26.540   0.15%    3/3

 === Radeon HD 6770 (xf86-video-ati-6.14.4, Mesa 8.1-devel (git-6e7756d)) 

 [  0]     xlib             firefox-fishtank   34.135   34.156   0.23%    3/3
 [  0]       gl             firefox-fishtank    5.671    5.755   0.89%    3/3

Almost forgot, the benchmarks would have been incomplete without also
trying LLVMpipe:

$ export LIBGL_ALWAYS_SOFTWARE=1
$ export CAIRO_TEST_TARGET=gl
$ cairo/perf/cairo-perf-trace -i3 cairo-traces/benchmark/firefox-fishtank.trace

[ # ]  backend                         test   min(s) median(s) stddev. count
[ # ]       gl: VMware, Inc. Gallium 0.4 on llvmpipe (LLVM 0x301) 2.1 Mesa 8.1-devel (git-6e7756d)
[  0]       gl             firefox-fishtank  112.933  113.604   0.32%   3/3

-- 
Best regards,
Siarhei Siamashka