Re: [Pixman] [PATCH/RFC] Use OpenMP for bilinear scaled fast paths
Søren Sandmann sandm...@cs.au.dk writes:

> The main concern from me is making sure that it doesn't cause issues in
> the X server, which is known to do wacky things with signals and
> possibly threads. But the answer to that is to just put it in and get it
> tested.

In some limited testing of this patch, I found that:

- It did indeed cause crashes in the input system with the X server that
  was in Fedora 14. I think these are known bugs that have been fixed in
  newer X servers. (Should we care whether we trigger bugs in older X
  servers?)

- With the X server in Fedora 17 it does not cause crashes.

- When I go to

      http://ie.microsoft.com/testdrive/Performance/FishIETank/

  the X server will max out 3.5 cores and firefox will use the remaining
  half core, but judging from looking at the fish and the page's FPS
  meter, the performance isn't actually better. Profiling shows that 50%
  to 75% of the time is spent in a function in libgomp.so called
  something like gomp_wait_for_barrier().

Søren
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
Re: [Pixman] [PATCH/RFC] Use OpenMP for bilinear scaled fast paths
On Wed, Jun 27, 2012 at 4:53 AM, Søren Sandmann sandm...@cs.au.dk wrote:
> Søren Sandmann sandm...@cs.au.dk writes:
>
>> The main concern from me is making sure that it doesn't cause issues in
>> the X server, which is known to do wacky things with signals and
>> possibly threads. But the answer to that is to just put it in and get
>> it tested.
>
> In some limited testing of this patch, I found that:
>
> - It did indeed cause crashes in the input system with the X server that
>   was in Fedora 14. I think these are known bugs that have been fixed in
>   newer X servers. (Should we care whether we trigger bugs in older X
>   servers?)
>
> - With the X server in Fedora 17 it does not cause crashes.
>
> - When I go to
>
>       http://ie.microsoft.com/testdrive/Performance/FishIETank/
>
>   the X server will max out 3.5 cores and firefox will use the remaining
>   half core, but judging from looking at the fish and the page's FPS
>   meter, the performance isn't actually better. Profiling shows that 50%
>   to 75% of the time is spent in a function in libgomp.so called
>   something like gomp_wait_for_barrier().

From a quick search for gomp_wait_for_barrier references on the Internet,
it sounds like OMP_WAIT_POLICY [1] might not be set to PASSIVE, so the
threads which finish their job before the others are just spinning.

I'm also forcing static scheduling via the schedule clause, which may
also contribute to this problem (I thought that dynamic scheduling might
be a bad idea and cause higher overhead for smaller images). And there is
an if clause in the omp pragma, which can be used to avoid multi-threaded
processing in the cases where it performs poorly (very small images).

This stuff may need a lot of tuning to ensure that OpenMP is always a
gain and never a loss.

[1] http://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html

-- 
Best regards,
Siarhei Siamashka
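The schedule and if clauses discussed above can be shown in isolation. This is only a minimal sketch, not the code from the patch: scale_rows, process_row and MIN_PARALLEL_ROWS are hypothetical names, and the threshold of 64 rows is an arbitrary placeholder that would need measuring. The spinning at the barrier is not controlled from the code at all; it depends on running the process with OMP_WAIT_POLICY=PASSIVE in the environment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical threshold: images with fewer rows are processed serially. */
#define MIN_PARALLEL_ROWS 64

/* Hypothetical per-row kernel: writes a value derived from (y, x). */
static void process_row (uint32_t *row, int width, int y)
{
    for (int x = 0; x < width; x++)
        row[x] = (uint32_t) y * 1000u + (uint32_t) x;
}

/*
 * Rows are split into equal contiguous chunks (schedule(static)), and the
 * if clause keeps small images on a single thread. When the file is built
 * without -fopenmp the pragma is ignored and the loop simply runs serially.
 */
void scale_rows (uint32_t *dst, int dst_stride, int width, int height)
{
    int y;

    #pragma omp parallel for schedule(static) if(height >= MIN_PARALLEL_ROWS)
    for (y = 0; y < height; y++)
        process_row (dst + (size_t) dst_stride * (size_t) y, width, y);
}
```

Switching schedule(static) to schedule(dynamic) in such a sketch is a one-word change, which makes it easy to compare both policies on small and large images.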
Re: [Pixman] [PATCH/RFC] Use OpenMP for bilinear scaled fast paths
On Mon, 25 Jun 2012 02:00:27 +0300, Siarhei Siamashka siarhei.siamas...@gmail.com wrote:
> Does it actually make sense? I remember somebody was strongly opposing
> the idea of spawning threads in pixman in the past, but can't find this
> e-mail right now.

The only caveat from my point of view is that pixman_image_composite()
must be atomic, as the current cairo_image_surface_t is meant to be
synchronous. Or at least an API should be added so that I can serialise
the operations within cairo_image_surface_t.

In the past, I believe we've suggested grander schemes that would require
us to expose the asynchronous nature to the user. However, simply using
OpenMP to parallelise the kernels should not leak across the interface,
and so it is acceptable. So it just boils down to whether this makes
maintenance harder and interferes with future plans...

Is there a way to hint to OpenMP how many threads to use? As we know the
memory characteristics for most of the routines, do we not want to hint
to OMP not to use more threads than required to saturate memory bw? If it
was able to automatically fine tune itself, could we then not open up
more kernels for parallelisation? (Granted the scaling loops have the
worst performance characteristics, not even rivalling the single-threaded
performance of skia.)

Otherwise it's a big win for such a tiny patch! Just need to cross-check
that we don't introduce a regression on the older single-core no-cache
chips. :(

Siarhei, just one more thing to consider: tiling. :)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
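On the question of hinting a thread count: OpenMP has a num_threads clause for a single parallel region, alongside the global OMP_NUM_THREADS environment variable and omp_set_num_threads(). The sketch below assumes the caller already has an estimate of how many threads saturate memory bandwidth; clamp_threads and fill_rows are hypothetical names, and nothing here tunes itself automatically.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: never ask for more threads than are available, and
 * never for more than the (assumed known) number that saturates memory
 * bandwidth. Always returns at least 1.
 */
int clamp_threads (int available, int saturating)
{
    if (available < 1)
        available = 1;
    if (saturating < 1)
        saturating = 1;
    return available < saturating ? available : saturating;
}

/* Fill a rectangle, using at most max_threads threads for the rows. */
void fill_rows (uint32_t *dst, int stride, int width, int height,
                uint32_t color, int max_threads)
{
    int y;

    if (max_threads < 1)
        max_threads = 1; /* num_threads requires a positive value */

    /* Without -fopenmp the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for num_threads(max_threads) schedule(static)
    for (y = 0; y < height; y++)
    {
        uint32_t *row = dst + (size_t) stride * (size_t) y;

        for (int x = 0; x < width; x++)
            row[x] = color;
    }
}
```

A caller could pass clamp_threads (omp_get_max_threads (), 2) for a routine believed to be limited by memory bandwidth at two threads; the value 2 is purely illustrative.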
Re: [Pixman] [PATCH/RFC] Use OpenMP for bilinear scaled fast paths
Chris Wilson ch...@chris-wilson.co.uk writes:

> On Mon, 25 Jun 2012 02:00:27 +0300, Siarhei Siamashka siarhei.siamas...@gmail.com wrote:
>> Does it actually make sense? I remember somebody was strongly opposing
>> the idea of spawning threads in pixman in the past, but can't find this
>> e-mail right now.

You may be remembering an IRC discussion about it, where Joonas was
opposed to libraries spawning threads:

    http://people.freedesktop.org/~sandmann/joonas-threads

> The only caveat from my point of view is that pixman_image_composite()
> must be atomic as the current cairo_image_surface_t is meant to be
> synchronous. Or at least API added so that I can serialise the
> operations within cairo_image_surface_t.

The main concern from me is making sure that it doesn't cause issues in
the X server, which is known to do wacky things with signals and possibly
threads. But the answer to that is to just put it in and get it tested.

> In the past, I believe we've suggested grander schemes that would
> require us to expose the asynchronous nature to the user. However,
> simply using OpenMP to parallelise the kernels should not leak across
> the interface and so it is acceptable. So it just boils down to whether
> this makes maintenance harder and interferes with future plans...

At some point, I think grander schemes will be useful, where grander
scheme might mean rolling our own thread pool and/or adding an
asynchronous API to pixman.

One case is radial gradients. These are generated through iterators, and
I am not sure that OpenMP is up to the task of parallelizing those. That
is, it doesn't seem likely that OpenMP can deal with code like this:

    iter_init (src_iter, height);
    iter_init (dest_iter, height);

    for (i = 0; i < height; ++i)
    {
        iter_fetch (src_iter);
        iter_fetch (dest_iter);
        combine ();
        iter_write (dest_iter);
    }

But that doesn't mean that OpenMP can't be used for the things that it
will deal with.

> Is there a way to hint to OpenMP how many threads to use? As we know the
> memory characteristics for most of the routines, do we not want to hint
> to OMP not to use more threads than required to saturate memory bw?

We know the memory characteristics, but the arithmetic characteristics
are less predictable. If some operation is doing a lot of arithmetic, we
want more threads for it.

What would be the performance impact of just parallelizing as much as
possible? I suppose if one thread can saturate the memory bandwidth,
having more threads would just pointlessly occupy more cores that could
be used for other purposes. I don't know how much of a concern that
actually is though.

I suppose a JIT compiler might be able to make an estimate of the number
of cycles per cache line accessed for the code it generated.

> Otherwise it's a big win for such a tiny patch! Just need to cross-check
> that we don't introduce regression on the older single-core no-cache
> chips. :(

Even if it is a small performance regression on single-core chips, I
still think it's worth it. Single-core chips are quickly becoming a thing
of the past, and we could offer a --disable-omp configure argument for
embedded systems where the CPU is known to be single-core ahead of time.

Soren
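The difficulty with the iterator loop is that each fetch advances state left behind by the previous row, so a plain "parallel for" over i would hand rows to threads whose iterators are in the wrong position. One possible restructuring, shown purely as a sketch, gives every thread its own iterator positioned at the first row of a contiguous band. It rests on the assumption that an iterator can be initialized at an arbitrary row; row_iter_t, row_iter_init_at, row_iter_fetch and render_rows are hypothetical and are not pixman's iterator API.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical iterator: remembers which row it will produce next. */
typedef struct
{
    int next_row;
    int width;
} row_iter_t;

/* Assumption: the iterator can be positioned at any starting row. */
static void row_iter_init_at (row_iter_t *it, int first_row, int width)
{
    it->next_row = first_row;
    it->width = width;
}

/* Produce one row and advance; the output depends on the iterator state. */
static void row_iter_fetch (row_iter_t *it, uint32_t *out)
{
    for (int x = 0; x < it->width; x++)
        out[x] = (uint32_t) (it->next_row * it->width + x);
    it->next_row++;
}

void render_rows (uint32_t *dst, int width, int height)
{
    /* Without -fopenmp this block runs once, as a single "thread". */
    #pragma omp parallel
    {
        int nthreads = 1, tid = 0;
#ifdef _OPENMP
        nthreads = omp_get_num_threads ();
        tid = omp_get_thread_num ();
#endif
        /* Contiguous band of rows owned by this thread. */
        int first = (int) ((long long) height * tid / nthreads);
        int last = (int) ((long long) height * (tid + 1) / nthreads);
        row_iter_t it; /* private: declared inside the parallel region */

        row_iter_init_at (&it, first, width);
        for (int y = first; y < last; y++)
            row_iter_fetch (&it, dst + (size_t) width * (size_t) y);
    }
}
```

Whether real iterators can be seeked cheaply like this is exactly the open question; if they cannot, a hand-rolled thread pool as mentioned above may be the more realistic route.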
[Pixman] [PATCH/RFC] Use OpenMP for bilinear scaled fast paths
Does it actually make sense? I remember somebody was strongly opposing
the idea of spawning threads in pixman in the past, but can't find this
e-mail right now.

Even if using multithreaded rendering is acceptable, the next question is
whether to rely on OpenMP for it. Currently OpenMP is disabled in the
Android toolchain by default:
    https://groups.google.com/forum/#!topic/android-ndk/pUfqxURgNbQ
Clang/LLVM does not support OpenMP either.

Some benchmarks with cairo-perf-trace (gcc 4.7.1, CFLAGS=-O2 -fopenmp):

=== Core i7 860 @2.8GHz ===

before patch:
[  0]    image    firefox-fishtank   66.912   66.931   0.13%   3/3

export OMP_NUM_THREADS=1
[  0]    image    firefox-fishtank   67.285   67.393   0.12%   3/3

export OMP_NUM_THREADS=2
[  0]    image    firefox-fishtank   40.156   40.192   0.07%   3/3

export OMP_NUM_THREADS=3
[  0]    image    firefox-fishtank   31.152   31.241   0.21%   3/3

export OMP_NUM_THREADS=4
[  0]    image    firefox-fishtank   26.507   26.540   0.15%   3/3

=== Radeon HD 6770 (xf86-video-ati-6.14.4, Mesa 8.1-devel (git-6e7756d))

[  0]    xlib     firefox-fishtank   34.135   34.156   0.23%   3/3
[  0]    gl       firefox-fishtank    5.671    5.755   0.89%   3/3

---
 pixman/pixman-inlines.h | 24 +++-
 1 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/pixman/pixman-inlines.h b/pixman/pixman-inlines.h
index 3532867..7ba0d09 100644
--- a/pixman/pixman-inlines.h
+++ b/pixman/pixman-inlines.h
@@ -765,6 +765,14 @@ bilinear_pad_repeat_get_scanline_bounds (int32_t source_image_width,
  * range and can fit into unsigned byte or be used with 8-bit SIMD
  * multiplication instructions.
  */
+
+#define OMP_BILINEAR_PARALLEL_FOR _Pragma("omp parallel for default(none) \
+    firstprivate(height,dst_line,dst_stride,unit_y,unit_x,src_first_line, \
+    src_stride,max_vx,right_pad,left_pad,left_tz,right_tz,src_width, \
+    src_width_fixed,src_image,need_src_extension,mask_line, \
+    mask_stride,v,vy,width) \
+    private(vx,y1,y2,mask) schedule(static) if(height > 1)")
+
 #define FAST_BILINEAR_MAINLOOP_INT(scale_func_name, scanline_func, src_type_t, mask_type_t, \
                                   dst_type_t, repeat_mode, flags) \
 static void \
@@ -782,7 +790,7 @@ fast_composite_scaled_bilinear ## scale_func_name (pixman_implementation_t *imp,
     pixman_fixed_t unit_x, unit_y; \
     int32_t left_pad, left_tz, right_tz, right_pad; \
     \
-    dst_type_t *dst; \
+    int i; \
     mask_type_t solid_mask; \
     const mask_type_t *mask = &solid_mask; \
     int src_stride, mask_stride, dst_stride; \
@@ -864,20 +872,19 @@ fast_composite_scaled_bilinear ## scale_func_name (pixman_implementation_t *imp,
         src_width_fixed = pixman_int_to_fixed (src_width); \
     } \
     \
-    while (--height >= 0) \
+    OMP_BILINEAR_PARALLEL_FOR \
+    for (i = 0; i < height; i++) \
     { \
         int weight1, weight2; \
-        dst = dst_line; \
-        dst_line += dst_stride; \
+        dst_type_t *dst = dst_line + (uintptr_t)dst_stride * i; \
         vx = v.vector[0]; \
         if (flags & FLAG_HAVE_NON_SOLID_MASK) \
         { \
-            mask = mask_line;
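The core of the change is turning a loop that bumps dst_line on every iteration into one where each row address is computed from the loop index alone, so that iterations no longer depend on each other. A stripped-down sketch of that transformation follows; rows_pointer_bump and rows_indexed are hypothetical names, the per-pixel work is a placeholder increment, and the sketch uses a ptrdiff_t cast for the row offset instead of the uintptr_t cast in the patch, which is a choice made here and not something taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Serial form: every iteration needs the pointer left by the previous one. */
void rows_pointer_bump (uint32_t *dst_line, int dst_stride,
                        int width, int height)
{
    while (--height >= 0)
    {
        uint32_t *dst = dst_line;

        dst_line += dst_stride;
        for (int x = 0; x < width; x++)
            dst[x] += 1; /* placeholder for the real scanline function */
    }
}

/* Indexed form: the row address depends only on i, so rows are independent. */
void rows_indexed (uint32_t *dst_line, int dst_stride,
                   int width, int height)
{
    int i;

    /* Without -fopenmp the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for schedule(static) if(height > 1)
    for (i = 0; i < height; i++)
    {
        uint32_t *dst = dst_line + (ptrdiff_t) dst_stride * i;

        for (int x = 0; x < width; x++)
            dst[x] += 1; /* same placeholder work as above */
    }
}
```

Both functions touch exactly the same pixels; only the second form can be split across threads without sharing a running pointer.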
Re: [Pixman] [PATCH/RFC] Use OpenMP for bilinear scaled fast paths
On Mon, Jun 25, 2012 at 2:00 AM, Siarhei Siamashka siarhei.siamas...@gmail.com wrote:
> Some benchmarks with cairo-perf-trace (gcc 4.7.1, CFLAGS=-O2 -fopenmp):
>
> === Core i7 860 @2.8GHz ===
>
> before patch:
> [  0]    image    firefox-fishtank   66.912   66.931   0.13%   3/3
>
> export OMP_NUM_THREADS=1
> [  0]    image    firefox-fishtank   67.285   67.393   0.12%   3/3
>
> export OMP_NUM_THREADS=2
> [  0]    image    firefox-fishtank   40.156   40.192   0.07%   3/3
>
> export OMP_NUM_THREADS=3
> [  0]    image    firefox-fishtank   31.152   31.241   0.21%   3/3
>
> export OMP_NUM_THREADS=4
> [  0]    image    firefox-fishtank   26.507   26.540   0.15%   3/3
>
> === Radeon HD 6770 (xf86-video-ati-6.14.4, Mesa 8.1-devel (git-6e7756d))
>
> [  0]    xlib     firefox-fishtank   34.135   34.156   0.23%   3/3
> [  0]    gl       firefox-fishtank    5.671    5.755   0.89%   3/3

Almost forgot, the benchmarks would have been incomplete without also
trying LLVMpipe:

$ export LIBGL_ALWAYS_SOFTWARE=1
$ export CAIRO_TEST_TARGET=gl
$ cairo/perf/cairo-perf-trace -i3 cairo-traces/benchmark/firefox-fishtank.trace

[ # ]  backend                         test   min(s) median(s) stddev. count
[ # ]       gl: VMware, Inc. Gallium 0.4 on llvmpipe (LLVM 0x301) 2.1 Mesa 8.1-devel (git-6e7756d)
[  0]       gl             firefox-fishtank  112.933  113.604   0.32%    3/3

-- 
Best regards,
Siarhei Siamashka
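The scaling implied by the image backend timings quoted above can be worked out directly. Taking the unpatched minimum time of 66.912 s as the baseline (using the min column instead of the median is a choice made for this calculation), the speedup S_n with n threads is:

```latex
\begin{align*}
S_1 &= \frac{66.912}{67.285} \approx 0.99, &
S_2 &= \frac{66.912}{40.156} \approx 1.67, \\
S_3 &= \frac{66.912}{31.152} \approx 2.15, &
S_4 &= \frac{66.912}{26.507} \approx 2.52, \\
\frac{S_4}{4} &\approx 0.63, &
\frac{112.933}{26.507} &\approx 4.26 .
\end{align*}
```

So the single-thread OpenMP build costs roughly half a percent, four threads reach about 63% parallel efficiency on this trace, and the four-thread image backend is about 4.3 times faster than llvmpipe here. The efficiency figure covers the whole trace, which presumably includes work outside the bilinear fast paths that stays serial.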