On Thu, 22 Jan 2015 04:58 AM, Carsten Haitzler <ras...@rasterman.com> said:
> could you send the patch needed for this change? i like the theoretical
> number improvement.
> what i want to know is how invasive is it to the code.

--- a/configure.ac
+++ b/configure.ac
@@ -1666,6 +1666,12 @@ if test "x${have_pixman}" = "xyes" ; then
    EFL_ADD_FEATURE([EVAS_PIXMAN], [scale_sample], [${have_pixman_scale_sample}])
 fi
 
+### Check for OpenMP ###
+## a --disable-openmp option will be added to the configure script in case OpenMP is detected
+AC_OPENMP
+EFL_ADD_CFLAGS([EVAS], [$OPENMP_CFLAGS])
+EFL_ADD_LIBS([EVAS], [$OPENMP_CFLAGS])
+
 ## Engines
 
 define([EVAS_ENGINE_DEP_CHECK_FB], [
diff --git a/src/lib/evas/common/evas_scale_sample.c b/src/lib/evas/common/evas_scale_sample.c
index 91a75eb..f97b71b 100644
--- a/src/lib/evas/common/evas_scale_sample.c
+++ b/src/lib/evas/common/evas_scale_sample.c
@@ -587,14 +586,14 @@ scale_rgba_in_to_out_clip_sample_internal(RGBA_Image *src, RGBA_Image *dst,
         else
 #endif
           {
-             /* a scanline buffer */
-             buf = alloca(dst_clip_w * sizeof(DATA32));
-
              /* image masking */
              if (dc->clip.mask)
                {
                   RGBA_Image *im = dc->clip.mask;
 
+                  /* a scanline buffer */
+                  buf = alloca(dst_clip_w * sizeof(DATA32));
+
                   for (y = 0; y < dst_clip_h; y++)
                     {
                        dst_ptr = buf;
@@ -618,6 +617,28 @@ scale_rgba_in_to_out_clip_sample_internal(RGBA_Image *src, RGBA_Image *dst,
                }
              else
                {
+#if _OPENMP
+                  buf = malloc(dst_clip_w * dst_clip_h * sizeof(DATA32));
+                  if (!buf) return;
+
+                  #pragma omp parallel for private(dst_ptr, x, ptr)
+                  for (y = 0; y < dst_clip_h; y++)
+                    {
+                       dst_ptr = buf + (dst_clip_w * y);
+                       for (x = 0; x < dst_clip_w; x++)
+                         {
+                            ptr = row_ptr[y] + lin_ptr[x];
+                            *dst_ptr = *ptr;
+                            dst_ptr++;
+                         }
+                       /* * blend here [clip_w *] buf -> dst_ptr * */
+                       func((dst_ptr - dst_clip_w), NULL, dc->mul.col, (dptr + (y * dst_w)), dst_clip_w);
+                    }
+                  if (buf) free(buf);
+#else
+                  /* a scanline buffer */
+                  buf = alloca(dst_clip_w * sizeof(DATA32));
+
                   for (y = 0; y < dst_clip_h; y++)
                     {
                        dst_ptr = buf;
@@ -633,6 +654,7 @@ scale_rgba_in_to_out_clip_sample_internal(RGBA_Image *src, RGBA_Image *dst,
 
                        dptr += dst_w;
                     }
+#endif
                }
           }
      }
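
To make the intent of the OpenMP branch easier to follow outside of the evas
internals, here is a minimal, self-contained sketch of the same idea in plain
C: give every destination row its own slice of a full-size buffer so the row
loop can be split across threads. The names (scale_clip, blend_row) and the
plain copies standing in for the real sample/blend steps and the
row_ptr/lin_ptr lookup tables are simplified stand-ins, not the actual evas
API.

#include <stdint.h>
#include <stdlib.h>

static void blend_row(const uint32_t *src, uint32_t *dst, int w)
{
   /* stand-in for the real evas blend func(): plain copy */
   for (int x = 0; x < w; x++) dst[x] = src[x];
}

static int scale_clip(const uint32_t *src, int src_stride,
                      uint32_t *dst, int dst_stride,
                      int clip_w, int clip_h)
{
   /* one buffer for the whole clip instead of a single scanline buffer,
    * so every row owns a private slice and the iterations are independent */
   uint32_t *buf = malloc((size_t)clip_w * clip_h * sizeof(uint32_t));
   if (!buf) return -1;

#ifdef _OPENMP
#pragma omp parallel for
#endif
   for (int y = 0; y < clip_h; y++)
     {
        uint32_t *row = buf + (size_t)y * clip_w;
        for (int x = 0; x < clip_w; x++)
          row[x] = src[(size_t)y * src_stride + x];           /* sample step */
        blend_row(row, dst + (size_t)y * dst_stride, clip_w); /* blend step  */
     }

   free(buf);
   return 0;
}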

On Wed, 21 January, 2015 07:19 PM, Cedric BAIL <cedric.b...@free.fr> said:

> Could you provide before and after full results from expedite? We had in the past
> some parallelization code and it ended up triggering a slower path in most cases.
> It also resulted in consuming massively more CPU with little improvement for the
> few test cases where we could see an improvement.

  Image Blend Nearest Scaled:  127.40    191.03 ( +49.9%)

Above is the expedite-cmp result on an i3. All the other test cases reported a
change of less than +/- 5%. Please note that I ran expedite with a -c (count)
of 10000; otherwise small test cases like "Textblock Intl" give varying
numbers. The above patch affected only the "Image Blend Nearest Scaled" test
case, which showed a consistent improvement of ~48%.

> Basically, if by adding 1 core you gain 30% and now both cores are running
> at 100%, we are not sure it is a good tradeoff. But if by having 3 more cores
> you only get a 48% speed increase and all cores are running at 100%, then we
> are sure it is not a wise choice from an energy perspective.

You can expect a speedup roughly linear in the number of cores only when the
workload is large. In the evas/expedite case we mostly scale small/medium
sized images with small clip areas, so there is little work per frame to
spread across threads; I believe this is the reason for the limited speedup.
And yes, we definitely need to benchmark energy consumption in this case.
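
One knob worth trying here (my assumption, not something the patch above does)
is OpenMP's if() clause: it keeps the loop serial below a work threshold, so
small clip areas never wake extra cores at all. A minimal sketch, with a
made-up threshold value that would need real tuning:

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cut-off: below this many pixels the loop stays serial.
 * The real value would have to come from benchmarking on each target. */
#define PAR_MIN_PIXELS (256 * 256)

static void fill_clip(uint32_t *buf, int clip_w, int clip_h)
{
   /* the if() clause disables the parallel region for small jobs, so tiny
    * clips avoid the thread wake-up cost and the extra energy use */
#ifdef _OPENMP
#pragma omp parallel for if ((long)clip_w * clip_h >= PAR_MIN_PIXELS)
#endif
   for (int y = 0; y < clip_h; y++)
     for (int x = 0; x < clip_w; x++)
       buf[(size_t)y * clip_w + x] = 0xff000000u;
}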

> We need extensive benchmarks on speed, memory and battery consumption for this
> move. One of the changes we can investigate is to do that manually and tweak it
> to only start a meaningful number of cores depending on the task. In many cases
> the main issue is memory bandwidth and we can maybe have some gain first by
> doing light compression on some more data. Anyway, it's an area that requires
> a lot of experimentation and data to move forward. We are definitely interested
> in more information.

Yes, we can tweak the number of threads using the OpenMP APIs. The above patch
(and some other experimental patches) behaves very differently on an ARM SoC
compared to a desktop PC, so we may have to selectively enable some
optimizations based on the architecture.
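
For reference, the runtime calls I have in mind are the standard
omp_get_num_procs()/omp_set_num_threads() pair; the cap of two worker threads
below is only an illustrative value, not a measured recommendation for either
ARM or x86:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
#ifdef _OPENMP
   /* cap the worker count instead of always using every core; the "2" is
    * just an example, the real number would come from per-target profiling */
   int cores = omp_get_num_procs();
   int workers = (cores > 2) ? 2 : cores;
   omp_set_num_threads(workers);
   printf("detected %d cores, using %d OpenMP threads\n", cores, workers);
#else
   printf("built without OpenMP support\n");
#endif
   return 0;
}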

Thanks,
Krishnaraj


