On 05/29/2015 08:45 AM, Siarhei Siamashka wrote:
Sorry, I did not explain it properly earlier. I did not mean
to replicate the whole 'pix_multiply' function in the little
endian code path.
No worries. What I did felt odd indeed. Let us see :)


The point is that we need to take care of some differences
between big and little endian powerpc variants. And we want
to isolate these differences in the smallest possible unit,
while still keeping the code readable and maintainable.

In your first patch, you treated the vec_mergel/vec_mergeh
intrinsics as such smallest units and replaced them with the
MERGEL/MERGEH macros. It works correctly in this context, but
my complaint was that the code is not very clean this way. If
the vec_mergel/vec_mergeh intrinsics were supposed to be endian
agnostic and behave like the MERGEL/MERGEH macros, then I
guess that the vec_mergel/vec_mergeh intrinsics would have
been implemented this way in the first place. Redefining the
meaning of the standard intrinsics looks ugly. Basically, the
individual intrinsics are too small for using them for isolating
big/little endian differences in a non-controversial way.
Got it. Yes, the problem is not actually what the intrinsics themselves do, but
how the vector settles when looking at it byte wise.

Now in this new patch the whole 'pix_multiply' function is
replicated. But it is too large. And duplicating a lot of
code is not nice either.

What I actually wanted to see was a new static force_inline function.
This function can isolate the big/little endian differences, be
reasonably small and do something meaningful by itself. For example,
the SSE2 code has the 'unpack_128_2x128' function for this:

     
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-sse2.c?id=pixman-0.32.6#n68

     static force_inline void
     unpack_128_2x128 (__m128i data, __m128i* data_lo, __m128i* data_hi)
     {
         *data_lo = _mm_unpacklo_epi8 (data, _mm_setzero_si128 ());
         *data_hi = _mm_unpackhi_epi8 (data, _mm_setzero_si128 ());
     }

This function takes a single 128-bit vector with 8-bit elements as
the input argument and produces two 128-bit vectors (low and high part)
with 16-bit elements as the output. The code is simple and it is easy
to see what it does.

It would be great to have something similar in the powerpc code too.
The 'force_inline' attribute ensures that the compiler inlines this
function, so there is no call overhead. You can (and are also
encouraged to) verify that the performance is not regressed by
using the 'lowlevel-blt-bench' program in the 'test' directory.
Right, I will try to elaborate a solution to that point here.

Thanks for the review!
--

Fernando Seiti Furusato
IBM Linux Technology Center

_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman

Reply via email to