On 05/29/2015 08:45 AM, Siarhei Siamashka wrote:
Sorry, I did not explain it properly earlier. I did not mean to replicate the whole 'pix_multiply' function in the little endian code path.
No worries. What I did felt odd indeed. Let us see :)
The point is that we need to take care of some differences between big and little endian powerpc variants. And we want to isolate these differences in the smallest possible unit, while still keeping the code readable and maintainable. In your first patch, you treated the vec_mergel/vec_mergeh intrinsics as such smallest units and replaced them with the MERGEL/MERGEH macros. It works correctly in this context, but my complaint was that the code is not very clean this way. If the vec_mergel/vec_mergeh intrinsics were supposed to be endian agnostic and behave like the MERGEL/MERGEH macros, then I guess that the vec_mergel/vec_mergeh intrinsics would have been implemented this way in the first place. Redefining the meaning of the standard intrinsics looks ugly. Basically, the individual intrinsics are too small for using them for isolating big/little endian differences in a non-controversial way.
Got it. Yes, the problem is not actually what the intrinsics themselves do, but how the vector settles when looking at it byte wise.
Now in this new patch the whole 'pix_multiply' function is replicated. But it is too large. And duplicating a lot of code is not nice either. What I actually wanted to see was a new static force_inline function. This function can isolate the big/little endian differences, be reasonably small and do something meaningful by itself. For example, the SSE2 code has the 'unpack_128_2x128' function for this: http://cgit.freedesktop.org/pixman/tree/pixman/pixman-sse2.c?id=pixman-0.32.6#n68 static force_inline void unpack_128_2x128 (__m128i data, __m128i* data_lo, __m128i* data_hi) { *data_lo = _mm_unpacklo_epi8 (data, _mm_setzero_si128 ()); *data_hi = _mm_unpackhi_epi8 (data, _mm_setzero_si128 ()); } This function takes a single 128-bit vector with 8-bit elements as the input argument and produces two 128-bit vectors (low and high part) with 16-bit elements as the output. The code is simple and it is easy to see what it does. It would be great to have something similar in the powerpc code too. The 'force_inline' attribute ensures that the compiler inlines this function, so there is no call overhead. You can (and are also encouraged to) verify that the performance is not regressed by using the 'lowlevel-blt-bench' program in the 'test' directory.
Right, I will try to elaborate a solution to that point here. Thanks for the review! -- Fernando Seiti Furusato IBM Linux Technology Center _______________________________________________ Pixman mailing list Pixman@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/pixman