On Wednesday 17 November 2010 07:47:39 Xu, Samuel wrote:
> Hi, Soeren Sandmann and Siarhei Siamashka:
> Glad to send out this refreshed patch to address points we discussed
> for this SSSE3 patch.
Thanks. And sorry for a rather late reply.

> In this new patch, we merged 32 bit and 64 bit asm code
> into unified code (Macros are moved to header file, and make them more
> reusable and easily to be read).

Actually I'm a bit concerned about how the code looks now. There are still
similar looking parts in the 32-bit and 64-bit implementations which could
be shared:

+/* the macro definition of shift copy in stage one . 32bytes.
+ * Two 16-bytes XMM register will be packed into another 16-byte by using palignr
+ * instruction. */
+.macro SHL_COPY_STAGE_ONE x
+	movaps	16(%rsi), %xmm2
+	sub	$32, %rdx
+	movaps	32(%rsi), %xmm3
+	lea	32(%rsi), %rsi
+	movdqa	%xmm3, %xmm4
+	palignr	\x, %xmm2, %xmm3
+	lea	32(%rdi), %rdi
+	palignr	\x, %xmm1, %xmm2
+	por	%xmm6, %xmm2
+	movaps	%xmm2, -32(%rdi)
+	por	%xmm6, %xmm3
+	movaps	%xmm3, -16(%rdi)
+.endm

and

+/* the macro definition of shift copy in stage one . 32bytes */
+.macro SHL_COPY_STAGE_ONE x
+	movaps	16(%eax), %xmm2
+	sub	$32, %ecx
+	movaps	32(%eax), %xmm3
+	lea	32(%eax), %eax
+	movdqa	%xmm3, %xmm4
+	palignr	\x, %xmm2, %xmm3
+	lea	32(%edx), %edx
+	palignr	\x, %xmm1, %xmm2
+	por	%xmm6, %xmm2
+	movaps	%xmm2, -32(%edx)
+	por	%xmm6, %xmm3
+	movaps	%xmm3, -16(%edx)
+.endm

The whole point of using macros is to make the code more maintainable,
readable and easy to extend. Just splitting the code into chunks and putting
them into macros with the primary goal of reducing the total number of lines
may actually have the opposite effect and result in rather obfuscated sources
instead.

I don't see an obvious and clean way to extend this code to support more
compositing operations. After all, the implemented code is just *one* of a
huge variety of compositing operations supported by pixman. And this
particular operation should not normally even be used much by properly
written code.
Basically, this is a simple conversion from x8r8g8b8 -> a8r8g8b8 format (by
explicitly setting the alpha channel value to 255 instead of having it
undefined, as for x8r8g8b8). If it is done as an intermediate step somewhere
in the rendering pipeline, then it's just a waste of CPU cycles (no matter
whether it is SSSE3 optimized or not, it should preferably just be eliminated
completely). If some other compositing function expects a8r8g8b8 input but is
given x8r8g8b8, then that function can be trivially extended to also support
x8r8g8b8, at worst by adding just a single OR instruction to it. And at best,
it may even become faster after this change (only 3 color components need to
be processed instead of 4). I have already mentioned this earlier regarding
your use case [1].

Anyway, if the goal is to seriously improve pixman performance on Intel Atom,
then a lot more operations need to be optimized too. Another simple operation
which is very similar to 'src_x888_8888' and still somewhat practically
useful is 'add_8888_8888', as mentioned earlier in [2]. Your code does not
seem to be ready to support it without major modifications.

I have attached a simple example of providing unified support for both 32-bit
and 64-bit systems and implementing multiple operations. Surely, this example
is missing non-executable stack markers and CFI directives, and does not
really utilize SIMD to get good performance. But perhaps something like this
could be used as a starting point? With all the bells and whistles added, it
might eventually come to resemble the current pixman ARM NEON code a lot.

That said, I don't actually want to advocate any particular way of
implementing this stuff (dynamic JIT code generation vs. assembly macros,
AT&T vs. Intel syntax, etc.). It's just my vision of what features it should
eventually have.
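Just to spell out how little the operation in question actually does: the
whole x8r8g8b8 -> a8r8g8b8 conversion is a single OR per pixel. Here is a
scalar C sketch of my own (the function name is hypothetical, not from the
patch) of such a scanline function:

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar reference for the 'src_x888_8888' operation: copy pixels while
 * forcing the (undefined) alpha byte of x8r8g8b8 to 255, i.e. simply
 * '*dst = *src | 0xFF000000' for every pixel. */
static void
scalar_src_x888_8888 (uint32_t *dst, const uint32_t *src, size_t n_pixels)
{
    size_t i;
    for (i = 0; i < n_pixels; i++)
        dst[i] = src[i] | 0xFF000000u;
}
```

Any compositing routine that consumes a8r8g8b8 can absorb this OR into its
own inner loop, which is why a standalone SSSE3 implementation of it may buy
little in practice.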
But if your code works correctly, does not add maintenance headaches (the
part responsible for interacting with the rest of pixman must be clean) and
provides some measurable performance improvement, I would be fine with it
too. After all, it can always be safely scrapped and replaced with something
better if needed. But more about this later.

> We also fixed issue like CPU detection/git log/make file checking/file
> name...etc.

That would be nice to have as a separate patch. It would have a much higher
chance of being accepted and committed first.

> For MOVD, we simplified the backward copy code since pervious code is too
> long and not gain obvious performance,

And this is what I'm worried about. First you proposed this part of the code
(backwards copy) without providing any initial explanation or benchmark
numbers. And now you abandon it, again without providing any benchmark
numbers or a description of your test. But because you still have plain x86
instructions there instead of MOVD, store forwarding aliasing may well affect
performance, as demonstrated by my simple test program posted earlier [3] as
part of the discussion.

Can we be sure that this SSSE3 patch actually provides any practical
performance improvement over the current pixman code? Consider that, compared
to your initial profiling data, SSE2 optimized code is now used for this
operation, and the problematic software prefetch which was almost halving [4]
the effective memory bandwidth for this operation has now been eliminated [5].
Sure, the 'test/lowlevel-blt-bench' microbenchmark showed a good improvement
for L1 cached data. But on the other hand, the performance of small random
operations has dropped (shown in the 'R' statistics). Maybe that's happening
because of the scanline function call overhead and this fwd/bwd special
handling for small buffer sizes? So a final verification with some real
practical use case would be interesting.
As for the pixman code which is really worth optimizing on Intel Atom, it is
the 'over_8888_8888'/'over_8888_x888' operation (function
'sse2_composite_over_8888_8888'). This is one of the most commonly used
functions. The microbenchmark numbers, measured in millions of pixels per
second when processing data in L1 cache, in L2 cache and in memory using the
'lowlevel-blt-bench' test program, are the following:

== Intel Atom N450 @1667MHz, DDR2-667 (64-bit) ==
 add_8888_8888 =  L1: 607.08  L2: 375.34  M: 259.53
 over_8888_x888 = L1: 123.73  L2: 117.10  M: 113.56
 over_8888_0565 = L1: 106.11  L2:  98.91  M:  99.07

== TI OMAP3430/3530, ARM Cortex-A8 @500MHz, LPDDR @166MHz (32-bit) ==

default build:
 add_8888_8888 =  L1: 227.26  L2:  84.71  M:  44.54
 over_8888_x888 = L1: 161.06  L2:  88.20  M:  44.86
 over_8888_0565 = L1: 127.02  L2:  93.99  M:  61.25

software prefetch disabled (*):
 add_8888_8888 =  L1: 351.44  L2:  97.29  M:  25.35
 over_8888_x888 = L1: 168.72  L2:  95.04  M:  24.81
 over_8888_0565 = L1: 128.06  L2:  98.96  M:  32.16

(*) The ARM code uses a relatively complex cross-scanline software prefetch
intended to maximize the use of the available memory bandwidth (it is
normally expected to be about as fast as the best available memcpy
implementation, though it was tuned for early OMAP3 chips and may need to be
updated for optimal performance on other devices). This code is typically
executed in a separate pipeline simultaneously with NEON instructions and has
practically no overhead, but it still becomes visible for extremely simple
operations such as 'add'.

It is interesting here that 'add_8888_8888' and 'over_8888_x888' access
memory in exactly the same way (read a 32-bit pixel from both source and
destination, write the result to destination), while 'add' is much less
computationally intensive than 'over'. On ARM, both 'add' and 'over' perform
exactly the same when working with data in memory, which shows that they are
memory bandwidth limited.
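To make the computational gap between the two operations concrete, here is a
scalar C sketch of my own (an illustration, not actual pixman code; function
names are hypothetical) of the per-pixel math for 'add_8888_8888' and for
'over_8888_8888' with premultiplied alpha:

```c
#include <stdint.h>

/* 'add': per-byte unsigned saturated add, which PADDUSB (x86) or
 * VQADD.U8 (NEON) does for many pixels in a single instruction. */
static uint32_t
pixel_add_8888 (uint32_t src, uint32_t dst)
{
    uint32_t result = 0;
    int shift;
    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t sum = ((src >> shift) & 0xFF) + ((dst >> shift) & 0xFF);
        result |= (sum > 0xFF ? 0xFFu : sum) << shift;
    }
    return result;
}

/* 'over': Porter-Duff OVER with premultiplied alpha, per channel:
 *   dst = src + dst * (255 - src_alpha) / 255
 * The division by 255 uses the usual exact rounding trick
 *   (x + 0x80 + ((x + 0x80) >> 8)) >> 8. */
static uint32_t
pixel_over_8888 (uint32_t src, uint32_t dst)
{
    uint32_t ia = 255 - (src >> 24);
    uint32_t result = 0;
    int shift;
    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t s = (src >> shift) & 0xFF;
        uint32_t d = (dst >> shift) & 0xFF;
        uint32_t t = d * ia + 0x80;
        t = (t + (t >> 8)) >> 8;
        /* s + t <= 255 holds for valid premultiplied input */
        result |= (s + t) << shift;
    }
    return result;
}
```

Even in scalar form it is clear that 'over' needs a multiply plus a
division-by-255 per channel while 'add' is a single saturated add, which
matches the numbers above: 'over' is CPU bound on Atom while 'add' is
memory bound everywhere.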
While on Atom, 'over' is clearly limited by CPU performance, and this is
something which can possibly be improved. Sure, Atom still wins against ARM
in absolute numbers for the 'over' operation, but it can probably do a lot
better than this. Another note: this microbenchmark also shows that there may
be little practical gain in optimizing the 'add_8888_8888' operation, as it
is already memory bandwidth limited.

1. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00357.html
2. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00378.html
3. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00415.html
4. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00404.html
5. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00563.html

-- 
Best regards,
Siarhei Siamashka
.intel_syntax noprefix
.altmacro

/* x86/x86-64 abstraction (linux only) */
#ifdef __amd64__
.macro ENTRANCE
    .set DST_PTR, rdi
    .set SRC_PTR, rsi
    .set PIXEL_COUNTER, rdx
.endm
#else
.macro ENTRANCE
    mov eax, [esp + 4]
    mov ecx, [esp + 8]
    mov edx, [esp + 12]
    .set DST_PTR, eax
    .set SRC_PTR, ecx
    .set PIXEL_COUNTER, edx
.endm
#endif

/* scanline function generator macro */
.macro generate_scanline_function function_name, prepare, composite, need_dst=0
.global function_name
function_name:
    ENTRANCE
    lea SRC_PTR, [SRC_PTR + PIXEL_COUNTER * 4]
    lea DST_PTR, [DST_PTR + PIXEL_COUNTER * 4]
    neg PIXEL_COUNTER
    prepare
1:
    /* load source pixel */
    movd xmm0, dword ptr [SRC_PTR + PIXEL_COUNTER * 4]
    /* load destination pixel if needed */
.if need_dst
    movd xmm1, dword ptr [DST_PTR + PIXEL_COUNTER * 4]
.endif
    /* do something */
    composite
    /* write back the result */
    movd dword ptr [DST_PTR + PIXEL_COUNTER * 4], xmm0
    add PIXEL_COUNTER, 1
    js 1b
    ret
.endm

/*
 * instantiate "void simple_copy32_or(int *dst, int *src, int number_of_pixels)" function
 * (performs '*dst = *src | 0xFF000000' operation on pixels)
 */
cff000000: .long 0xFF000000

.macro prepare_simple_copy32_or
    movd xmm2, cff000000
.endm
.macro composite_simple_copy32_or
    por xmm0, xmm2
.endm

generate_scanline_function simple_copy32_or, prepare_simple_copy32_or, composite_simple_copy32_or

/*
 * instantiate "void simple_copy32_add(int *dst, int *src, int number_of_pixels)" function
 * (performs '*dst = unsigned_saturated_8bit_vector_add(*dst, *src)' operation on pixels)
 */
.macro prepare_simple_copy32_add
.endm
.macro composite_simple_copy32_add
    paddusb xmm0, xmm1
.endm

generate_scanline_function simple_copy32_add, prepare_simple_copy32_add, composite_simple_copy32_add, 1
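The "scanline generator" pattern in the attached assembly is language
independent. For comparison, the same idea of instantiating one loop skeleton
per compositing operation can be sketched in C with a macro (an illustration
of my own; the names are hypothetical and only mirror the assembly above):

```c
#include <stdint.h>
#include <stddef.h>

/* One loop skeleton, instantiated per operation, mirroring the
 * generate_scanline_function macro in the attached assembly.
 * 'expr' computes the new destination pixel from s (source) and d (dest). */
#define GENERATE_SCANLINE_FUNCTION(name, expr)                  \
static void                                                     \
name (uint32_t *dst, const uint32_t *src, size_t n)             \
{                                                               \
    size_t i;                                                   \
    for (i = 0; i < n; i++)                                     \
    {                                                           \
        uint32_t s = src[i], d = dst[i];                        \
        (void) d; /* some operations do not read the dest */    \
        dst[i] = (expr);                                        \
    }                                                           \
}

/* Per-byte unsigned saturated add of two packed pixels (what one
 * PADDUSB does for four pixels at a time). */
static uint32_t
satadd8x4 (uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    int shift;
    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF);
        r |= (sum > 0xFF ? 0xFFu : sum) << shift;
    }
    return r;
}

/* Like simple_copy32_or:  *dst = *src | 0xFF000000 */
GENERATE_SCANLINE_FUNCTION (scanline_src_x888_8888, s | 0xFF000000u)

/* Like simple_copy32_add: *dst = saturated_add(*dst, *src) */
GENERATE_SCANLINE_FUNCTION (scanline_add_8888_8888, satadd8x4 (s, d))
```

Adding a new operation is one short 'expr' plus one instantiation line, which
is roughly the extensibility property the assembly version is aiming for.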
_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman