Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
I have a similar performance result, obtained about one month ago. The gvim test showed a large performance increase, about 360%. The other tests showed neither regressions nor improvements. I tested it on a 32-bit userland. I will test it again against the current git code and post the data later.

Regards,
Xinyun

On Fri, Dec 03, 2010 at 12:00:22AM +0800, Siarhei Siamashka wrote:
> Just did some benchmarks with the 'gnome-system-monitor' cairo perf trace
> which happens to use the 'src_x888_' operation. It looks like SSSE3 has
> about the same performance as SSE2, providing no measurable benefit in
> this use case. The low complexity of this operation and the hardware
> prefetcher make any differences between implementations much less
> noticeable in general.
>
> Intel Atom N450, 64-bit userland, gcc 4.5.1
>
> $ CAIRO_TEST_TARGET=image ./cairo-perf-trace gnome-system-monitor.trace
>
> === C slow path ===
>
> [ # ]  backend  test                  min(s)  median(s)  stddev.  count
> [ # ]  image:   pixman 0.21.3
> [ 0 ]  image    gnome-system-monitor  15.011  15.034     0.08%    6/6
>
> === C fast path [1] ===
>
> [ # ]  backend  test                  min(s)  median(s)  stddev.  count
> [ # ]  image:   pixman 0.21.3
> [ 0 ]  image    gnome-system-monitor  14.659  14.697     0.20%    6/6
>
> === SSE2 fast path [2] ===
>
> [ # ]  backend  test                  min(s)  median(s)  stddev.  count
> [ # ]  image:   pixman 0.21.3
> [ 0 ]  image    gnome-system-monitor  14.431  14.496     0.19%    6/6
>
> === SSSE3 fast path [3] ===
>
> [ # ]  backend  test                  min(s)  median(s)  stddev.  count
> [ # ]  image:   pixman 0.21.3
> [ 0 ]  image    gnome-system-monitor  14.455  14.496     0.17%    6/6
>
> === artificial test with just an empty stub for src_x888_ ===
>
> [ # ]  backend  test                  min(s)  median(s)  stddev.  count
> [ # ]  image:   pixman 0.21.3
> [ 0 ]  image    gnome-system-monitor  12.215  12.241     0.11%    6/6
>
> ---
>
> So I'm not sure if this SSSE3 code is very useful as is. But of course, if
> some practical use case makes heavy use of this function so that it works
> on data residing in the L1/L2 caches, then it could show a big improvement.
>
> 1. http://cgit.freedesktop.org/pixman/commit/?id=16bae834
> 2. http://cgit.freedesktop.org/pixman/commit/?id=e0b430a1
> 3. http://lists.freedesktop.org/archives/pixman/2010-November/000742.html
>
> --
> Best regards,
> Siarhei Siamashka

___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
On Monday 29 November 2010 20:59:52 Siarhei Siamashka wrote:
> On Wednesday 17 November 2010 07:47:39 Xu, Samuel wrote:
> > For MOVD, we simplified the backward copy code since the previous code
> > was too long and did not gain obvious performance,
>
> And this is what I'm worried about. First you proposed this part of the
> code (backwards copy) without providing any initial explanation or
> benchmark numbers. And now you abandon it, also without providing any
> benchmark numbers or a description of your test. But because you still
> have x86 instructions there instead of MOVD, store forwarding aliasing
> surely may affect performance, as demonstrated by my simple test program
> posted earlier [3] as part of the discussion.
>
> Can we be sure that this SSSE3 patch actually provides any practical
> performance improvement over the current pixman code? Consider that,
> compared to your initial profiling data, SSE2 optimized code is now used
> for this operation, and the problematic software prefetch, which was
> almost halving [4] the effective memory bandwidth for this operation, is
> now eliminated [5].
>
> Surely, the 'test/lowlevel-blt-bench' microbenchmark showed a good
> improvement for L1 cached data. But on the other hand, the performance of
> small random operations has dropped (shown as the 'R' statistics). Maybe
> that's happening because of the scanline function call overhead and this
> fwd/bwd special handling for small buffer sizes? So a final verification
> with some real practical use case would be interesting.
>
> [...]
>
> 1. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00357.html
> 2. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00378.html
> 3. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00415.html
> 4. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00404.html
> 5. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00563.html

Just did some benchmarks with the 'gnome-system-monitor' cairo perf trace which happens to use the 'src_x888_' operation. It looks like SSSE3 has about the same performance as SSE2, providing no measurable benefit in this use case. The low complexity of this operation and the hardware prefetcher make any differences between implementations much less noticeable in general.

Intel Atom N450, 64-bit userland, gcc 4.5.1

$ CAIRO_TEST_TARGET=image ./cairo-perf-trace gnome-system-monitor.trace

=== C slow path ===

[ # ]  backend  test                  min(s)  median(s)  stddev.  count
[ # ]  image:   pixman 0.21.3
[ 0 ]  image    gnome-system-monitor  15.011  15.034     0.08%    6/6

=== C fast path [1] ===

[ # ]  backend  test                  min(s)  median(s)  stddev.  count
[ # ]  image:   pixman 0.21.3
[ 0 ]  image    gnome-system-monitor  14.659  14.697     0.20%    6/6

=== SSE2 fast path [2] ===

[ # ]  backend  test                  min(s)  median(s)  stddev.  count
[ # ]  image:   pixman 0.21.3
[ 0 ]  image    gnome-system-monitor  14.431  14.496     0.19%    6/6

=== SSSE3 fast path [3] ===

[ # ]  backend  test                  min(s)  median(s)  stddev.  count
[ # ]  image:   pixman 0.21.3
[ 0 ]  image    gnome-system-monitor  14.455  14.496     0.17%    6/6

=== artificial test with just an empty stub for src_x888_ ===

[ # ]  backend  test                  min(s)  median(s)  stddev.  count
[ # ]  image:   pixman 0.21.3
[ 0 ]  image    gnome-system-monitor  12.215  12.241     0.11%    6/6

---

So I'm not sure if this SSSE3 code is very useful as is. But of course, if some practical use case makes heavy use of this function so that it works on data residing in the L1/L2 caches, then it could show a big improvement.

1. http://cgit.freedesktop.org/pixman/commit/?id=16bae834
2. http://cgit.freedesktop.org/pixman/commit/?id=e0b430a1
3. http://lists.freedesktop.org/archives/pixman/2010-November/000742.html

--
Best regards,
Siarhei Siamashka
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
On Wednesday 17 November 2010 07:47:39 Xu, Samuel wrote:
> Hi, Soeren Sandmann and Siarhei Siamashka:
> Glad to send out this refreshed patch to address the points we discussed
> for this SSSE3 patch.

Thanks. And sorry for a rather late reply.

> In this new patch, we merged the 32-bit and 64-bit asm code into unified
> code (macros are moved to a header file to make them more reusable and
> easier to read).

Actually, I'm a bit concerned about how the code looks now. There are still similar looking parts in the 32-bit and 64-bit implementations which could be shared:

+/* the macro definition of shift copy in stage one, 32 bytes.
+ * Two 16-byte XMM registers will be packed into another 16 bytes using
+ * the palignr instruction. */
+.macro SHL_COPY_STAGE_ONE x
+	movaps	16(%rsi), %xmm2
+	sub	$32, %rdx
+	movaps	32(%rsi), %xmm3
+	lea	32(%rsi), %rsi
+	movdqa	%xmm3, %xmm4
+	palignr	\x, %xmm2, %xmm3
+	lea	32(%rdi), %rdi
+	palignr	\x, %xmm1, %xmm2
+	por	%xmm6, %xmm2
+	movaps	%xmm2, -32(%rdi)
+	por	%xmm6, %xmm3
+	movaps	%xmm3, -16(%rdi)
+.endm

and

+/* the macro definition of shift copy in stage one, 32 bytes */
+.macro SHL_COPY_STAGE_ONE x
+	movaps	16(%eax), %xmm2
+	sub	$32, %ecx
+	movaps	32(%eax), %xmm3
+	lea	32(%eax), %eax
+	movdqa	%xmm3, %xmm4
+	palignr	\x, %xmm2, %xmm3
+	lea	32(%edx), %edx
+	palignr	\x, %xmm1, %xmm2
+	por	%xmm6, %xmm2
+	movaps	%xmm2, -32(%edx)
+	por	%xmm6, %xmm3
+	movaps	%xmm3, -16(%edx)
+.endm

The whole point of using macros is to make the code more maintainable, readable and easy to extend. Just splitting the code into chunks and putting them into macros with the primary goal of reducing the total number of lines may actually have the opposite effect and result in rather obfuscated sources instead. I don't see an obvious and clean way to extend this code to support more compositing operations. After all, the implemented code is just *one* of a huge variety of compositing operations supported by pixman.

And this particular operation should not normally even be used much by properly written code. Basically, this is a simple conversion from the x8r8g8b8 format to a8r8g8b8 (by explicitly setting the alpha channel value to 255, instead of having it undefined as in x8r8g8b8). If it is done as an intermediate step somewhere in the rendering pipeline, then it's just a waste of CPU cycles (no matter whether it is SSSE3 optimized or not, it should preferably be eliminated completely). If some other compositing function expects a8r8g8b8 input but is given x8r8g8b8, then that function can be trivially extended to also support x8r8g8b8, at worst by adding just a single OR instruction, and at best it may even become faster after this change (only 3 color components need to be processed instead of 4). I already mentioned this earlier regarding your use case [1].

Anyway, if the goal is to seriously improve pixman performance on Intel Atom, then a lot more operations need to be optimized too. Another simple operation which is very similar to 'src_x888_' and still somewhat practically useful is 'add__', as mentioned earlier in [2]. Your code does not seem to be ready to support it without major modifications.

I have attached a simple example of providing unified support for both 32-bit and 64-bit systems and implementing multiple operations. Surely, this example is missing non-executable stack markers and CFI directives, and does not really utilize SIMD to get good performance. But perhaps something like this may be used as a starting point? With all the bells and whistles added, it might eventually start to resemble the current pixman ARM NEON code a lot.

That said, I don't actually want to advocate any particular way of implementing this stuff (dynamic JIT code generation versus assembly macros, AT&T versus Intel syntax, etc.). It's just my vision of what features it should eventually have.
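For reference, the operation under discussion (forcing the alpha byte of every x8r8g8b8 pixel to 255 to produce a8r8g8b8) is, in scalar C, just one OR per pixel. A minimal sketch (the function name is illustrative, not pixman's actual API):

```c
#include <stdint.h>
#include <stddef.h>

/* Convert an x8r8g8b8 scanline to a8r8g8b8 in place by setting the
 * (otherwise undefined) alpha byte of every pixel to 0xff.  This is the
 * entire job of the fast path discussed in this thread. */
static void
x888_to_8888 (uint32_t *pixels, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pixels[i] |= 0xff000000u;
}
```

This also illustrates why folding the OR into a consuming compositing function is cheap: any routine already touching each pixel can absorb the single OR at essentially no cost.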
But if your code works correctly, does not add maintenance headaches (the part responsible for interacting with the rest of pixman must be clean) and provides some measurable performance improvement, I would be fine with it too. After all, it could always be safely scrapped and replaced with something better if needed. But more about this later.

> We also fixed issues like CPU detection/git log/makefile checking/file
> name...etc.

That would be nice to have as a separate patch. It has a much higher chance of being accepted and committed first.

> For MOVD, we simplified the backward copy code since the previous code
> was too long and did not gain obvious performance,

And this is what I'm worried about. First you proposed this part of the code (backwards copy) without providing any initial explanation or benchmark numbers. And now you abandon it, also without providing any benchmark numbers or description of
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
"Xu, Samuel" writes:
> Appreciate comments on this patch

I'll be away for the next three weeks, so I won't be able to review anything until then.

Soren
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
Appreciate comments on this patch.

-----Original Message-----
From: Xu, Samuel
Sent: Wednesday, November 17, 2010 1:48 PM
To: sandm...@daimi.au.dk; Siarhei Siamashka; Ma, Ling; Liu, Xinyun; Zhao, Yakui; pixman@lists.freedesktop.org
Subject: RE: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Hi, Soeren Sandmann and Siarhei Siamashka:

Glad to send out this refreshed patch to address the points we discussed for this SSSE3 patch. In this new patch, we merged the 32-bit and 64-bit asm code into unified code (macros are moved to a header file to make them more reusable and easier to read). We also fixed issues like CPU detection/git log/makefile checking/file name...etc. For MOVD, we simplified the backward copy code since the previous code was too long and did not gain obvious performance.

Thanks!
Samuel

-----Original Message-----
From: Xu, Samuel
Sent: Thursday, September 09, 2010 12:56 PM
To: 'sandm...@daimi.au.dk'; Siarhei Siamashka; Ma, Ling; Liu, Xinyun
Subject: RE: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Hi, Soeren Sandmann and Siarhei Siamashka:

It has really been a long discussion, and thanks for your patience! I believe all of us have the same goal of making pixman faster and better. Let's continue...

Merging the 32-bit and 64-bit asm code into a unified asm file is possible, but it has pros and cons. E.g. I just tried Sun Studio (my first experience with it) and found that SS 12 can't expand macros correctly, so it failed to build some asm files containing macros. To make Sun Studio work we originally wanted to remove the existing macros, while it now seems we should use more macros. Anyway, we will go ahead and follow your 2 further suggestions: 1) MOVD adoption, 2) unify the 32-bit and 64-bit asm code via macros.

So the new patch will be shaped as:
1. Fix the 64-bit CPU detection issue for MMX and SSE2
2. Add more comments to the git commit log
3. Change the SSSE3 intrinsics check to an SSSE3 asm check in the makefile
4. Remove #include "pixman-combine32.h" and composite_over__n_ in pixman-ssse3.c
5. ASM file changes:
   1) Unify the 32-bit and 64-bit asm code via macros
   2) Change the asm file names to pixman-ssse3-x86-32-asm.S and pixman-ssse3-x86-64-asm.S
   3) Change the asm function name to "composite_line_src_x888__ssse3"
   4) Remove "defined(_M_AMD64))"
   5) Comments in the assembly about the store forwarding
   6) Use MOVD as per Siarhei Siamashka's idea

Please raise a flag if those changes are not complete, or if Sun Studio's priority is higher than unified 32-bit and 64-bit code.

Thanks!
Samuel

-----Original Message-----
From: sandm...@daimi.au.dk [mailto:sandm...@daimi.au.dk]
Sent: Thursday, September 09, 2010 10:16 AM
To: Xu, Samuel
Cc: Siarhei Siamashka; pixman@lists.freedesktop.org; Liu, Xinyun; Ma, Ling
Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

"Xu, Samuel" writes:
> Hi, Soeren Sandmann and Siarhei Siamashka:
>
> As a wrap-up of the current discussion, combining your comments, can we
> assume this new patch of SSSE3 is ok? The new patch might contain:
> 1. Fix the 64-bit CPU detection issue for MMX and SSE2
> 2. Add more comments to the git commit log
> 3. Change the SSSE3 intrinsics check to an SSSE3 asm check in the makefile
> 4. Remove #include "pixman-combine32.h" and composite_over__n_ in
>    pixman-ssse3.c
> 5. ASM file changes:
>    1) Change the asm file names to pixman-ssse3-x86-32-asm.S and
>       pixman-ssse3-x86-64-asm.S
>    2) Change the asm function name to "composite_line_src_x888__ssse3"
>    3) Remove "defined(_M_AMD64))"
>    4) Comments in the assembly about the store forwarding

Such changes will certainly improve the patch.

> We know there is some discussion on the asm code, e.g. MOVD, unifying the
> 32-bit and 64-bit code. But it won't introduce real defects. We can still
> put further changes in the next wave, to avoid getting stuck now.

No, we are not going to apply any patches that we know will need to be fixed later, because I don't believe anyone is going to do that work. It's fine to send a patch series if you think it's easier that way.

I am also interested in the answer to the question of whether this code was generated by passing intrinsics through the Intel compiler.

> I am not sure about the issue with Sun Studio. Sun Studio declares GNU asm
> compatibility, but I am not sure whether it is 100% compatible. If the
> issue is caused by Sun Studio itself, can we add an #ifdef to disable the
> SSSE3 patch for Sun Studio? In that case, how do we detect Sun Studio?

There are various approaches you can take with respect to unusual compilers such as Sun Studio. The most common approach is to only enable the new optimizations for the compilers that you are personally interested in, and leave other compilers to people who are interested. However, *y
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
Hi, Soeren Sandmann and Siarhei Siamashka:

Glad to send out this refreshed patch to address the points we discussed for this SSSE3 patch. In this new patch, we merged the 32-bit and 64-bit asm code into unified code (macros are moved to a header file to make them more reusable and easier to read). We also fixed issues like CPU detection/git log/makefile checking/file name...etc. For MOVD, we simplified the backward copy code since the previous code was too long and did not gain obvious performance.

Thanks!
Samuel

-----Original Message-----
From: Xu, Samuel
Sent: Thursday, September 09, 2010 12:56 PM
To: 'sandm...@daimi.au.dk'; Siarhei Siamashka; Ma, Ling; Liu, Xinyun
Subject: RE: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Hi, Soeren Sandmann and Siarhei Siamashka:

It has really been a long discussion, and thanks for your patience! I believe all of us have the same goal of making pixman faster and better. Let's continue...

Merging the 32-bit and 64-bit asm code into a unified asm file is possible, but it has pros and cons. E.g. I just tried Sun Studio (my first experience with it) and found that SS 12 can't expand macros correctly, so it failed to build some asm files containing macros. To make Sun Studio work we originally wanted to remove the existing macros, while it now seems we should use more macros. Anyway, we will go ahead and follow your 2 further suggestions: 1) MOVD adoption, 2) unify the 32-bit and 64-bit asm code via macros.

So the new patch will be shaped as:
1. Fix the 64-bit CPU detection issue for MMX and SSE2
2. Add more comments to the git commit log
3. Change the SSSE3 intrinsics check to an SSSE3 asm check in the makefile
4. Remove #include "pixman-combine32.h" and composite_over__n_ in pixman-ssse3.c
5. ASM file changes:
   1) Unify the 32-bit and 64-bit asm code via macros
   2) Change the asm file names to pixman-ssse3-x86-32-asm.S and pixman-ssse3-x86-64-asm.S
   3) Change the asm function name to "composite_line_src_x888__ssse3"
   4) Remove "defined(_M_AMD64))"
   5) Comments in the assembly about the store forwarding
   6) Use MOVD as per Siarhei Siamashka's idea

Please raise a flag if those changes are not complete, or if Sun Studio's priority is higher than unified 32-bit and 64-bit code.

Thanks!
Samuel

-----Original Message-----
From: sandm...@daimi.au.dk [mailto:sandm...@daimi.au.dk]
Sent: Thursday, September 09, 2010 10:16 AM
To: Xu, Samuel
Cc: Siarhei Siamashka; pixman@lists.freedesktop.org; Liu, Xinyun; Ma, Ling
Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

"Xu, Samuel" writes:
> Hi, Soeren Sandmann and Siarhei Siamashka:
>
> As a wrap-up of the current discussion, combining your comments, can we
> assume this new patch of SSSE3 is ok? The new patch might contain:
> 1. Fix the 64-bit CPU detection issue for MMX and SSE2
> 2. Add more comments to the git commit log
> 3. Change the SSSE3 intrinsics check to an SSSE3 asm check in the makefile
> 4. Remove #include "pixman-combine32.h" and composite_over__n_ in
>    pixman-ssse3.c
> 5. ASM file changes:
>    1) Change the asm file names to pixman-ssse3-x86-32-asm.S and
>       pixman-ssse3-x86-64-asm.S
>    2) Change the asm function name to "composite_line_src_x888__ssse3"
>    3) Remove "defined(_M_AMD64))"
>    4) Comments in the assembly about the store forwarding

Such changes will certainly improve the patch.

> We know there is some discussion on the asm code, e.g. MOVD, unifying the
> 32-bit and 64-bit code. But it won't introduce real defects. We can still
> put further changes in the next wave, to avoid getting stuck now.

No, we are not going to apply any patches that we know will need to be fixed later, because I don't believe anyone is going to do that work. It's fine to send a patch series if you think it's easier that way.

I am also interested in the answer to the question of whether this code was generated by passing intrinsics through the Intel compiler.

> I am not sure about the issue with Sun Studio. Sun Studio declares GNU asm
> compatibility, but I am not sure whether it is 100% compatible. If the
> issue is caused by Sun Studio itself, can we add an #ifdef to disable the
> SSSE3 patch for Sun Studio? In that case, how do we detect Sun Studio?

There are various approaches you can take with respect to unusual compilers such as Sun Studio. The most common approach is to only enable the new optimizations for the compilers that you are personally interested in, and leave other compilers to people who are interested. However, *you* introduced all this Sun Studio handling for SSSE3, which means *you* need to be able to explain what it does, and justify why it is there. Or to put it even more bluntly: Do not cut and paste code you don't understand and can't test, and
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
Although the code appends an "or" operation, after much tuning work I don't see any regression compared with the original gcc memcpy assembly code.

Thanks
Ling

> -----Original Message-----
> From: Ma, Ling
> Sent: Thursday, September 09, 2010 1:55 PM
> To: 'sandm...@daimi.au.dk'; Xu, Samuel
> Cc: Siarhei Siamashka; pixman@lists.freedesktop.org; Liu, Xinyun
> Subject: RE: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
>
> Hi Soren
>
> > I am also interested in the answer to the question of whether this
> > code was generated by passing intrinsics through the Intel compiler.
>
> I'm the original author of this code; it is based on our original memcpy
> function, which has been pushed into glibc.
>
> Thanks
> Ling
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
Hi Soren

> I am also interested in the answer to the question of whether this
> code was generated by passing intrinsics through the Intel compiler.

I'm the original author of this code; it is based on our original memcpy function, which has been pushed into glibc.

Thanks
Ling
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
"Xu, Samuel" writes:
> Hi, Soeren Sandmann and Siarhei Siamashka:
>
> As a wrap-up of the current discussion, combining your comments, can we
> assume this new patch of SSSE3 is ok? The new patch might contain:
> 1. Fix the 64-bit CPU detection issue for MMX and SSE2
> 2. Add more comments to the git commit log
> 3. Change the SSSE3 intrinsics check to an SSSE3 asm check in the makefile
> 4. Remove #include "pixman-combine32.h" and composite_over__n_ in
>    pixman-ssse3.c
> 5. ASM file changes:
>    1) Change the asm file names to pixman-ssse3-x86-32-asm.S and
>       pixman-ssse3-x86-64-asm.S
>    2) Change the asm function name to "composite_line_src_x888__ssse3"
>    3) Remove "defined(_M_AMD64))"
>    4) Comments in the assembly about the store forwarding

Such changes will certainly improve the patch.

> We know there is some discussion on the asm code, e.g. MOVD, unifying the
> 32-bit and 64-bit code. But it won't introduce real defects. We can still
> put further changes in the next wave, to avoid getting stuck now.

No, we are not going to apply any patches that we know will need to be fixed later, because I don't believe anyone is going to do that work. It's fine to send a patch series if you think it's easier that way.

I am also interested in the answer to the question of whether this code was generated by passing intrinsics through the Intel compiler.

> I am not sure about the issue with Sun Studio. Sun Studio declares GNU asm
> compatibility, but I am not sure whether it is 100% compatible. If the
> issue is caused by Sun Studio itself, can we add an #ifdef to disable the
> SSSE3 patch for Sun Studio? In that case, how do we detect Sun Studio?

There are various approaches you can take with respect to unusual compilers such as Sun Studio. The most common approach is to only enable the new optimizations for the compilers that you are personally interested in, and leave other compilers to people who are interested.

However, *you* introduced all this Sun Studio handling for SSSE3, which means *you* need to be able to explain what it does, and justify why it is there. Or to put it even more bluntly: Do not cut and paste code you don't understand and can't test, and then ask me to rewrite the code for you until it is good enough to go in.

Soren
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
Hi, Soeren Sandmann and Siarhei Siamashka:

As a wrap-up of the current discussion, combining your comments, can we assume this new patch of SSSE3 is ok? The new patch might contain:
1. Fix the 64-bit CPU detection issue for MMX and SSE2
2. Add more comments to the git commit log
3. Change the SSSE3 intrinsics check to an SSSE3 asm check in the makefile
4. Remove #include "pixman-combine32.h" and composite_over__n_ in pixman-ssse3.c
5. ASM file changes:
   1) Change the asm file names to pixman-ssse3-x86-32-asm.S and pixman-ssse3-x86-64-asm.S
   2) Change the asm function name to "composite_line_src_x888__ssse3"
   3) Remove "defined(_M_AMD64))"
   4) Comments in the assembly about the store forwarding

We know there is some discussion on the asm code, e.g. MOVD, unifying the 32-bit and 64-bit code. But it won't introduce real defects. We can still put further changes in the next wave, to avoid getting stuck now.

I am not sure about the issue with Sun Studio. Sun Studio declares GNU asm compatibility, but I am not sure whether it is 100% compatible. If the issue is caused by Sun Studio itself, can we add an #ifdef to disable the SSSE3 patch for Sun Studio? In that case, how do we detect Sun Studio?

Thanks!
Samuel

-----Original Message-----
From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com]
Sent: Wednesday, September 08, 2010 3:39 AM
To: Ma, Ling
Cc: Xu, Samuel; Soeren Sandmann; pixman@lists.freedesktop.org; Liu, Xinyun
Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

On Tuesday 07 September 2010 14:03:52 Ma, Ling wrote:
> > > > Wouldn't just the use of MOVD/MOVSS instructions here also solve
> > > > this problem? Store forwarding does not seem to be used for SIMD
> > > > according to the manual. I haven't benchmarked anything yet
> > > > though.
> [Ma Ling] movd is not good for Atom because it will use the AGU:
> " * Integer-FP/SIMD transfer: Instructions that transfer integer data to
> the FP/SIMD side of the machine also use the AGU. Examples of these
> instructions include MOVD, PINSRW. If one of the source registers of
> these instructions depends on the result of an execution unit, this
> dependency will also cause a delay of 3 cycles." 12.3.2.2 Address
> Generation

Well, a simple benchmark shows that MOVD performs at least as well as MOV if we compare the following loops:

 80483e0:  66 0f 6e 06   movd  (%esi),%xmm0
 80483e4:  66 0f eb c1   por   %xmm1,%xmm0
 80483e8:  66 0f 7e 07   movd  %xmm0,(%edi)
 80483ec:  49            dec   %ecx
 80483ed:  75 f1         jne   80483e0

vs.

 80483e0:  8b 06         mov   (%esi),%eax
 80483e2:  09 d8         or    %ebx,%eax
 80483e4:  89 07         mov   %eax,(%edi)
 80483e6:  49            dec   %ecx
 80483e7:  75 f7         jne   80483e0

Both of these loops take 3 cycles per iteration, as can easily be measured by a simple modification of the benchmark program that I posted earlier. The only advantage of the non-SSE code here is that it is smaller. But you are killing this code size advantage anyway by keeping both forward and backward copy variants.

The part of the optimization manual which you quoted relates to moving data between x86 and SSE registers, which is indeed a bad idea. But there is no need to do anything like this. You can just load data directly from memory into SSE registers, do some operations on it (using SSE instructions) and store the result to memory.

> > Your code still can be simplified a lot. I'm just not quite sure
> > whether it would be more practical to commit something first and
> > then refactor it with follow-up commits, or to attempt to make a
> > "perfect" patch before committing.
> [Ma Ling] Yes, I agree with you, let us commit it first, then strengthen
> it, for example by appending non-temporal instructions for large data
> copies that exceed the L1 cache size.

Actually, it's not up to me to decide. Much more important is whether Søren Sandmann agrees to this or not. I'm just afraid that in trying to reach perfection, we may get stuck with no further progress at some point. For example, some of the things which are clear and simple for me may be too unusual or unfamiliar for you, or possibly the other way around. But in any case, the other issues pointed out by Søren still can and need to be addressed.

Just one last question to clarify things. Did you write this SSSE3 assembly code yourself, or was the output of some C compiler (at least partially) used for it?

--
Best regards,
Siarhei Siamashka
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
On Friday 03 September 2010 01:39:54 Soeren Sandmann wrote:
> Siarhei Siamashka writes:
> > Apparently software prefetch also disables or interferes with the
> > hardware prefetcher on Intel Atom, hurting performance a lot. More
> > advanced processors can cope with it.
> >
> > But an increased prefetch distance is less effective (or can even
> > decrease performance) when dealing with small images, so it is not
> > always good.
> >
> > Are there any SSE2 capable x86 processors without hardware prefetch
> > capability? Maybe it's really a good idea to remove software prefetch
> > from the SSE2 fast path code?
>
> Yeah, it seems so. All data so far suggests that software prefetching
> is somewhere between 'slight slowdown' and 'no real effect'. The SSE2
> fast paths have very predictable memory access patterns, so a hardware
> prefetcher should be able to do a good job.
>
> (It might be worth investigating whether software prefetch would be
> beneficial in the tiled rotation fast paths, since the access patterns
> there could be much harder for the hardware to predict.)

So I guess it makes sense for Liu, Xinyun to resubmit a clean patch with the software prefetch removed from the sse2 code, instead of commenting it out? As suggested in:
http://lists.freedesktop.org/archives/pixman/2010-June/000220.html

--
Best regards,
Siarhei Siamashka
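For readers unfamiliar with the pattern being removed: a copy loop with explicit software prefetch looks roughly like the sketch below. This is an illustrative stand-in, not the actual pixman-sse2.c source; deleting the `_mm_prefetch` line is the kind of change the thread proposes, leaving correctness untouched and relying on the hardware prefetcher instead.

```c
#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Copy a scanline of 32-bit pixels, prefetching a fixed distance ahead
 * of the read pointer.  _mm_prefetch is only a hint and never faults,
 * so prefetching past the end of the buffer is safe.  On Atom, the
 * thread above found such explicit prefetch to interfere with the
 * hardware prefetcher; removing it does not change the result. */
static void
copy_with_prefetch (const uint32_t *src, uint32_t *dst, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        _mm_prefetch ((const char *) (src + i + 64), _MM_HINT_T0);
        __m128i x = _mm_loadu_si128 ((const __m128i *) (src + i));
        _mm_storeu_si128 ((__m128i *) (dst + i), x);
    }
    for (; i < n; i++)       /* scalar tail for the last few pixels */
        dst[i] = src[i];
}
```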
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
On Tuesday 07 September 2010 14:03:52 Ma, Ling wrote: > > > > Wouldn't just the use of MOVD/MOVSS instructions here also solve > > > > this problem? Store forwarding does not seem to be used for SIMD > > > > according to the manual. I haven't benchmarked anything yet > > > > though. > [Ma Ling] movd is not good for Atom because of it will use AGU- > " * Integer-FP/SIMD transfer: Instructions that transfer integer data to > the FP/SIMD side of the machine also uses AGU. Examples of these > instructions include MOVD, PINSRW. If one of the source register of these > instructions depends on the result of an execution unit, this dependency > will also cause a delay of 3 cycles." 12.3.2.2 Address Generation Well, a simple benchmark shows that MOVD performs at least as good as MOV if we compare the following loops: 80483e0: 66 0f 6e 06 movd (%esi),%xmm0 80483e4: 66 0f eb c1 por%xmm1,%xmm0 80483e8: 66 0f 7e 07 movd %xmm0,(%edi) 80483ec: 49 dec%ecx 80483ed: 75 f1 jne80483e0 vs. 80483e0: 8b 06 mov(%esi),%eax 80483e2: 09 d8 or %ebx,%eax 80483e4: 89 07 mov%eax,(%edi) 80483e6: 49 dec%ecx 80483e7: 75 f7 jne80483e0 Both of these loops use 3 cycles per iteration as can be easily measured by a simple modification of the benchmark program that I posted earlier. The only advantage of non-SSE code here is that it is smaller. But you are anyway killing this code size advantage by keeping both forward and backwards copy variants. The part from the optimization manual which you quoted is related to moving data between x86 and SSE registers, which is indeed a bad idea. But there is no need doing anything like this. You can just load data directly from memory to SSE registers, do some operations with it (using SSE instructions) and store the result to memory. > > Your code still can be simplified a lot. I'm just not quite sure whether it > > would be more practical to commit something first and then refactor it with > > the follow up commits. 
> > Or attempt to make a "perfect" patch before committing.
> [Ma Ling] Yes, I agree with you, let us commit it first, then strengthen it,
> such as appending non-temporal instructions for large data copies which are
> larger than the L1 cache size.

Actually it's not up to me to decide. Much more important is whether Søren Sandmann agrees to this or not. I'm just afraid that in trying to reach perfection, we may get stuck with no further progress at some point. For example, some of the things which are clear and simple for me may be too unusual or unfamiliar for you, or possibly the other way around. But in any case, the other issues pointed out by Søren still can and need to be addressed.

Just one last question to clarify things: did you write this SSSE3 assembly code yourself, or was the output of some C compiler (at least partially) used for it?

--
Best regards,
Siarhei Siamashka

signature.asc Description: This is a digitally signed message part.
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
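Siarhei's suggestion above -- load data straight from memory into SSE registers, operate on it there, and store it back, avoiding any GPR-to-XMM transfer -- can be sketched in C with SSE2 intrinsics. This is an illustrative sketch only, not pixman code: the helper name is made up, and a scalar fallback is included so it compiles on non-SSE2 targets.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: set the alpha byte of one x8r8g8b8 pixel.
 * With SSE2 the data stays on the SIMD side: a movd load from memory,
 * a por with the mask, a movd back -- no integer-register detour. */
static uint32_t
set_alpha_opaque (const uint32_t *src)
{
#if defined(__SSE2__)
    __m128i pix  = _mm_cvtsi32_si128 ((int) *src);     /* movd (mem), %xmm */
    __m128i mask = _mm_cvtsi32_si128 ((int) 0xff000000u);
    return (uint32_t) _mm_cvtsi128_si32 (_mm_or_si128 (pix, mask));
#else
    return *src | 0xff000000u;   /* scalar fallback */
#endif
}
```

Compilers generally fold `_mm_cvtsi32_si128` on a memory operand into a movd load from memory, matching the first benchmarked loop above.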
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
> Your code still can be simplified a lot. I'm just not quite sure whether it
> would be more practical to commit something first and then refactor it with
> the follow up commits. Or attempt to make a "perfect" patch before committing.

[Ma Ling] Yes, I agree with you, let us commit it first, then strengthen it,
such as appending non-temporal instructions for large data copies which are
larger than the L1 cache size.

Best Regards
Ling
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
Hi Siarhei Siamashka

> Could you elaborate? Preferably with a reference to the relevant section of
> the optimization manual. Because it looks like exactly the store forwarding
> address aliasing issue to me.

[Ma Ling]: Currently it is not described in our optimization manual; it will soon be published in a new version.

> My understanding is that the Intel Atom processor has some special logic for
> fast handling of a quite common read-after-write memory access pattern (some
> operations with local variables on the stack, for example). Unfortunately,
> only the lowest 12 bits of the address are used for the initial selection of
> this fast path. And in case it turns out to be a false positive later in the
> pipeline, there is some performance penalty to handle this situation
> correctly.

[Ma Ling]: Intel does have the store-forwarding HW optimization you mentioned, but there are some constraints on it, and the feature doesn't cover the cases we described before.

> So each loop iteration always takes 4 cycles, except when using the standard
> x86 mov instruction with aliasing of the lowest 12 bits of the address. The
> SSE instructions movd/movss do not have any aliasing problems.

> I still think this needs to be clarified. Using movd or movss instructions
> there has an additional benefit: you can use the same operation on pixel
> data in all cases.

[Ma Ling] movd is not good for Atom because it will use the AGU -
" * Integer-FP/SIMD transfer: Instructions that transfer integer data to
the FP/SIMD side of the machine also uses AGU. Examples of these
instructions include MOVD, PINSRW. If one of the source register of these
instructions depends on the result of an execution unit, this dependency
will also cause a delay of 3 cycles." 12.3.2.2 Address Generation

Best Regards
Ling
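The low-12-bit address comparison being argued over here can be modeled directly. This is an illustrative predicate, not anything in pixman, showing when two accesses collide in Atom's fast read-after-write path:

```c
#include <stdint.h>

/* Two accesses look identical to the fast path whenever the low 12 bits
 * (the offset within a 4K page) of their addresses match, even if they
 * are on different pages -- exactly the ALIASING case in Siarhei's
 * benchmark, where the load address is the store address plus 4096. */
static int
may_alias_4k (uintptr_t store_addr, uintptr_t load_addr)
{
    return ((store_addr ^ load_addr) & 0xfff) == 0;
}
```

The non-aliasing benchmark variant offsets the load address by an extra 64 bytes, which makes the low 12 bits differ and avoids the penalty.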
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
On Friday 03 September 2010 11:53:47 Xu, Samuel wrote:
> > * Siarhei asked whether it would be possible to unify the 32 and 64
> >   bit assembly sources. I don't think you commented on that.
> I think it is very difficult to unify 32 and 64 bit assembly src.

It's not so difficult if you change the explicit register names to some macros. Then you can have something like SRC_PTR in the code, which can expand to %rsi for x86-64 and to %esi for x86.

There are other things which can be improved too. For example, the use of the ".irp/.endr" directive can reduce source code size by collapsing the repeatable blocks.

Your code still can be simplified a lot. I'm just not quite sure whether it would be more practical to commit something first and then refactor it with the follow up commits. Or attempt to make a "perfect" patch before committing.

 pixman/pixman-access-ssse3_x86-64.S | 96 --
 1 files changed, 12 insertions(+), 84 deletions(-)

diff --git a/pixman/pixman-access-ssse3_x86-64.S b/pixman/pixman-access-ssse3_x86-64.S
index e7cf21f..0946d20 100755
--- a/pixman/pixman-access-ssse3_x86-64.S
+++ b/pixman/pixman-access-ssse3_x86-64.S
@@ -248,116 +248,44 @@ L(shl_0_cache_less_16bytes):
     add     %rdx, %rdi
     BRANCH_TO_JMPTBL_ENTRY (L(table_48bytes_fwd), %rdx, 4)

-L(shl_4):
+.irp shift, 4, 8, 12
+L(shl_\shift):
     lea     -32(%rdx), %rdx
     ALIGN (4)
-L(shl_4_loop):
+L(shl_\shift\()_loop):
     movaps  16(%rsi), %xmm2
     sub     $32, %rdx
     movaps  32(%rsi), %xmm3
     lea     32(%rsi), %rsi
     movdqa  %xmm3, %xmm4
-    palignr $4, %xmm2, %xmm3
+    palignr $\shift, %xmm2, %xmm3
     lea     32(%rdi), %rdi
-    palignr $4, %xmm1, %xmm2
+    palignr $\shift, %xmm1, %xmm2
     por     %xmm6, %xmm2
     movaps  %xmm2, -32(%rdi)
     por     %xmm6, %xmm3
     movaps  %xmm3, -16(%rdi)
-    jb      L(shl_4_end)
+    jb      L(shl_\shift\()_end)

     movaps  16(%rsi), %xmm2
     sub     $32, %rdx
     movaps  32(%rsi), %xmm3
     lea     32(%rsi), %rsi
     movdqa  %xmm3, %xmm1
-    palignr $4, %xmm2, %xmm3
+    palignr $\shift, %xmm2, %xmm3
     lea     32(%rdi), %rdi
-    palignr $4, %xmm4, %xmm2
+    palignr $\shift, %xmm4, %xmm2
     por     %xmm6, %xmm2
     movaps  %xmm2, -32(%rdi)
     por     %xmm6, %xmm3
     movaps  %xmm3, -16(%rdi)
-    jae     L(shl_4_loop)
-L(shl_4_end):
+    jae     L(shl_\shift\()_loop)
+L(shl_\shift\()_end):
     lea     32(%rdx), %rdx
-    lea     4(%rsi, %rdx), %rsi
-    add     %rdx, %rdi
-    BRANCH_TO_JMPTBL_ENTRY(L(table_48bytes_fwd), %rdx, 4)
-
-L(shl_8):
-    lea     -32(%rdx), %rdx
-    ALIGN (4)
-L(shl_8_loop):
-    movaps  16(%rsi), %xmm2
-    sub     $32, %rdx
-    movaps  32(%rsi), %xmm3
-    lea     32(%rsi), %rsi
-    movdqa  %xmm3, %xmm4
-    palignr $8, %xmm2, %xmm3
-    lea     32(%rdi), %rdi
-    palignr $8, %xmm1, %xmm2
-    por     %xmm6, %xmm2
-    movaps  %xmm2, -32(%rdi)
-    por     %xmm6, %xmm3
-    movaps  %xmm3, -16(%rdi)
-    jb      L(shl_8_end)
-
-    movaps  16(%rsi), %xmm2
-    sub     $32, %rdx
-    movaps  32(%rsi), %xmm3
-    lea     32(%rsi), %rsi
-    movdqa  %xmm3, %xmm1
-    palignr $8, %xmm2, %xmm3
-    lea     32(%rdi), %rdi
-    palignr $8, %xmm4, %xmm2
-    por     %xmm6, %xmm2
-    movaps  %xmm2, -32(%rdi)
-    por     %xmm6, %xmm3
-    movaps  %xmm3, -16(%rdi)
-    jae     L(shl_8_loop)
-L(shl_8_end):
-    lea     32(%rdx), %rdx
-    lea     8(%rsi, %rdx), %rsi
-    add     %rdx, %rdi
-    BRANCH_TO_JMPTBL_ENTRY(L(table_48bytes_fwd), %rdx, 4)
-
-L(shl_12):
-    lea     -32(%rdx), %rdx
-    ALIGN (4)
-L(shl_12_loop):
-    movaps  16(%rsi), %xmm2
-    sub     $32, %rdx
-    movaps  32(%rsi), %xmm3
-    lea     32(%rsi), %rsi
-    movdqa  %xmm3, %xmm4
-    palignr $12, %xmm2, %xmm3
-    lea     32(%rdi), %rdi
-    palignr $12, %xmm1, %xmm2
-    por     %xmm6, %xmm2
-    movaps  %xmm2, -32(%rdi)
-    por     %xmm6, %xmm3
-    movaps  %xmm3, -16(%rdi)
-    jb      L(shl_12_end)
-
-    movaps  16(%rsi), %xmm2
-    sub     $32, %rdx
-    movaps  32(%rsi), %xmm3
-    lea     32(%rsi), %rsi
-    movdqa  %xmm3, %xmm1
-    palignr $12, %xmm2, %xmm3
-    lea     32(%rdi), %rdi
-    palignr $12, %xmm4, %xmm2
-    por     %xmm6, %xmm2
-    movaps  %xmm2, -32(%rdi)
-    por     %xmm6, %xmm3
-    movaps  %xmm3, -16(%rdi)
-    jae     L(shl_12_loop)
-L(shl_12_end):
-    lea     32(%rdx), %rdx
-    lea     12(%rsi, %rdx), %rsi
+    lea     \shift\()(%rsi, %rdx), %rsi
     add     %rdx, %rdi
     BRANCH_TO_JMPTBL_ENTRY(L(table_48bytes_fwd), %rdx, 4)
+.endr

     ALIGN (4)
 L(fwd_write_44bytes):
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
On Friday 03 September 2010 11:53:47 Xu, Samuel wrote:
> > * Store forwarding
> >   - We need some comments in the assembly about the store forwarding
> >     that Ma Ling described.
> How about this comment added to the asm code:
> "The CPU doesn't check every address bit for src and dest, so a read can
> fail to recognize a different address even when the two are in different
> pages; the read then forces the previous write to commit its data from the
> store buffer to the cache, which seriously impacts performance. This is a
> workaround for this CPU limitation."
> >   - Siarhei's questions about it should be answered, preferably in
> >     those comments:
> >     So it is basically a store forwarding aliasing problem which is
> >     described in "12.3.3.1 Store Forwarding" section from "Intel(r) 64
> >     and IA-32 Architectures Optimization Reference Manual", right?
> The behavior doesn't belong to store forwarding as described in the
> "12.3.3.1 Store Forwarding" section of the "Intel(r) 64 and IA-32
> Architectures Optimization Reference Manual". That is another HW
> optimization.

Could you elaborate? Preferably with a reference to the relevant section of the optimization manual. Because it looks like exactly the store forwarding address aliasing issue to me.

My understanding is that the Intel Atom processor has some special logic for fast handling of a quite common read-after-write memory access pattern (some operations with local variables on the stack, for example). Unfortunately, only the lowest 12 bits of the address are used for the initial selection of this fast path. And in case it turns out to be a false positive later in the pipeline, there is some performance penalty to handle this situation correctly.

Here is a simple test program which can be used for benchmarking:

/***/
    .intel_syntax noprefix
    .global main

    .data

    .balign 16
buffer:
    .rept 8192
    .byte 0
    .endr

    .text

main:
    pusha

    lea     edi, buffer
#ifdef ALIASING
    lea     esi, [edi+4096]
#else
    lea     esi, [edi+4096+64]
#endif
    mov     ecx, 166000 /* 1.66GHz */
    jmp     1f

    /* main loop */
    .balign 16
1:
#ifdef SSE
    movd    dword ptr [edi], xmm0
    movd    dword ptr [edi+4], xmm1
    movd    xmm0, dword ptr [esi]
    movd    xmm1, dword ptr [esi+4]
#else
    mov     dword ptr [edi], eax
    mov     dword ptr [edi+4], ebx
    mov     eax, dword ptr [esi]
    mov     ebx, dword ptr [esi+4]
#endif
    dec     ecx
    jnz     1b

    popa
    ret
/***/

$ gcc -m32 -o bench-x86-noaliasing bench-storefw.S
$ gcc -m32 -DALIASING -o bench-x86-aliasing bench-storefw.S
$ gcc -m32 -DSSE -o bench-sse-noaliasing bench-storefw.S
$ gcc -m32 -DSSE -DALIASING -o bench-sse-aliasing bench-storefw.S

$ time ./bench-x86-noaliasing
real    0m4.057s
user    0m4.032s
sys     0m0.000s

$ time ./bench-x86-aliasing
real    0m11.059s
user    0m11.041s
sys     0m0.008s

$ time ./bench-sse-noaliasing
real    0m4.048s
user    0m4.036s
sys     0m0.004s

$ time ./bench-sse-aliasing
real    0m4.046s
user    0m4.036s
sys     0m0.004s

So each loop iteration always takes 4 cycles, except when using the standard x86 mov instruction with aliasing of the lowest 12 bits of the address. The SSE instructions movd/movss do not have any aliasing problems.

> > Wouldn't just the use of MOVD/MOVSS instructions here also solve
> > this problem? Store forwarding does not seem to be used for SIMD
> > according to the manual. I haven't benchmarked anything yet though.
> >
> > http://lists.cairographics.org/archives/pixman/2010-August/000425.html

I still think this needs to be clarified. Using movd or movss instructions there has an additional benefit: you can use the same operation on pixel data in all cases. Currently you use "por %xmm6, %xmm0" in the main loop and "or 40(%rsi), %ecx" for the trailing pixels. You could easily use just the "por" instruction everywhere. It's not a big deal here, just inconsistent.

With your approach and more complex compositing operations, you would need to maintain both SSE2 and x86 implementations separately for the main loop and for the trailing pixels. This can and IMHO should be avoided.

--
Best regards,
Siarhei Siamashka
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
Thanks Soren! Following are my comments. I'd appreciate explicit code guidance on the Sun Studio check; I have no experience with Sun Studio. After the Sun Studio questions are confirmed, we can send the next patch to address the rest of the issues.

Samuel

> * The 64 bit CPU detection is broken. It doesn't use SSE2 and MMX when
>   SSSE3 is not detected.

Confirmed and will fix in next patch as follows:

...
#endif /* end dt_cpu64() */

#ifdef USE_MMX
#if (!defined(__amd64__) && !defined(__x86_64__) && !defined(_M_AMD64))
//32 bit MMX
static pixman_bool_t
pixman_have_mmx (void)
{
    ..
    return mmx_present;
}
#else
//64 bit MMX
#define pixman_have_mmx() TRUE
#endif
#endif

#ifdef USE_SSE2
#if (!defined(__amd64__) && !defined(__x86_64__) && !defined(_M_AMD64))
//32 bit SSE2
static pixman_bool_t
pixman_have_sse2 (void)
{
    ...
    return sse2_present;
}
#else
//64 bit SSE2
#define pixman_have_sse2() TRUE
#endif
#endif

> * Siarhei asked whether it would be possible to unify the 32 and 64
>   bit assembly sources. I don't think you commented on that.

I think it is very difficult to unify the 32 and 64 bit assembly src.

> * We need a more detailed description in the git commit log

Will add more detail in next patch.

> * The check for the Sun Studio compiler is wrong. If SSSE3 is only
>   supported in the latest version of Sun Studio, then configure should
>   check for that version.

Could you show us the error information from Sun Studio? I have no experience with Sun Studio. I'd appreciate specific code guidance on the Sun Studio check.

> * Will Sun Studio actually accept GNU assembly files?

http://developers.sun.com/sunstudio/support/Ccompare.html mentioned that "Support for GNU-style inline assembly ..."

> * Why check for SSSE3 intrinsics at all? You don't use them for
>   anything. All use of SSSE3 is in assembly files. The thing that
>   needs to be checked for is whether the assembly files will pass
>   through the compiler.

Yes, it is an SSSE3 asm check.
Confirmed and will fix in next patch.

> * Store forwarding
>   - We need some comments in the assembly about the store forwarding
>     that Ma Ling described.

How about this comment added to the asm code:
"The CPU doesn't check every address bit for src and dest, so a read can fail to recognize a different address even when the two are in different pages; the read then forces the previous write to commit its data from the store buffer to the cache, which seriously impacts performance. This is a workaround for this CPU limitation."

>   - Siarhei's questions about it should be answered, preferably in
>     those comments:
>     So it is basically a store forwarding aliasing problem which is
>     described in "12.3.3.1 Store Forwarding" section from "Intel(r) 64
>     and IA-32 Architectures Optimization Reference Manual", right?

The behavior doesn't belong to store forwarding as described in the "12.3.3.1 Store Forwarding" section of the "Intel(r) 64 and IA-32 Architectures Optimization Reference Manual". That is another HW optimization.

>     Wouldn't just the use of MOVD/MOVSS instructions here also solve
>     this problem? Store forwarding does not seem to be used for SIMD
>     according to the manual. I haven't benchmarked anything yet though.
>     http://lists.cairographics.org/archives/pixman/2010-August/000425.html

> * About this line in the 64 bit assembly and the corresponding one in
>   the 32 bit assembly:
>   #if (defined(__amd64__) || defined(__x86_64__) || defined(_M_AMD64))
>   Makoto Kato asked:
>     Why _M_AMD64? _M_AMD64 macro is for MSVC++ x64 version. This
>     assembler is for GNU as.
>   You didn't comment on this.

Will remove _M_AMD64 inside asm file in next patch.

> * We might as well be consistent with the naming used in the ARM NEON
>   assembly files:
>   - The assembly file should be called "pixman-ssse3-asm.S", or
>     "pixman-ssse3-x86-32-asm.S" and "pixman-ssse3-x86-64-asm.S" if there
>     is a good reason they can't be unified.
Will change file name in next patch.

>   - The function in the assembly should be called something like
>     composite_line_src_x888__ssse3

Will change function name in next patch.

>     "line" to indicate the fact that it doesn't do full 2D, and the
>     "fast_path" in the current name is confusing because it suggests a
>     connection to the plain C 'fast path' implementation (there is no
>     such connection).

> * Cut-and-paste issues in pixman-ssse3:
>   - #include "pixman-combine32.h"
>     This is not used as far as I can tell

Will remove it in next patch.

>   - /* PIXMAN_OP_OVER */
>     The two fast paths are both SRC fast paths, not OVER.
>   - /*-
>      * composite_over__n_
>      */
>     The function is src_x888_(), not over__n_(). But the comment is
>     unnecessary since there is only one function.

Will remove it in next patch.
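For reference, the 32-bit runtime detection being discussed can be sketched with GCC's cpuid.h (native x86-64 builds simply hardwire SSE2/MMX to TRUE, as in the snippet above). The helper name here is illustrative, not pixman's actual detect_cpu_features():

```c
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
#endif

/* Hypothetical sketch: query CPUID leaf 1 and test the SSSE3 bit.
 * __get_cpuid returns 0 if the requested leaf is unsupported. */
static int
have_ssse3 (void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 9) & 1;   /* CPUID.01H:ECX bit 9 = SSSE3 */
#else
    return 0;                /* not an x86 processor */
#endif
}
```

MSVC builds would use the `__cpuid` intrinsic instead, which is what the discussed patch does for the `_MSC_VER` path.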
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
Siarhei Siamashka writes:
> Looks like this data has been posted already:
> http://lists.freedesktop.org/archives/pixman/2010-June/000231.html
>
> Checking a few more things with microbenchmarks shows that the prefetch
> distance of just 64 bytes ahead is way too small.

We should probably get these microbenchmarks into the tree. Microbenchmarks can be a little dangerous since they sometimes amplify irrelevant details and make them seem important, but they are clearly useful in many cases.

> It has to be increased up to something like 256-320 to get good
> memory performance. Apparently software prefetch also disables or
> interferes with the hardware prefetcher on Intel Atom, hurting
> performance a lot. More advanced processors can cope with it.
>
> But increased prefetch distance is less effective (or can even decrease
> performance) when dealing with small images, so it is not always good.
>
> Are there any SSE2 capable x86 processors without hardware prefetch
> capability?
> Maybe it's really a good idea to remove software prefetch from SSE2 fast
> path code?

Yeah, it seems so. All data so far suggests that software prefetching is somewhere between 'slight slowdown' and 'no real effect'. The SSE2 fast paths have very predictable memory access patterns, so a hardware prefetcher should be able to do a good job.

(It might be worth investigating whether software prefetch would be beneficial in the tiled rotation fast paths, since the access patterns there could be much harder to predict for the hardware).

Soren
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
"Xu, Samuel" writes:
> Hi, Siarhei Siamashka:
> Attached patch has updated copyright part (only copyright change). we
> referred http://cgit.freedesktop.org/pixman/tree/COPYING.
> Yes, As you assumed, we tested on multiple 32/64 bit boxes w/o
> and w/ SSSE3.

Here are some more comments:

* The 64 bit CPU detection is broken. It doesn't use SSE2 and MMX when
  SSSE3 is not detected.

* Siarhei asked whether it would be possible to unify the 32 and 64
  bit assembly sources. I don't think you commented on that.

* We need a more detailed description in the git commit log.

* The check for the Sun Studio compiler is wrong. If SSSE3 is only
  supported in the latest version of Sun Studio, then configure should
  check for that version.

* Will Sun Studio actually accept GNU assembly files?

* Why check for SSSE3 intrinsics at all? You don't use them for
  anything. All use of SSSE3 is in assembly files. The thing that
  needs to be checked for is whether the assembly files will pass
  through the compiler.

* Store forwarding

  - We need some comments in the assembly about the store forwarding
    that Ma Ling described.

  - Siarhei's questions about it should be answered, preferably in
    those comments:

        So it is basically a store forwarding aliasing problem which is
        described in "12.3.3.1 Store Forwarding" section from "Intel® 64
        and IA-32 Architectures Optimization Reference Manual", right?

        Wouldn't just the use of MOVD/MOVSS instructions here also solve
        this problem? Store forwarding does not seem to be used for SIMD
        according to the manual. I haven't benchmarked anything yet
        though.

        http://lists.cairographics.org/archives/pixman/2010-August/000425.html

* About this line in the 64 bit assembly and the corresponding one in
  the 32 bit assembly:

      #if (defined(__amd64__) || defined(__x86_64__) || defined(_M_AMD64))

  Makoto Kato asked:

      Why _M_AMD64? _M_AMD64 macro is for MSVC++ x64 version. This
      assembler is for GNU as.

  You didn't comment on this.
* We might as well be consistent with the naming used in the ARM NEON
  assembly files:

  - The assembly file should be called "pixman-ssse3-asm.S", or
    "pixman-ssse3-x86-32-asm.S" and "pixman-ssse3-x86-64-asm.S" if there
    is a good reason they can't be unified.

  - The function in the assembly should be called something like
    composite_line_src_x888__ssse3

    "line" to indicate the fact that it doesn't do full 2D, and the
    "fast_path" in the current name is confusing because it suggests a
    connection to the plain C 'fast path' implementation (there is no
    such connection).

* Cut-and-paste issues in pixman-ssse3:

  - #include "pixman-combine32.h"

    This is not used as far as I can tell.

  - /* PIXMAN_OP_OVER */

    The two fast paths are both SRC fast paths, not OVER.

  - /*-
     * composite_over__n_
     */

    The function is src_x888_(), not over__n_(). But the comment is
    unnecessary since there is only one function.

Soren
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
On Monday 30 August 2010 23:31:35 Siarhei Siamashka wrote:
> And Intel Atom does not like software prefetch very much. This reminds me
> of an older report:
> http://lists.freedesktop.org/archives/pixman/2010-June/000218.html
>
> I can try to run a full set of cairo-perf-trace benchmarks to get more
> relevant numbers regarding the effect of software prefetch on Intel Atom.

Looks like this data has been posted already:
http://lists.freedesktop.org/archives/pixman/2010-June/000231.html

Checking a few more things with microbenchmarks shows that the prefetch distance of just 64 bytes ahead is way too small. It has to be increased up to something like 256-320 to get good memory performance. Apparently software prefetch also disables or interferes with the hardware prefetcher on Intel Atom, hurting performance a lot. More advanced processors can cope with it.

But an increased prefetch distance is less effective (or can even decrease performance) when dealing with small images, so it is not always good.

Are there any SSE2 capable x86 processors without hardware prefetch capability? Maybe it's really a good idea to remove software prefetch from the SSE2 fast path code?

--
Best regards,
Siarhei Siamashka
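The prefetch-distance trade-off discussed here can be illustrated with a plain C copy loop; `__builtin_prefetch` is a GCC/Clang builtin that compiles to a no-op on targets without a prefetch instruction. The 256-byte distance is just the lower end of the range measured above, not a universally right value, and the function is an illustrative sketch rather than pixman code:

```c
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_DISTANCE 256   /* bytes ahead; 64 was measured to be too small */

/* Copy one scanline of 32bpp pixels, hinting future cache lines.
 * Correctness never depends on the prefetch; it is purely a hint. */
static void
copy_line_with_prefetch (uint32_t *dst, const uint32_t *src, size_t n_pixels)
{
    size_t i;

    for (i = 0; i < n_pixels; i++)
    {
        __builtin_prefetch ((const char *) (src + i) + PREFETCH_DISTANCE);
        dst[i] = src[i];
    }
}
```

For destinations far larger than the L2 cache, the data in this thread suggests dropping the software prefetch entirely and trusting the hardware prefetcher.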
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
On Monday 30 August 2010 16:06:45 Siarhei Siamashka wrote:
> Also I tried to run my simple microbenchmarking program on an Intel Atom
> N450 netbook, x86_64 system. The results are the following:
>
> --
> All results are presented in millions of pixels per second
> L1 - small Xx1 rectangle (fitting L1 cache), always blitted at the same
>      memory location with small drift in horizontal direction
> L2 - small XxY rectangle (fitting L2 cache), always blitted at the same
>      memory location with small drift in horizontal direction
> M  - large 1856x1080 rectangle, always blitted at the same
>      memory location with small drift in horizontal direction
> HT - random rectangles with 32x32 average size are copied from
>      one 1920x1080 buffer to another, traversing from left to right
>      and from top to bottom
> VT - random rectangles with 32x32 average size are copied from
>      one 1920x1080 buffer to another, traversing from top to bottom
>      and from left to right
> R  - random rectangles with 32x32 average size are copied from
>      random locations of one 1920x1080 buffer to another
> ---
> reference memcpy speed = 1007.2MB/s (251.8MP/s for 32bpp pixels)
>
> --- C ---
> src_x888_ = L1: 265.13 L2: 223.61 M:216.12 HT:109.69 VT: 80.23 R: 75.42
>
> --- SSE2 ---
> src_x888_ = L1: 611.50 L2: 494.79 M:120.17 HT: 83.57 VT: 79.48 R: 60.30
>
> --- SSE2 (prefetch removed) ---
> src_x888_ = L1: 683.39 L2: 539.57 M:260.07 HT:128.22 VT: 85.23 R: 81.40
>
> --- SSSE3 ---
> src_x888_ = L1:1559.35 L2: 798.95 M:254.31 HT:129.29 VT: 84.67 R: 75.00

And now a benchmark from an Intel Core i7 860 too (with the buffer sizes increased to be sure that they are much larger than the 8MB cache):

---
reference memcpy speed = 6491.5MB/s (1622.9MP/s for 32bpp pixels)

--- C ---
src_x888_ = L1:1145.72 L2:1124.48 M:902.11 HT:438.52 VT:351.58 R:316.74

--- SSE2 ---
src_x888_ = L1:5628.96 L2:2766.65 M:1060.89 HT:576.04 VT:488.81 R:418.52

--- SSE2 (prefetch removed) ---
src_x888_ = L1:6140.86 L2:2730.58 M:1066.52 HT:594.88 VT:484.12 R:423.24

--- SSSE3 ---
src_x888_ = L1:5182.91 L2:2617.74 M:1064.30 HT:673.20 VT:568.15 R:441.97

So it looks like on Intel Core i7 the differences between all the SSE2/SSSE3 implementations are mostly irrelevant. Standard memcpy shows better memory bandwidth because it uses nontemporal writes for copying very large buffers. If save_128_write_combining() is used instead of save_128_aligned() in sse2_composite_src_x888_(), then the src_x888_ operation becomes just as fast as memcpy.

For Intel Atom, SSSE3 seems to improve performance a lot when working within the L1/L2 cache, but has no advantage when working with large memory buffers. Still, I think it may be useful for hyperthreading (the other virtual thread may have more resources to do something while src_x888_ is waiting for memory). On the other hand, nontemporal writes can increase performance noticeably if the buffer is known to be much bigger than the L2 cache.

And Intel Atom does not like software prefetch very much. This reminds me of an older report:
http://lists.freedesktop.org/archives/pixman/2010-June/000218.html

I can try to run a full set of cairo-perf-trace benchmarks to get more relevant numbers regarding the effect of software prefetch on Intel Atom.

--
Best regards,
Siarhei Siamashka
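The nontemporal-write idea mentioned above (save_128_write_combining() vs save_128_aligned()) boils down to using movnt stores that bypass the cache. Here is a hedged sketch with SSE2 intrinsics; the function name is made up, and a scalar fallback keeps it portable:

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Store four identical 32bpp pixels to a 16-byte-aligned destination
 * without polluting the cache -- worthwhile only when the destination
 * is known to be much larger than the L2 cache. */
static void
store4_nontemporal (uint32_t *dst /* must be 16-byte aligned */, uint32_t pixel)
{
#if defined(__SSE2__)
    _mm_stream_si128 ((__m128i *) dst, _mm_set1_epi32 ((int) pixel));
    _mm_sfence ();   /* flush the write-combining buffer to memory */
#else
    int i;
    for (i = 0; i < 4; i++)
        dst[i] = pixel;
#endif
}
```

A fence per store is shown only for self-containedness; real code would issue one sfence at the end of the whole composite operation, not per 16-byte block.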
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
On Monday 30 August 2010 12:26:27 Xu, Samuel wrote:
> Hi, Siarhei Siamashka:
> Sorry for a typo, fixed version is attached. Pls ignore previous mail.
>
> New patch, which contains
> 1) Simplified 64 bit detect_cpu_features(), only check SSSE3 bit. _MSC_VER
> path switched to __cpuid to avoid inline asm. _MSC_VER path CPUID code
> snippet is extracted and tested as a standalone C file on a 64 bit windows
> system, using VS2010.
> 2) removed "merging", pixman-ssse3.c is shaped as our discussion.

+/*
+ * Copyright 2010 Intel Corporation
+ *
+ * Permission to use, copy, modify, distribute, and sell this software and its
+ * documentation for any purpose is hereby granted without fee, provided that
+ * the above copyright notice appear in all copies and that both that
+ * copyright notice and this permission notice appear in supporting
+ * documentation, and that the name of Mozilla Corporation not be used in
                                        ^^^
Mozilla Corporation? Looks like this is yet another case of careless copy/paste.

And I would like to remind again that the following text of the copyright notice is preferred for new code. It is not strictly necessary, but I just want to be sure that you checked this link:
http://cgit.freedesktop.org/pixman/tree/COPYING

Other than that, and assuming that the patch was properly tested on 32-bit and 64-bit machines both with and without SSSE3 support, I don't see any remaining really blocking issues. Or, more like, I'm giving up and would like to pass the baton to somebody else.

Also I tried to run my simple microbenchmarking program on an Intel Atom N450 netbook, x86_64 system.
The results are the following:

All results are presented in millions of pixels per second

L1 - small Xx1 rectangle (fitting L1 cache), always blitted at the same memory location with small drift in horizontal direction
L2 - small XxY rectangle (fitting L2 cache), always blitted at the same memory location with small drift in horizontal direction
M  - large 1856x1080 rectangle, always blitted at the same memory location with small drift in horizontal direction
HT - random rectangles with 32x32 average size are copied from one 1920x1080 buffer to another, traversing from left to right and from top to bottom
VT - random rectangles with 32x32 average size are copied from one 1920x1080 buffer to another, traversing from top to bottom and from left to right
R  - random rectangles with 32x32 average size are copied from random locations of one 1920x1080 buffer to another

reference memcpy speed = 1007.2MB/s (251.8MP/s for 32bpp pixels)

--- C ---
src_x888_ = L1: 265.13 L2: 223.61 M:216.12 HT:109.69 VT: 80.23 R: 75.42
--- SSE2 ---
src_x888_ = L1: 611.50 L2: 494.79 M:120.17 HT: 83.57 VT: 79.48 R: 60.30
--- SSE2 (prefetch removed) ---
src_x888_ = L1: 683.39 L2: 539.57 M:260.07 HT:128.22 VT: 85.23 R: 81.40
--- SSSE3 ---
src_x888_ = L1:1559.35 L2: 798.95 M:254.31 HT:129.29 VT: 84.67 R: 75.00

-- Best regards, Siarhei Siamashka
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
On Sunday 29 August 2010 12:25:47 Xu, Samuel wrote: > Hi, Siarhei Siamashka: > Q:---" What problems do you have without "merge" mechanism?" > A: Of course there isn't correctness issue w/o "merge". > Currently, sse2_fast_paths/mmx_fast_paths/c_fast_paths...are excluded each > other, although some checking forms delegate chain. While after delegate > chain formed, only one fast table effect. Is my understanding correct? So > after ssse3_fast_paths is newly added, only SSSE3 table will be effect > after SSSE3 CPUID is detected. Of course we can just keep 2 new entries > with SSSE3 asm optimizations, the issue is: we lost the optimizations > which already exist in current SSE2 table. So, w/o merging or w/o totally > duplication from SSE2 file, there will be performance unfriendly. No, all the fast path tables are tried one after another, so there should be no problem using SSE2 and MMX fast paths as fallbacks. That's the whole point of having delegates. If you see any issue here, that's likely a bug. Still merging fast path tables may make the search for the needed fast path function a bit faster in the case of a fast path cache miss. And there might be a bit more motivation to pursue such optimizations once SSE2 is not at a top of delegate chain anymore ;) This is especially important for the combiner functions which are called for each scanline, so walking the delegate chain is highly undesirable there. The simple solution at least for the combiner functions is to just copy all the pointers for unimplemented functions from the previous implementation at setup time. And thus make sure that 'delegate_combine_32' is never ever called. 
I tried to propose something like this for 'pixman_blt' delegates earlier (for example, see "imp->blt = fast->blt;" in the changes to pixman-vmx.c): http://cgit.freedesktop.org/~siamashka/pixman/log/?h=no-delegates

But as I mentioned before, any merging of fast path tables or improvement to the delegate chain handling is an additional optimization. And it belongs in a separate patch.

> Sure, It is ok for us for this "correctness firstly, then performance in
> next wave" philosophy. A new SSSE3 file with only 2 fast path entries is
> ok?

Yes, this looks right.

> For simplifying on CPU detection, I think same "correctness firstly, then
> performance in next wave" philosophy might be followed. A full CPU
> detection for 64 bit can be added firstly as a baseline, the one who can
> test win32 and Solaris can help us to make it shorter.

What "correctness" are you talking about? As already noticed by Makoto Kato earlier [1], MSVC does not support inline assembly in 64-bit mode. So your patch was clearly broken and untested in this configuration.

IMHO it is always easier to add missing functionality in the future (support for the SSSE3 optimizations on systems other than Linux) than to remove broken, untested, system-dependent garbage from the source code. Especially considering that these other systems still need to do something about the GNU assembler part of your patch. But again, a comment from somebody else would be very much welcome here. I don't have a strong opinion about this part.

1. http://lists.freedesktop.org/archives/pixman/2010-August/000424.html

-- Best regards, Siarhei Siamashka
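[Editor's note] The setup-time pointer copying suggested above (so that 'delegate_combine_32' is never called per scanline) could look roughly like this. The type and function names are illustrative assumptions, not pixman's actual internals:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical combiner signature and implementation struct */
typedef void (*combine_32_func_t) (uint32_t *dest, const uint32_t *src, int width);

#define N_COMBINERS 64

typedef struct implementation implementation_t;
struct implementation
{
    implementation_t  *fallback;		/* next in delegate chain */
    combine_32_func_t  combine_32[N_COMBINERS];
};

/* Trivial demonstration combiners */
static void
combine_src_32 (uint32_t *dest, const uint32_t *src, int width)
{
    for (int i = 0; i < width; i++)
	dest[i] = src[i];
}

static void
combine_add_32 (uint32_t *dest, const uint32_t *src, int width)
{
    for (int i = 0; i < width; i++)
	dest[i] += src[i];
}

/* At setup time, fill every combiner slot this implementation does not
 * override with the fallback's pointer, so no per-scanline delegate
 * walk ever happens. */
static void
inherit_missing_combiners (implementation_t *imp)
{
    if (imp->fallback == NULL)
	return;
    for (int i = 0; i < N_COMBINERS; i++)
	if (imp->combine_32[i] == NULL)
	    imp->combine_32[i] = imp->fallback->combine_32[i];
}
```

The design point: the copy happens once at implementation setup, so the scanline inner loop always calls a real function pointer directly.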
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
Hi, Siarhei Siamashka: Q:---" What problems do you have without "merge" mechanism?" A: Of course there isn't correctness issue w/o "merge". Currently, sse2_fast_paths/mmx_fast_paths/c_fast_paths...are excluded each other, although some checking forms delegate chain. While after delegate chain formed, only one fast table effect. Is my understanding correct? So after ssse3_fast_paths is newly added, only SSSE3 table will be effect after SSSE3 CPUID is detected. Of course we can just keep 2 new entries with SSSE3 asm optimizations, the issue is: we lost the optimizations which already exist in current SSE2 table. So, w/o merging or w/o totally duplication from SSE2 file, there will be performance unfriendly. Sure, It is ok for us for this "correctness firstly, then performance in next wave" philosophy. A new SSSE3 file with only 2 fast path entries is ok? For simplifying on CPU detection, I think same " correctness firstly, then performance in next wave" philosophy might be followed. A full CPU detection for 64 bit can be added firstly as a baseline, the one who can test win32 and Solaris can help us to make it shorter. As a wrap, new path will: 1) keep most 64 bit CPU detection as currently patch (GNU C part can reduce some edx checking) 2) A new SSSE3 file with only 2 fast path entries for newly added ASM optimization Comments? Samuel -Original Message- From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com] Sent: Friday, August 27, 2010 10:57 PM To: Xu, Samuel Cc: pixman@lists.freedesktop.org; Ma, Ling; Liu, Xinyun Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8 On Friday 27 August 2010 15:00:49 Xu, Samuel wrote: > Hi, Siarhei Siamashka: > Thanks for quick response! > For 64 bit detect_cpu_features(), if ignore HAVE_GETISAX and _MSC_VER, > it is ok for us to simplify it as your example in next update. If you can ensure MSVC compatibility and make it work with your optimizations, then it would be really great. 
But if it is totally untested, I don't feel comfortable about having it just blindly replicated from 32 to 64 bits with the hope that it will work. It's just my opinion, the others may disagree. And the others may also try to test your patch on win32 or solaris systems, providing a lot more useful feedback than me. > For pixman-ssse3.c, maybe we have 2 options: > 1) duplicate 6562 lines from pixman-sse2.c to new pixman- > ssse3.c in 1st patch (of course to replace 2 entries with newly added > SSSE3 asm optimization), and then add "merge" mechanism in later patch. No, there is no need to duplicate anything. > 2) firstly add "merge" mechanism patch, and the added new pixman-ssse3.c in > later patch, which might be shorter (111 lines) Does it mean > 1) option is preferred? What problems do you have without "merge" mechanism? The pixman-sse2.c works fine without it, and it does properly fallback to MMX code if SSE2 does not support some operations. Similarly, SSSE3 can fallback to SSE2 in the very same way. -- Best regards, Siarhei Siamashka ___ Pixman mailing list Pixman@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/pixman
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
On Friday 27 August 2010 15:00:49 Xu, Samuel wrote: > Hi, Siarhei Siamashka: > Thanks for quick response! > For 64 bit detect_cpu_features(), if ignore HAVE_GETISAX and _MSC_VER, > it is ok for us to simplify it as your example in next update. If you can ensure MSVC compatibility and make it work with your optimizations, then it would be really great. But if it is totally untested, I don't feel comfortable about having it just blindly replicated from 32 to 64 bits with the hope that it will work. It's just my opinion, the others may disagree. And the others may also try to test your patch on win32 or solaris systems, providing a lot more useful feedback than me. > For pixman-ssse3.c, maybe we have 2 options: > 1) duplicate 6562 lines from pixman-sse2.c to new pixman- > ssse3.c in 1st patch (of course to replace 2 entries with newly added > SSSE3 asm optimization), and then add "merge" mechanism in later patch. No, there is no need to duplicate anything. > 2) firstly add "merge" mechanism patch, and the added new pixman-ssse3.c in > later patch, which might be shorter (111 lines) Does it mean > 1) option is preferred? What problems do you have without "merge" mechanism? The pixman-sse2.c works fine without it, and it does properly fallback to MMX code if SSE2 does not support some operations. Similarly, SSSE3 can fallback to SSE2 in the very same way. -- Best regards, Siarhei Siamashka signature.asc Description: This is a digitally signed message part. ___ Pixman mailing list Pixman@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/pixman
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
I guess detect_cpu_features() is trying to support both MSVC and GNU. For CPUID, intric is not used. I am not sure the real reason. While I didn't find CPUID intric in intric header for GNU. -Original Message- From: Makoto Kato [mailto:m_k...@ga2.so-net.ne.jp] Sent: Friday, August 27, 2010 2:15 PM To: Xu, Samuel Cc: pixman@lists.freedesktop.org Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8 Hi, Xu. > --- /dev/null > +++ b/pixman/pixman-access-ssse3_x86-64.S .. > +#if (defined(__amd64__) || defined(__x86_64__) ||defined(_M_AMD64)) why _M_AMD64? _M_AMD64 macro is for MSVC++ x64 version. This assembler is for GNU as. Also, if MSVC++ 8.0 (14.00) or later, we should use __cpuid for x64 and x86 compatibility. http://msdn.microsoft.com/ja-jp/library/hskdteyh.aspx -- Makoto On 2010/08/27 11:59, Xu, Samuel wrote: > Hi Siarhei Siamashka, > > Here is a new patch, can you review it? Thank you! > It address following suggestions: > 1: SSSE3 file is split to a new file. Comparing with to duplicate every > content from SSE2 file, I added a way to merge new fast path function pointer > table runtimely by CPUID check, to avoid huge code expansion and a bug you > pointed out.(Without runtime merging, I have to duplicate every line from > SSE2 file to SSSE3 file, it is really not graceful and redundant. > 2: Copy right information added > 3: Makefile fix > 4: author name fix > > For detect_cpu_features(), we possibly can remove the TWO lines asm code to > copy edx to "result", to avoid SSE2 and MMX checking, since most preparing > code still need. If we keep this 2 lines, it still give us some possibility > to check other edx information of CPUID in the future. It is only executed > once when pixman is loaded. 
> For ASM code of 'bk_write' and 'table_48_bytes_bwd', here is the reason
> "why we use backward copy mode in the patch", from Ling:
>
> All read operations after the allocation stage can run speculatively; all
> write operations run in program order, and if the addresses are different
> a read may run before an older write operation, otherwise it waits until
> the write commits. However, the CPU does not check every address bit, so
> a read can fail to recognize a different address even when the two are in
> different pages.
>
> For example, if %rsi is 0xf004 and %rdi is 0xe008, the following sequence
> generates a big performance latency:
>
> 1. movq (%rsi), %rax
> 2. movq %rax, (%rdi)
> 3. movq 8(%rsi), %rax
> 4. movq %rax, 8(%rdi)
>
> If %rsi and %rdi really were in the same memory page, there would be a
> TRUE read-after-write dependence, because instruction 2 writes 0x008 and
> instruction 3 reads 0x00c; the two accesses partially overlap. Actually
> they are in different pages and there is no real dependence, but without
> checking every address bit the CPU may assume they are in the same page,
> and instruction 3 then has to wait for instruction 2 to write its data
> from the write buffer into the cache before loading it back; the time the
> read spends is equal to an mfence instruction.
>
> We can avoid this by tuning the operation sequence as follows:
>
> 1. movq 8(%rsi), %rax
> 2. movq %rax, 8(%rdi)
> 3. movq (%rsi), %rax
> 4. movq %rax, (%rdi)
>
> Instruction 3 reads 0x004 while instruction 2 writes address 0x010, so
> there is no dependence. In this patch we first handle small sizes (less
> than 48 bytes) this way, and the approach helps us improve up to 2X on
> Atom.
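[Editor's note] The reordering described above can be mimicked in C for a small fixed-size copy. This is only an illustrative sketch of the idea: the real patch does it in assembly, where the instruction order is guaranteed, whereas a C compiler is free to reschedule these accesses.

```c
#include <stdint.h>
#include <string.h>

/* Copy 16 bytes, touching the higher-addressed 8-byte half first.
 * When src and dst nearly alias in the low 12 address bits, this
 * ordering keeps each load's address disjoint (in those low bits)
 * from the immediately preceding store, sidestepping the false
 * store-forwarding stall described above. */
static void
copy16_bwd (void *dst, const void *src)
{
    uint64_t hi, lo;

    memcpy (&hi, (const char *) src + 8, 8);	/* 1. read high half  */
    memcpy ((char *) dst + 8, &hi, 8);		/* 2. write high half */
    memcpy (&lo, src, 8);			/* 3. read low half   */
    memcpy (dst, &lo, 8);			/* 4. write low half  */
}
```

With the forward order, step 3's load address could partially overlap step 2's store address in the low bits the CPU compares, forcing the load to wait for the store to drain; with the backward order the adjacent load/store pairs never overlap in those bits.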
> Regards,
> Liu, Xinyun
>
> -----Original Message-----
> From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com]
> Sent: Sunday, August 22, 2010 1:49 AM
> To: Liu, Xinyun
> Cc: pixman@lists.freedesktop.org; Ma, Ling; Xu, Samuel
> Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
>
> On Friday 20 August 2010 18:39:47 Liu, Xinyun wrote:
>> Hi Siarhei Siamashka,
>>
>> Here is a new patch, can you review it? Thank you!
> Sure, thanks for the updated patch. Some comments follow.
>
>> From 9783651899a2763d7fcca2960fc354bd1f541980 Mon Sep 17 00:00:00 2001
>> From: root
> A minor nitpick here. The 'From:' header seems to be wrong; it does not
> look like a real patch author name.
>
>> +dnl Check for SSSE3
>> +
>> +if test "x$SSSE3_CFLAGS" = "x" ; then
>> +   if test "x$SUNCC" = "xyes"; then
>> +      # SSSE3 is enabled by default in the Sun Studio 64-bit environment
> Is it a valid statement with regards to Sun Studio? Looks like it is a
> direct copy/paste from SSE2. Can we be sure that there is nothing
> redundant or just plain wrong here?
>
>> +#if defined(__GNUC__) && (__GNUC__ < 4 || (__GNUC__ == 4 &&
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
Hi, Siarhei Siamashka: Thanks for quick response! For 64 bit detect_cpu_features(), if ignore HAVE_GETISAX and _MSC_VER, it is ok for us to simplify it as your example in next update. For pixman-ssse3.c, maybe we have 2 options: 1) duplicate 6562 lines from pixman-sse2.c to new pixman-ssse3.c in 1st patch (of course to replace 2 entries with newly added SSSE3 asm optimization), and then add "merge" mechanism in later patch. 2) firstly add "merge" mechanism patch, and the added new pixman-ssse3.c in later patch, which might be shorter (111 lines) Does it mean 1) option is preferred? We will response soon after your guide on how to split SSSE3 C file. Samuel -Original Message- From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com] Sent: Friday, August 27, 2010 6:00 PM To: Xu, Samuel Cc: pixman@lists.freedesktop.org; Ma, Ling; Liu, Xinyun Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8 On Friday 27 August 2010 05:59:00 Xu, Samuel wrote: > Hi Siarhei Siamashka, > > Here is a new patch, can you review it? Thank you! > It address following suggestions: > 1: SSSE3 file is split to a new file. Thanks. > Comparing with to duplicate every > content from SSE2 file, I added a way to merge new fast path function > pointer table runtimely by CPUID check, to avoid huge code expansion > and a bug you pointed out.(Without runtime merging, I have to > duplicate every line from SSE2 file to SSSE3 file, it is really not > graceful and redundant. Just implementing everything in the same way as done in SSE2 code, you would get a proper delegate chain trying SSSE3 fast paths first, then trying SSE2, MMX, C fast paths (compositing in one pass) and finally fallback to a slow general path code (fetch, combine and store steps done in separate passes). The runtime merging is not strictly needed, even though it could probably improve performance somewhat. But this clearly belongs to a separate patch. 
> 2: Copy right information added > 3: Makefile fix > 4: author name fix > > For detect_cpu_features(), we possibly can remove the TWO lines asm > code to copy edx to "result", to avoid SSE2 and MMX checking, since > most preparing code still need. If we keep this 2 lines, it still give > us some possibility to check other edx information of CPUID in the > future. It is only executed once when pixman is loaded. Wouldn't it be just as simple as doing the following in the block of code under __GNUC__ and __amd64__ or __x86_64__ ifdefs? int have_ssse3_support () { uint32_t cpuid_ecx; __asm__ ( "mov $1, %%eax\n" "cpuid\n" "mov %%ecx, %0\n" : "=r" (cpuid_ecx) : : "%rax", "%rbx", "%rcx", "%rdx" ); return cpuid_ecx & (1 << 9); } And we already know that we have MMX and SSE2 on all x86_64 hardware even without checking. MSVC and any compilers other than GNU C are probably not so interesting initially. They just should not fail when building pixman. > For ASM code of 'bk_write' and 'table_48_bytes_bwd', here is the > reason of "why we use backward copy mode in the patch ", from Ling: > All read operations after allocation stage can run speculatively, all > write operation will run in program order, and if addresses are > different read may run before older write operation, otherwise wait until > write commit. > However CPU don't check each address bit, so read could fail to > recognize different address even theyare in different page. For > example if rsi is 0xf004, rdi is 0xe008,in following operation there will > generate big > performance latency. 1. movq (%rsi), %rax > 2. movq %rax, (%rdi) > 3. movq 8(%rsi), %rax > 4. movq %rax, 8(%rdi) > > If %rsi and rdi were in really the same meory page, there are TRUE > read-after-write dependence because instruction 2 write 0x008 and > instruction 3 read 0x00c, the two address are overlap partially. 
> Actually there are in different page and no any issues, but without > checking each address bit CPU could think they are in the same page, > and instruction 3 have to wait for instruction 2 to write data into > cache from write buffer, then load data from cache, the cost time read > spent is equal to mfence instruction. We may avoid it by tuning operation > sequence as follow. > 1. movq 8(%rsi), %rax > 2. movq %rax, 8(%rdi) > 3. movq (%rsi), %rax > 4. movq %rax, (%rdi) > Instruction 3 read 0x004, instruction 2 write address 0x010, no any > dependence. In this patch we first handle small size(less 48bytes) by > this way, and the approach help us
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
On Friday 27 August 2010 05:59:00 Xu, Samuel wrote:
> Hi Siarhei Siamashka,
>
> Here is a new patch, can you review it? Thank you!
> It addresses the following suggestions:
> 1: SSSE3 file is split to a new file.

Thanks.

> Compared with duplicating all the content from the SSE2 file, I added a
> way to merge the new fast path function pointer table at runtime by CPUID
> check, to avoid huge code expansion and a bug you pointed out. (Without
> runtime merging, I have to duplicate every line from the SSE2 file to the
> SSSE3 file, which is really not graceful and redundant.)

Just by implementing everything in the same way as done in the SSE2 code, you would get a proper delegate chain trying SSSE3 fast paths first, then trying SSE2, MMX and C fast paths (compositing in one pass), and finally falling back to the slow general path code (fetch, combine and store steps done in separate passes). The runtime merging is not strictly needed, even though it could probably improve performance somewhat. But this clearly belongs in a separate patch.

> 2: Copyright information added
> 3: Makefile fix
> 4: Author name fix
>
> For detect_cpu_features(), we possibly can remove the TWO lines of asm
> code that copy edx to "result", to avoid SSE2 and MMX checking, since
> most of the preparation code is still needed. If we keep these 2 lines,
> they still give us some possibility to check other edx information of
> CPUID in the future. It is only executed once when pixman is loaded.

Wouldn't it be just as simple as doing the following in the block of code under __GNUC__ and __amd64__ or __x86_64__ ifdefs?

int have_ssse3_support ()
{
    uint32_t cpuid_ecx;
    __asm__ (
        "mov $1, %%eax\n"
        "cpuid\n"
        "mov %%ecx, %0\n"
        : "=r" (cpuid_ecx) : : "%rax", "%rbx", "%rcx", "%rdx");
    return cpuid_ecx & (1 << 9);
}

And we already know that we have MMX and SSE2 on all x86_64 hardware even without checking. MSVC and any compilers other than GNU C are probably not so interesting initially. They just should not fail when building pixman.
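[Editor's note] On GCC the same check can also be written without hand-rolled inline assembly via the <cpuid.h> helper, which handles the CPUID plumbing (including %ebx preservation under PIC) itself. A sketch, not part of the patch under review:

```c
#include <cpuid.h>

static int
have_ssse3 (void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if leaf 1 is not supported */
    if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx))
	return 0;

    return (ecx & bit_SSSE3) != 0;	/* bit_SSSE3 == (1 << 9) */
}
```

This matches the bit tested in the inline-asm version above (ECX bit 9 of CPUID leaf 1), but stays portable across 32-bit and 64-bit GCC targets.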
> For ASM code of 'bk_write' and 'table_48_bytes_bwd', here is the reason of > "why we use backward copy mode in the patch ", from Ling: All read > operations after allocation stage can run speculatively, all write > operation will run in program order, and if addresses are different read > may run before older write operation, otherwise wait until write commit. > However CPU don't check each address bit, so read could fail to recognize > different address even theyare in different page. For example if rsi is > 0xf004, rdi is 0xe008,in following operation there will generate big > performance latency. 1. movq (%rsi), %rax > 2. movq %rax, (%rdi) > 3. movq 8(%rsi), %rax > 4. movq %rax, 8(%rdi) > > If %rsi and rdi were in really the same meory page, there are TRUE > read-after-write dependence because instruction 2 write 0x008 and > instruction 3 read 0x00c, the two address are overlap partially. Actually > there are in different page and no any issues, but without checking each > address bit CPU could think they are in the same page, and instruction 3 > have to wait for instruction 2 to write data into cache from write buffer, > then load data from cache, the cost time read spent is equal to mfence > instruction. We may avoid it by tuning operation sequence as follow. > 1. movq 8(%rsi), %rax > 2. movq %rax, 8(%rdi) > 3. movq (%rsi), %rax > 4. movq %rax, (%rdi) > Instruction 3 read 0x004, instruction 2 write address 0x010, no any > dependence. In this patch we first handle small size(less 48bytes) by this > way, and the approach help us improve up to 2X on Atom. Thanks for the explanation. Some comments describing why it was done this way in the code for such non-obvious cases would be generally nice. So it is basically a store forwarding aliasing problem which is described in "12.3.3.1 Store Forwarding" section from "Intel® 64 and IA-32 Architectures Optimization Reference Manual", right? Wouldn't just the use of MOVD/MOVSS instructions here also solve this problem? 
Store forwarding does not seem to be used for SIMD according to the manual. I haven't benchmarked anything yet though. Moreover, if you have plans to generalize and extend SSSE3 optimizations to accelerate more stuff later, then something like 'src__' and 'add__' operations would be the really easy ones. But it would be very nice to avoid dumb copy/paste and 3x increase of assembly sources size. For this to go smoothly, preferably you would need to separate the loop code itself from the pixel processing logic. And in order to make pixel processing more simple and consistent, it's better to do all the pixel processing using SIMD instructions (both inner loop and also leading/trailing pixels). Leading and trailing unaligned pixels may be probably handled by using MOVD/MOVSS/MOVHPS or other similar instructions to do partial SSE register load/store operations. It's not like this
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
Hi, Xu. > --- /dev/null > +++ b/pixman/pixman-access-ssse3_x86-64.S .. > +#if (defined(__amd64__) || defined(__x86_64__) ||defined(_M_AMD64)) why _M_AMD64? _M_AMD64 macro is for MSVC++ x64 version. This assembler is for GNU as. Also, if MSVC++ 8.0 (14.00) or later, we should use __cpuid for x64 and x86 compatibility. http://msdn.microsoft.com/ja-jp/library/hskdteyh.aspx -- Makoto On 2010/08/27 11:59, Xu, Samuel wrote: Hi Siarhei Siamashka, Here is a new patch, can you review it? Thank you! It address following suggestions: 1: SSSE3 file is split to a new file. Comparing with to duplicate every content from SSE2 file, I added a way to merge new fast path function pointer table runtimely by CPUID check, to avoid huge code expansion and a bug you pointed out.(Without runtime merging, I have to duplicate every line from SSE2 file to SSSE3 file, it is really not graceful and redundant. 2: Copy right information added 3: Makefile fix 4: author name fix For detect_cpu_features(), we possibly can remove the TWO lines asm code to copy edx to "result", to avoid SSE2 and MMX checking, since most preparing code still need. If we keep this 2 lines, it still give us some possibility to check other edx information of CPUID in the future. It is only executed once when pixman is loaded. For ASM code of 'bk_write' and 'table_48_bytes_bwd', here is the reason of "why we use backward copy mode in the patch ", from Ling: All read operations after allocation stage can run speculatively, all write operation will run in program order, and if addresses are different read may run before older write operation, otherwise wait until write commit. However CPU don't check each address bit, so read could fail to recognize different address even theyare in different page. For example if rsi is 0xf004, rdi is 0xe008,in following operation there will generate big performance latency. 1. movq (%rsi), %rax 2. movq %rax, (%rdi) 3. movq 8(%rsi),%rax 4. 
movq %rax, 8(%rdi) If %rsi and rdi were in really the same meory page, there are TRUE read-after-write dependence because instruction 2 write 0x008 and instruction 3 read 0x00c, the two address are overlap partially. Actually there are in different page and no any issues, but without checking each address bit CPU could think they are in the same page, and instruction 3 have to wait for instruction 2 to write data into cache from write buffer, then load data from cache, the cost time read spent is equal to mfence instruction. We may avoid it by tuning operation sequence as follow. 1. movq 8(%rsi), %rax 2. movq %rax, 8(%rdi) 3. movq (%rsi), %rax 4. movq %rax, (%rdi) Instruction 3 read 0x004, instruction 2 write address 0x010, no any dependence. In this patch we first handle small size(less 48bytes) by this way, and the approach help us improve up to 2X on Atom. Regards, Liu, Xinyun -Original Message- From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com] Sent: Sunday, August 22, 2010 1:49 AM To: Liu, Xinyun Cc: pixman@lists.freedesktop.org; Ma, Ling; Xu, Samuel Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8 On Friday 20 August 2010 18:39:47 Liu, Xinyun wrote: Hi Siarhei Siamashka, Here is a new patch, can you review it? Thank you! Sure, thanks for the updated patch. Some comments follow. From 9783651899a2763d7fcca2960fc354bd1f541980 Mon Sep 17 00:00:00 2001 From: root A minor nitpick here. The 'From:' header seems to be wrong, it does not look like a real patch author name. +dnl Check for SSSE3 + +if test "x$SSSE3_CFLAGS" = "x" ; then + if test "x$SUNCC" = "xyes"; then + # SSSE3 is enabled by default in the Sun Studio 64-bit +environment Is it a valid statement with the regards to Sun Studio? Looks like it is a direct copy/paste from SSE2. Can we be sure that there is nothing redundant or just plain wrong here? 
+#if defined(__GNUC__)&& (__GNUC__< 4 || (__GNUC__ == 4&& +__GNUC_MINOR__< 3)) +# if !defined(__amd64__)&& !defined(__x86_64__) Is this 'defined(__amd64__)' check needed here? Looks like it is again a copy/paste from SSE2 code, but SSE2 is guaranteed to be supported for x86-64 systems. +# error "Need GCC>= 4.3 for SSSE3 intrinsics on x86" +# endif +#endif diff --git a/pixman/pixman-access-ssse3_x86-64.S b/pixman/pixman-access- ssse3_x86-64.S new file mode 100644 index 000..fec93df --- /dev/null +++ b/pixman/pixman-access-ssse3_x86-64.S Copyright notice seems to be still missing in the newly added source files. + cmp %dil, %sil + jb L(bk_write) + add %rdx, %rsi + add %rdx, %rdi + BRANCH_TO_JMPTBL_ENTRY (L(table_48bytes_fwd), %rdx, 4) +L(bk_write): + BRANCH_TO_JMPTBL_ENTRY (L(table_48_bytes_bwd), %rdx, 4) Would it ma
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
On Friday 20 August 2010 18:39:47 Liu, Xinyun wrote: > Hi Siarhei Siamashka, > > Here is a new patch, can you review it? Thank you! Sure, thanks for the updated patch. Some comments follow. > From 9783651899a2763d7fcca2960fc354bd1f541980 Mon Sep 17 00:00:00 2001 > From: root A minor nitpick here. The 'From:' header seems to be wrong, it does not look like a real patch author name. > +dnl Check for SSSE3 > + > +if test "x$SSSE3_CFLAGS" = "x" ; then > + if test "x$SUNCC" = "xyes"; then > + # SSSE3 is enabled by default in the Sun Studio 64-bit environment Is it a valid statement with the regards to Sun Studio? Looks like it is a direct copy/paste from SSE2. Can we be sure that there is nothing redundant or just plain wrong here? > +#if defined(__GNUC__) && (__GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 3)) > +# if !defined(__amd64__) && !defined(__x86_64__) Is this 'defined(__amd64__)' check needed here? Looks like it is again a copy/paste from SSE2 code, but SSE2 is guaranteed to be supported for x86-64 systems. > +# error "Need GCC >= 4.3 for SSSE3 intrinsics on x86" > +# endif > +#endif > diff --git a/pixman/pixman-access-ssse3_x86-64.S b/pixman/pixman-access- ssse3_x86-64.S > new file mode 100644 > index 000..fec93df > --- /dev/null > +++ b/pixman/pixman-access-ssse3_x86-64.S Copyright notice seems to be still missing in the newly added source files. > + cmp %dil, %sil > + jb L(bk_write) > + add %rdx, %rsi > + add %rdx, %rdi > + BRANCH_TO_JMPTBL_ENTRY (L(table_48bytes_fwd), %rdx, 4) > +L(bk_write): > + BRANCH_TO_JMPTBL_ENTRY (L(table_48_bytes_bwd), %rdx, 4) Would it make sense to also purge this 'bk_write' and 'table_48_bytes_bwd' stuff? > diff --git a/pixman/pixman-cpu.c b/pixman/pixman-cpu.c > index e4fb1e4..ab2d69a 100644 [...] 
> +#else /* end dt_cpu32() */ > + > +static unsigned int > +detect_cpu_features (void) > +{ > +unsigned int features = 0; > +unsigned int result = 0; > +unsigned int result_c = 0; This looks like a duplication of the code to make a new 64-bit version of 'detect_cpu_features' function. Could we assume that all 64-bit processors support CPUID instruction (so that the check for CPUID support can be omitted), and also support MMX and SSE2? In this case SSSE3 check will be just a tiny block of inline assembly using CPUID instruction. > diff --git a/pixman/pixman-sse2.c b/pixman/pixman-sse2.c > index 3dd7967..0cfd554 100644 > --- a/pixman/pixman-sse2.c > +++ b/pixman/pixman-sse2.c > @@ -3520,6 +3520,56 @@ sse2_composite_over__n_ (pixman_implementation_t *imp, > /*- > * composite_over__n_ > */ > +#if defined(USE_SSSE3) && !defined(_MSC_VER) > + > +extern void *composite_src_x888__ssse3_fast_path( uint32_t * dest, uint32_t *src, int32_t count); > +extern pixman_bool_t choose_ssse3_fast_patch; > +static void > +ssse3_composite_src_x888_ (pixman_implementation_t *imp, The addition of 'ssse3_composite_src_x888_' into 'pixman-sse2.c' does not look like a good idea. Adding a new 'pixman-ssse3.c' file for SSSE3 optimizations would have been a better choice. Especially considering the following part of code: > +#if defined(USE_SSSE3) && !defined(_MSC_VER) > +PIXMAN_STD_FAST_PATH (SRC, x8r8g8b8, null, a8r8g8b8, ssse3_composite_src_x888_), > +PIXMAN_STD_FAST_PATH (SRC, x8b8g8r8, null, a8b8g8r8, ssse3_composite_src_x888_), > +#else > PIXMAN_STD_FAST_PATH (SRC, x8r8g8b8, null, a8r8g8b8, sse2_composite_src_x888_), > PIXMAN_STD_FAST_PATH (SRC, x8b8g8r8, null, a8b8g8r8, sse2_composite_src_x888_), > +#endif By using "#if defined(USE_SSSE3)" here, you effectively disable 'sse2_composite_src_x888_' fast path for the processors which do not have SSSE3 support. 
If the runtime detection of SSSE3 fails, the code falls back to the C implementation and we have no chance to benefit from the SSE2 optimizations. -- Best regards, Siarhei Siamashka signature.asc Description: This is a digitally signed message part. ___ Pixman mailing list Pixman@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/pixman
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
On Friday 20 August 2010 19:36:07 Xu, Samuel wrote: > We measured performance, and compared with original SSE2 intrinsic enabled > version(0.19.4), on ATOM, and get following findings using 480P flash > H.264 video playing workload: > 1) sse2_composite_src_x888_()'s cycle reduced 67%. This function's total > cycle ratio over whole system reduced from 5.6% to 1.9% This is not directly related to your pixman patch, but it looks like the right place to fix the performance problem in your flash use case is the YUV->RGB conversion. Setting the alpha channel to 0xFF there would be the most efficient approach, and the src_x888_ operation could be eliminated entirely. That said, improving pixman performance in general is still welcome and is a nice thing to have (for the other common use cases at least). I see that you dropped the ssse3 x8r8g8b8 fetcher optimization in your last patch and added an ssse3 src_x888_ fast path instead. It's a bit sad, because other operations like over__x888 with bilinear scaling could also benefit from the optimized fetcher, for example. On the other hand, the lack of clarity regarding how to add SIMD optimized fetchers is avoided this way ;) > 2) whole system's C0 percentage reduced from 68.0% to 62.6% > Maybe it is not " dramatically", while we are glad to see those gain on > both perf and power. A performance gain in the 4-5% ballpark looks like a major improvement to me. -- Best regards, Siarhei Siamashka
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
Hi, Soeren Sandmann and Siarhei Siamashka: This patch is still for #20709. We are already aware that some SSE2 intrinsics in a recent commit try to address a similar performance issue. However, this intrinsic optimization for x8r8g8b8 ops is not good enough, and is sometimes even worse, on ATOM. In this patch, we provide SSSE3-optimized code which co-exists with the current SSE2 intrinsics and takes effect when the expected SSSE3 CPUID bit is detected. Considering Siarhei Siamashka's suggestions, here are some enhancements compared with the previous patch: 1) 32-bit and 64-bit assembly code of a highly optimized memory move + logical OR, using SSSE3 2) SSSE3 runtime detection 3) The executable-stack issue is avoided 4) asm code made shorter (almost half the size of the previous patch) We measured performance, compared with the original SSE2 intrinsic enabled version (0.19.4), on ATOM, and got the following findings using a 480P flash H.264 video playing workload: 1) sse2_composite_src_x888_()'s cycles reduced 67%. This function's total cycle ratio over the whole system reduced from 5.6% to 1.9% 2) whole system's C0 percentage reduced from 68.0% to 62.6% Maybe it is not "dramatic", but we are glad to see those gains in both performance and power. BTW, we built and ran make check on the following 3 systems: 1: 32-bit ATOM system, with SSSE3 2: 64-bit ATOM system, with SSSE3 3: 64-bit system, without SSSE3 Thanks! Samuel -Original Message- From: Liu, Xinyun [mailto:xinyun...@gmail.com] Sent: Friday, August 20, 2010 11:40 PM To: Siarhei Siamashka; pixman@lists.freedesktop.org Cc: Ma, Ling; Xu, Samuel Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8 Hi Siarhei Siamashka, Here is a new patch, can you review it? Thank you! With this patch, oprofile shows that performance is increased dramatically on Atom. Samuel and Ling will provide detailed data. Regards, Liu, Xinyun
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
Hi Siarhei Siamashka, Here is a new patch, can you review it? Thank you! With this patch, oprofile shows that performance is increased dramatically on Atom. Samuel and Ling will provide detailed data. Regards, Liu, Xinyun 0001-Add-ssse3_composite_src_x888_.patch Description: Binary data
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
On Tuesday 17 August 2010 10:52:43 Xu, Samuel wrote: > We'd like to provide a new patch with following enhancement soon: > 1) Add 64 bit asm code specifically for 64 bit, which will co-exist with 32 > bit version > 2) CPUID dynamic check in pixman-cpu.c and pixman-access.c > 3) Makefile fixing > 4) Stack executable fixing in both 32/64 asm files. > 5) follow your comments to make asm code shorter Sounds good. The way how these optimizations get hooked into 'pixman-access.c' may need some extra tweaks in the end (they need to be approved by Soeren Sandmann), but the rest of the changes seem quite reasonable to me. I still wonder, would a common unified 32/64 bit SSSE3 assembly source be possible? Calling conventions are different, but the rest of code should be more or less the same. > Any missing? Also a proper copyright notice in the beginning of new files is needed. The text of the current recommended canonical pixman license can be found here: http://cgit.freedesktop.org/pixman/tree/COPYING > Two more questions: > "Maybe the intention was good, but the end result just fails to build > on my 64-bit Intel Atom netbook. And pixman "make check" dies with > "Illegal instruction" on a 32-bit Intel Pentium-M laptop. So it's a clear > indication that you did something wrong and the patch is unacceptable as > is." > > Q1: Could you past the 64-bit ATOM build failure? Is it caused by > ./configure to generate make file, or build error, or link error? It's just a normal x86-64 build failure related to pixman-access-ssse3.S: $ git clone git://anongit.freedesktop.org/pixman $ cd pixman $ patch -p1 < first_revision_of_ssse3_patch.diff $ ./autogen.sh [...] checking whether to use SSSE3 intrinsics... yes [...] $ make [...] 
CPPAS pixman-access-ssse3.lo pixman-access-ssse3.S: Assembler messages: pixman-access-ssse3.S:118: Error: suffix or operands invalid for `push' pixman-access-ssse3.S:118: Error: bad register expression pixman-access-ssse3.S:118: Error: register save offset not a multiple of 8 pixman-access-ssse3.S:130: Error: suffix or operands invalid for `jmp' pixman-access-ssse3.S:132: Error: suffix or operands invalid for `jmp' pixman-access-ssse3.S:138: Error: suffix or operands invalid for `push' pixman-access-ssse3.S:138: Error: bad register expression pixman-access-ssse3.S:143: Error: suffix or operands invalid for `push' pixman-access-ssse3.S:143: Error: bad register expression [...] > Q2: > Could you show us how to do "make check"? Same as above, but also run 'make check' after 'make'. To be sure that the tests are working properly and the new code is actually used, you may deliberately add a bug into your new code for testing purposes and verify that 'make check' detects it. It may also be convenient to configure pixman with the '--disable-shared' option to make debugging easier. -- Best regards, Siarhei Siamashka
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
Hi Siarhei Siamashka: We'd like to provide a new patch with the following enhancements soon: 1) Add 64 bit asm code specifically for 64 bit, which will co-exist with 32 bit version 2) CPUID dynamic check in pixman-cpu.c and pixman-access.c 3) Makefile fixing 4) Stack executable fixing in both 32/64 asm files. 5) follow your comments to make asm code shorter Any missing? Two more questions: "Maybe the intention was good, but the end result just fails to build on my 64-bit Intel Atom netbook. And pixman "make check" dies with "Illegal instruction" on a 32-bit Intel Pentium-M laptop. So it's a clear indication that you did something wrong and the patch is unacceptable as is." Q1: Could you paste the 64-bit ATOM build failure? Is it caused by ./configure when generating the makefile, or is it a build or link error? Q2: Could you show us how to do "make check"? Samuel -Original Message- From: Ma, Ling Sent: Tuesday, August 17, 2010 3:22 PM To: Siarhei Siamashka; Xu, Samuel Cc: pixman@lists.freedesktop.org; Liu, Xinyun Subject: RE: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8 Hi Siarhei Siamashka > I did not question the fact that PALIGNR instruction must be useful > for something. There must have been some reason why it was introduced > :) But real benchmark numbers are always interesting, like the ones > you provided for > SSSE3 against original C code comparison. Just to be sure that these > more than 1000 lines of code are really justified (at least for this > particular use). We attached a data report about movups + movaps vs. movaps + palignr. The report shows movaps + palignr is better because crossing a cacheline causes big latency on Atom and Core2. > As I already mentioned, the code in 'pixman-access-ssse3.S' is highly > redundant just for 'fetch_scanline_x8r8g8b8' implementation. Due to > processing 32-bit data, there are only 4 possible cases of relative > source/destination alignment, but the whole set of 16 alignment cases > is implemented.
In addition to other benefits (like improved sources > readability), reducing code size is good for saving space in the CPU > instruction cache and this typically improves performance. The original patch can tolerate any type of offset for src and dest. Yes, I agree: if the offsets of src and dest are guaranteed to be 4-byte aligned, your suggestion will reduce code size significantly. For the other questions, Samuel or Xinyun will provide further answers. Best Regards Ling > > 2) For #ifdefs vs. dynamically checking SSSE3, the patch exactly > > mimics "SSE2" support in currently pixman code(#ifdefs), which > > checks the build system when makefile generating and applies > > USE_SSSE3 into later building. > > Maybe the intention was good, but the end result just fails to build > on my 64-bit Intel Atom netbook. And pixman "make check" dies with > "Illegal instruction" on a 32-bit Intel Pentium-M laptop. So it's a > clear indication that you did something wrong and the patch is unacceptable > as is. > > > Of course prefix of CPUID can achieve better compatibility. It is ok > > for us to add this kind of CPUID prefix. I will appreciate directly > > comments on shape of this patch, using SSSE3, e.g. some code sample > > of expected pixman-cpu.c and > pixman-access.c. > > SSSE3 is not much different from MMX and SSE2 and can be added in a > similar way to pixman-cpu.c > > Regarding SIMD optimized fetchers. Pixman has none at the moment, so > there is no code to be used as a reference to see how it should be > done "right" (and that's partially the reason why bug#20709 is still > unresolved). > There was an older discussion about SIMD fetchers: > http://comments.gmane.org/gmane.comp.lib.cairo/18342 > > > 3) For executable stack caused by assemble file, it should be able > > to solved by adding ".section .note.GNU-stack .previous", according > > https://www.redhat.com/archives/fedora-devel-list/2005-March/msg00460. > > html .
In fact, same solution already exist in current > > pixman-arm-neon-asm.S inside pixman code. > > Yes sure. I did not say that it can't be solved :) It's just better to > address this particular problem in the next revision of ssse3 patch. > > -- > Best regards, > Siarhei Siamashka
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
On Monday 16 August 2010 11:24:44 Xu, Samuel wrote: > Thanks for kindly comments! It is very nice that bug#20709 also emphasize > similar performance issue. Well, having bugs unresolved for such a long time is not so nice. Also see some more comments below. > So, let's discuss how to make this patch > better. Here is some answers from me, and Ma Ling will update comments on > 32/64 bit question and assemble details : > > 1) For MOVAPS+PALIGNR from SSSE3, referring to chapter 12 of the Intel 64 and IA-32 Architectures Optimization Reference Manual at > http://www.intel.com/Assets/PDF/manual/248966.pdf, we can know > MOVAPS+PALIGNR is preferred in ATOM: "Assembly/Compiler Coding Rule 13. > (MH impact, M generality) For Intel Atom processors, prefer a sequence > MOVAPS+PALIGN over MOVUPS. Similarly, MOVDQA+PALIGNR is preferred over > MOVDQU." As far as I know, this code pattern is friendly for ATOM and > won't harm Core-i7. I did not question the fact that PALIGNR instruction must be useful for something. There must have been some reason why it was introduced :) But real benchmark numbers are always interesting, like the ones you provided for SSSE3 against original C code comparison. Just to be sure that these more than 1000 lines of code are really justified (at least for this particular use). As I already mentioned, the code in 'pixman-access-ssse3.S' is highly redundant just for 'fetch_scanline_x8r8g8b8' implementation. Due to processing 32-bit data, there are only 4 possible cases of relative source/destination alignment, but the whole set of 16 alignment cases is implemented. In addition to other benefits (like improved sources readability), reducing code size is good for saving space in the CPU instruction cache and this typically improves performance. > 2) For #ifdefs vs. dynamically checking SSSE3, the > patch exactly mimics "SSE2" support in currently pixman code(#ifdefs), > which checks the build system when makefile generating and applies > USE_SSSE3 into later building.
Maybe the intention was good, but the end result just fails to build on my 64-bit Intel Atom netbook. And pixman "make check" dies with "Illegal instruction" on a 32-bit Intel Pentium-M laptop. So it's a clear indication that you did something wrong and the patch is unacceptable as is. > Of course prefix of CPUID can achieve > better compatibility. It is ok for us to add this kind of CPUID prefix. I > will appreciate directly comments on shape of this patch, using SSSE3, > e.g. some code sample of expected pixman-cpu.c and pixman-access.c. SSSE3 is not much different from MMX and SSE2 and can be added in a similar way to pixman-cpu.c Regarding SIMD optimized fetchers. Pixman has none at the moment, so there is no code to be used as a reference to see how it should be done "right" (and that's partially the reason why bug#20709 is still unresolved). There was an older discussion about SIMD fetchers: http://comments.gmane.org/gmane.comp.lib.cairo/18342 > 3) For executable stack caused by assemble file, it should be able to solved > by adding ".section .note.GNU-stack .previous", according > https://www.redhat.com/archives/fedora-devel-list/2005-March/msg00460.html > . In fact, same solution already exist in current pixman-arm-neon-asm.S > inside pixman code. Yes sure. I did not say that it can't be solved :) It's just better to address this particular problem in the next revision of ssse3 patch. -- Best regards, Siarhei Siamashka
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
Hi, Siamashka: Thanks for the kind comments! It is very nice that bug#20709 also emphasizes a similar performance issue. So, let's discuss how to make this patch better. Here are some answers from me, and Ma Ling will update comments on the 32/64-bit question and assembly details: 1) For MOVAPS+PALIGNR from SSSE3, referring to chapter 12 of the Intel 64 and IA-32 Architectures Optimization Reference Manual at http://www.intel.com/Assets/PDF/manual/248966.pdf, we can know MOVAPS+PALIGNR is preferred on ATOM: "Assembly/Compiler Coding Rule 13. (MH impact, M generality) For Intel Atom processors, prefer a sequence MOVAPS+PALIGN over MOVUPS. Similarly, MOVDQA+PALIGNR is preferred over MOVDQU." As far as I know, this code pattern is friendly for ATOM and won't harm Core-i7. 2) For #ifdefs vs. dynamically checking SSSE3, the patch exactly mimics the "SSE2" support in current pixman code (#ifdefs), which checks the build system when generating the makefile and applies USE_SSSE3 to the later build. Of course a CPUID check can achieve better compatibility. It is OK for us to add this kind of CPUID check. I would appreciate direct comments on the shape of this patch, using SSSE3, e.g. some code sample of the expected pixman-cpu.c and pixman-access.c. 3) For the executable stack caused by the assembly file, it should be able to be solved by adding ".section .note.GNU-stack .previous", according to https://www.redhat.com/archives/fedora-devel-list/2005-March/msg00460.html. In fact, the same solution already exists in the current pixman-arm-neon-asm.S inside pixman code. Thanks! Samuel -Original Message- From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com] Sent: Saturday, August 14, 2010 1:54 AM To: pixman@lists.freedesktop.org Cc: Liu, Xinyun; Ma, Ling; Xu, Samuel Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8 On Wednesday 11 August 2010 09:00:54 Liu Xinyun wrote: > Hi, > > pixman-access.c: fetch_scanline_x8r8g8b8() is mainly memcpy and 'or' > operations. With ssse3_memcpy, the performance is increased a little.
> > Reference: http://bugs.meego.com/show_bug.cgi?id=5012 > > Quote: > > After optimization, we can find following speed up: > > 1.fetch_scanline_x8r8g8b8()'s cycles is reduced 84% during clips run. > > It now consumes 0.8% of whole system, comparing with 2.3% w/o > > optimization. 2.system C0 is reduced around 1.5% > > 3.Global CPU utilization is reduced around 1.5% of whole system > > Here is the patch, please review it, thanks! Thanks for the patch. This performance problem has been known for some time already and there is a bug about it (with some useful comments): https://bugs.freedesktop.org/show_bug.cgi?id=20709 Out of curiosity, how much of the performance gain is provided by using MOVAPS+PALIGNR from SSSE3 over just a plain SSE2 implementation via MOVAPS+MOVUPS on Intel Atom? Is it also beneficial on other processors supporting SSSE3 such as Intel Core i7? That said, I surely appreciate the intent of getting the absolutely best performance whenever it is possible. I also tried to find the best possible memory access pattern myself for ARM NEON on OMAP3430 SoC, and then generalize the code to support many types of different compositing operations (in 'pixman-arm-neon-asm.*'). Now back to your patch. It is not a complete review, I just wanted to comment on a few things which stick out at the moment. As I see, the biggest problems in it are: 1. This patch introduces a pixman build failure on 64-bit systems, because 'pixman-access-ssse3.S' is not 64-bit aware and makes a lot of 32-bit assumptions. 2. Runtime detection of SSSE3 should be used instead of #ifdefs in the code. Right now, applying this patch will just make the default pixman build unusable on the processors which do not support SSSE3. 3. Linking 'pixman-access-ssse3.o' object file with the others enforces the use of executable stack, which is a bad thing. More details can be found here for example: http://www.gentoo.org/proj/en/hardened/gnu-stack.xml Less important things spotted by reading the code are below.
> From 30514afe4d0af3a476ec8dcea494f7e216f0fc9d Mon Sep 17 00:00:00 2001 > From: Liu Xinyun > Date: Wed, 11 Aug 2010 16:31:47 +0800 > Subject: [PATCH] Optimize fetch_scanline_x8r8g8b8 with SSSE3 > instruction > > Signed-off-by: Liu Xinyun > Signed-off-by: Xu, Samuel > Signed-off-by: Ma, ling > --- > configure.ac | 56 +++ > pixman/Makefile.am | 14 + > pixman/pixman-access-ssse3.S | 1119 > ++ pixman/pixman-access.c | > 6 + > 4 files changed, 1195 insertions(+), 0 deletions(-) create mode > 100644 pixman/pixman-access-ssse3.S [...] > + .section .text.ssse3,"ax",@progbits > +ENTRY (MEMCPY_OR) > + ENTRANCE > + movlLEN(%esp)
Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
On Wednesday 11 August 2010 09:00:54 Liu Xinyun wrote: > Hi, > > pixman-access.c: fetch_scanline_x8r8g8b8() is mainly memcpy and 'or' > operations. With ssse3_memcpy, the performance is increased a little. > > Reference: http://bugs.meego.com/show_bug.cgi?id=5012 > > Quote: > > After optimization, we can find following speed up: > > 1.fetch_scanline_x8r8g8b8()'s cycles is reduced 84% during clips run. > > It now consumes 0.8% of whole system, comparing with 2.3% w/o > > optimization. 2.system C0 is reduced around 1.5% > > 3.Global CPU utilization is reduced around 1.5% of whole system > > Here is the patch, please review it, thanks! Thanks for the patch. This performance problem has been known for some time already and there is a bug about it (with some useful comments): https://bugs.freedesktop.org/show_bug.cgi?id=20709 Out of curiosity, how much of the performance gain is provided by using MOVAPS+PALIGNR from SSSE3 over just a plain SSE2 implementation via MOVUPS on Intel Atom? Is it also beneficial on other processors supporting SSSE3 such as Intel Core i7? That said, I surely appreciate the intent of getting the absolutely best performance whenever it is possible. I also tried to find the best possible memory access pattern myself for ARM NEON on OMAP3430 SoC, and then generalize the code to support many types of different compositing operations (in 'pixman-arm-neon-asm.*'). Now back to your patch. It is not a complete review, I just wanted to comment on a few things which stick out at the moment. As I see, the biggest problems in it are: 1. This patch introduces a pixman build failure on 64-bit systems, because 'pixman-access-ssse3.S' is not 64-bit aware and makes a lot of 32-bit assumptions. 2. Runtime detection of SSSE3 should be used instead of #ifdefs in the code. Right now, applying this patch will just make the default pixman build unusable on the processors which do not support SSSE3. 3.
Linking 'pixman-access-ssse3.o' object file with the others enforces the use of executable stack, which is a bad thing. More details can be found here for example: http://www.gentoo.org/proj/en/hardened/gnu-stack.xml Less important things spotted by reading the code are below. > From 30514afe4d0af3a476ec8dcea494f7e216f0fc9d Mon Sep 17 00:00:00 2001 > From: Liu Xinyun > Date: Wed, 11 Aug 2010 16:31:47 +0800 > Subject: [PATCH] Optimize fetch_scanline_x8r8g8b8 with SSSE3 instruction > > Signed-off-by: Liu Xinyun > Signed-off-by: Xu, Samuel > Signed-off-by: Ma, ling > --- > configure.ac | 56 +++ > pixman/Makefile.am | 14 + > pixman/pixman-access-ssse3.S | 1119 > ++ pixman/pixman-access.c | > 6 + > 4 files changed, 1195 insertions(+), 0 deletions(-) > create mode 100644 pixman/pixman-access-ssse3.S [...] > + .section .text.ssse3,"ax",@progbits > +ENTRY (MEMCPY_OR) > + ENTRANCE > + movlLEN(%esp), %ecx > + movlSRC(%esp), %eax > + movlDEST(%esp), %edx > + > + cmp $48, %ecx > + jae L(48bytesormore) > + > + cmp %dl, %al > + jb L(bk_write) > + add %ecx, %edx > + add %ecx, %eax > + BRANCH_TO_JMPTBL_ENTRY (L(table_48bytes_fwd), %ecx, 4) > +L(bk_write): > + BRANCH_TO_JMPTBL_ENTRY (L(table_48_bytes_bwd), %ecx, 4) Backwards copy is not strictly needed for 'fetch_scanline_x8r8g8b8' at the moment. And probably will never be because scanlines get fetched into a temporary buffer. > + > + ALIGN (4) > +/* ECX > 32 and EDX is 4 byte aligned. 
*/ > +L(48bytesormore): > + movdqu (%eax), %xmm0 > + PUSH (%edi) > + mov $0xff00, %edi > + movd %edi, %xmm6 > + movl %edx, %edi > + and $-16, %edx > + PUSH (%esi) > + add $16, %edx > + movl %edi, %esi > + sub %edx, %edi > + add %edi, %ecx > + sub %edi, %eax > + > + mov %esi, %edi > + pshufd $0, %xmm6, %xmm6 > + and $3, %edi > + por %xmm6, %xmm0 > + jz L(aligned4bytes) > + cmp $3, %edi > + psrldq $1, %xmm6 > + jz L(aligned4bytes) > + cmp $2, %edi > + psrldq $1, %xmm6 > + jz L(aligned4bytes) > + psrldq $1, %xmm6 Is there any way for the value in the EDI register to be not a multiple of 4 here? Both source and destination are guaranteed to be 4-byte aligned because they are pointers to uint32_t. This seems to be a bit redundant here (it looks like a leftover from the memcpy/memmove implementation which was used as a template for this code). The negative impact is a slight runtime overhead (especially for short scanlines) and bigger code size. [...] > L(shl_1): [...] > L(shl_1_loop): [...] > L(shl_2): [...] > L(shl_2_loop): [...] The rest of the code contains a lot of repeatable patterns (the main loop is replicated 16 times for different alignments). IMHO they could be simplified a lot by using macros, making the code si