Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-12-02 Thread Liu Xinyun
I got a similar performance result about one month ago.

The gvim test shows a great performance increase, about 360%. The other tests
show no side effects, and no improvement either.

I tested it on a 32-bit userland. I will test it again based on the git code and
post the data later.

Regards,
Xinyun

On Fri, Dec 03, 2010 at 12:00:22AM +0800, Siarhei Siamashka wrote:
 
> Just did some benchmarks with the 'gnome-system-monitor' cairo perf trace
> which happens to use 'src_x888_8888' operation. It looks like SSSE3 has
> about the same performance as SSE2, providing no measurable benefits on
> this use case. The low complexity of this operation and the hardware
> prefetcher make any differences between implementations much less
> noticeable in general.
> 
> Intel Atom N450, 64-bit userland, gcc 4.5.1
> 
> $ CAIRO_TEST_TARGET=image ./cairo-perf-trace gnome-system-monitor.trace 
> 
> === C slow path ===
> 
> [ # ]  backend test   min(s) median(s) stddev. count
> [ # ]image: pixman 0.21.3
> [  0]image gnome-system-monitor   15.011   15.034   0.08%6/6
> 
> === C fast path [1] ===
> 
> [ # ]  backend test   min(s) median(s) stddev. count
> [ # ]image: pixman 0.21.3
> [  0]image gnome-system-monitor   14.659   14.697   0.20%6/6
> 
> === SSE2 fast path [2] ===
> 
> [ # ]  backend test   min(s) median(s) stddev. count
> [ # ]image: pixman 0.21.3
> [  0]image gnome-system-monitor   14.431   14.496   0.19%6/6
> 
> === SSSE3 fast path [3] ===
> 
> [ # ]  backend test   min(s) median(s) stddev. count
> [ # ]image: pixman 0.21.3
> [  0]image gnome-system-monitor   14.455   14.496   0.17%6/6
> 
> === artificial test with just an empty stub for src_x888_8888 ===
> 
> [ # ]  backend test   min(s) median(s) stddev. count
> [ # ]image: pixman 0.21.3
> [  0]image gnome-system-monitor   12.215   12.241   0.11%6/6
> 
> ---
> 
> So I'm not sure if this SSSE3 code is very useful as is. But of course, if some
> practical use case makes heavy use of this function so that it works on the
> data residing in L1/L2 caches, then it could show a big improvement.
> 
> 
> 1. http://cgit.freedesktop.org/pixman/commit/?id=16bae834
> 2. http://cgit.freedesktop.org/pixman/commit/?id=e0b430a1
> 3. http://lists.freedesktop.org/archives/pixman/2010-November/000742.html
> 
> -- 
> Best regards,
> Siarhei Siamashka




Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-12-02 Thread Siarhei Siamashka
On Monday 29 November 2010 20:59:52 Siarhei Siamashka wrote:
> On Wednesday 17 November 2010 07:47:39 Xu, Samuel wrote:
> > For MOVD, we simplified the backward copy code since the previous code is too
> > long and did not gain obvious performance,
> 
> And this is what I'm worried about. First you proposed this part of code
> (backwards copy) without providing any initial explanation or benchmark
> numbers. And now you abandon it, also without providing any benchmark
> numbers or description of your test. But because you still have x86
> instructions there instead of MOVD, store forwarding aliasing surely may
> affect performance, as demonstrated by my simple test program posted
> earlier [3] as part of the discussion.
> 
> Can we be sure that this SSSE3 patch actually provides any practical
> performance improvement over the current pixman code? Considering that,
> compared to your initial profiling data, SSE2 optimized code is now used
> for this operation, and the problematic software prefetch which was almost
> halving [4] the effective memory bandwidth for this operation is now
> eliminated [5].

> Surely, the 'test/lowlevel-blt-bench' microbenchmark showed a good
> improvement for L1 cached data. But on the other hand, the performance of small
> random operations has dropped (shown in the 'R' statistics). Maybe that's
> happening because of the scanline function call overhead and the fwd/bwd
> special handling for small buffer sizes? So a final verification with some
> real practical use case would be interesting.
>
> [...]
>
> 1. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00357.html
> 2. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00378.html
> 3. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00415.html
> 4. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00404.html
> 5. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00563.html

Just did some benchmarks with the 'gnome-system-monitor' cairo perf trace
which happens to use 'src_x888_8888' operation. It looks like SSSE3 has
about the same performance as SSE2, providing no measurable benefits on
this use case. The low complexity of this operation and the hardware
prefetcher make any differences between implementations much less
noticeable in general.

Intel Atom N450, 64-bit userland, gcc 4.5.1

$ CAIRO_TEST_TARGET=image ./cairo-perf-trace gnome-system-monitor.trace 

=== C slow path ===

[ # ]  backend test   min(s) median(s) stddev. count
[ # ]image: pixman 0.21.3
[  0]image gnome-system-monitor   15.011   15.034   0.08%6/6

=== C fast path [1] ===

[ # ]  backend test   min(s) median(s) stddev. count
[ # ]image: pixman 0.21.3
[  0]image gnome-system-monitor   14.659   14.697   0.20%6/6

=== SSE2 fast path [2] ===

[ # ]  backend test   min(s) median(s) stddev. count
[ # ]image: pixman 0.21.3
[  0]image gnome-system-monitor   14.431   14.496   0.19%6/6

=== SSSE3 fast path [3] ===

[ # ]  backend test   min(s) median(s) stddev. count
[ # ]image: pixman 0.21.3
[  0]image gnome-system-monitor   14.455   14.496   0.17%6/6

=== artificial test with just an empty stub for src_x888_8888 ===

[ # ]  backend test   min(s) median(s) stddev. count
[ # ]image: pixman 0.21.3
[  0]image gnome-system-monitor   12.215   12.241   0.11%6/6

---

So I'm not sure if this SSSE3 code is very useful as is. But of course, if some
practical use case makes heavy use of this function so that it works on the
data residing in L1/L2 caches, then it could show a big improvement.


1. http://cgit.freedesktop.org/pixman/commit/?id=16bae834
2. http://cgit.freedesktop.org/pixman/commit/?id=e0b430a1
3. http://lists.freedesktop.org/archives/pixman/2010-November/000742.html

-- 
Best regards,
Siarhei Siamashka




Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-11-29 Thread Siarhei Siamashka
On Wednesday 17 November 2010 07:47:39 Xu, Samuel wrote:
> Hi, Soeren Sandmann and Siarhei Siamashka:
>   Glad to send out this refreshed patch to address points we discussed 
> for this SSSE3 patch.

Thanks. And sorry for a rather late reply.

> In this new patch, we merged the 32 bit and 64 bit asm code
> into unified code (macros are moved to a header file, to make them more
> reusable and easier to read).

Actually I'm a bit concerned about how the code looks now. There are still
similar looking parts in the 32-bit and 64-bit implementations which could be
shared:

+/* the macro definition of shift copy in stage one. 32 bytes.
+ * Two 16-byte XMM registers will be packed into another 16-byte register
+ * by using the palignr instruction. */
+.macro SHL_COPY_STAGE_ONE x
+   movaps  16(%rsi), %xmm2
+   sub $32, %rdx
+   movaps  32(%rsi), %xmm3
+   lea 32(%rsi), %rsi
+   movdqa  %xmm3, %xmm4
+   palignr \x, %xmm2, %xmm3
+   lea 32(%rdi), %rdi
+   palignr \x, %xmm1, %xmm2
+   por %xmm6, %xmm2
+   movaps  %xmm2, -32(%rdi)
+   por %xmm6, %xmm3
+   movaps  %xmm3, -16(%rdi)
+.endm

and

+/* the macro definition of shift copy in stage one. 32 bytes */
+.macro SHL_COPY_STAGE_ONE x
+   movaps  16(%eax), %xmm2
+   sub $32, %ecx
+   movaps  32(%eax), %xmm3
+   lea 32(%eax), %eax
+   movdqa  %xmm3, %xmm4
+   palignr \x, %xmm2, %xmm3
+   lea 32(%edx), %edx
+   palignr \x, %xmm1, %xmm2
+   por %xmm6, %xmm2
+   movaps  %xmm2, -32(%edx)
+   por %xmm6, %xmm3
+   movaps  %xmm3, -16(%edx)
+.endm

The whole point of using macros is to make the code more maintainable, readable
and easy to extend. Just splitting the code into chunks and putting them into
macros with the primary goal of reducing the total number of lines may actually
have the opposite effect and result in rather obfuscated sources instead.
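
For illustration, one possible shape for sharing them (just a sketch, not a
drop-in patch; it assumes the .S files keep being run through the C
preprocessor, and the SRC_PTR/DST_PTR/COUNT names are mine, mapped to the same
registers the two variants above already use) could be:

#if defined(__amd64__) || defined(__x86_64__)
#define SRC_PTR %rsi    /* source pointer */
#define DST_PTR %rdi    /* destination pointer */
#define COUNT   %rdx    /* remaining byte count */
#else
#define SRC_PTR %eax
#define DST_PTR %edx
#define COUNT   %ecx
#endif

/* the shift copy macro is then written only once for both architectures */
.macro SHL_COPY_STAGE_ONE x
   movaps  16(SRC_PTR), %xmm2
   sub     $32, COUNT
   movaps  32(SRC_PTR), %xmm3
   lea     32(SRC_PTR), SRC_PTR
   movdqa  %xmm3, %xmm4
   palignr \x, %xmm2, %xmm3
   lea     32(DST_PTR), DST_PTR
   palignr \x, %xmm1, %xmm2
   por     %xmm6, %xmm2
   movaps  %xmm2, -32(DST_PTR)
   por     %xmm6, %xmm3
   movaps  %xmm3, -16(DST_PTR)
.endm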

I don't see an obvious and clean way how to extend this code to support
more compositing operations.

After all, the implemented code is only *one* of a huge variety of
compositing operations supported by pixman. And this particular operation
should not even normally be used much by properly written code. Basically, this
is a simple conversion from x8r8g8b8 -> a8r8g8b8 format (by explicitly setting
the alpha channel value to 255, instead of having it undefined as for x8r8g8b8).
If it is done as an intermediate step somewhere in the rendering pipeline, then
it's just a waste of cpu cycles (no matter if it is SSSE3 optimized or not, it
should preferably just be eliminated completely). If some other compositing
function expects a8r8g8b8 input, but is given x8r8g8b8, then that function can
be trivially extended to also support x8r8g8b8, at worst by adding just a
single OR instruction to it. And at best, it may even become faster after this
change (only 3 color components need to be processed instead of 4). I have
already mentioned this earlier regarding your use case [1].
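
For scale, the per-pixel work of such an extension is a single OR with an alpha
mask; a minimal sketch (the register choice here is illustrative, with %xmm6
standing for the preloaded mask as in the patch):

   /* %xmm6 = alpha mask, 0xff000000 replicated into each 32-bit lane */
   movd    (%rsi), %xmm0    /* load one x8r8g8b8 pixel */
   por     %xmm6, %xmm0     /* force the alpha byte to 0xff */
   movd    %xmm0, (%rdi)    /* store it back as a8r8g8b8 */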

Anyway, if the goal is to seriously improve pixman performance on Intel Atom,
then a lot more different operations need to be optimized too. Another simple
operation which is very similar to 'src_x888_8888' and still somewhat
practically useful is 'add_8888_8888', as mentioned earlier in [2]. Your code
does not seem to be ready to support it without major modifications.

I have attached some simple example of providing unified support for both
32-bit and 64-bit systems and implementing multiple operations. Surely, this
example is missing non-executable stack markers, cfi directives and does not
really utilize SIMD to get good performance. But probably something like this
may be used as a starting point? With all the bells and whistles added,
eventually it might start to resemble current pixman ARM NEON code a lot. 

That said, I actually don't want to advocate any particular way of implementing
this stuff (prefer dynamic JIT code generation or assembly macros, AT&T or
Intel syntax, etc). It's just my vision about what features it should 
eventually have. But if your code works correctly, does not add maintenance
headaches (the part responsible for interacting with the rest of pixman must
be clean) and provides some measurable performance improvement, I would be fine
with it too. After all, it could be always safely scrapped and replaced with
something better if needed. But more about this later.

> We also fixed issue like CPU detection/git log/make file checking/file
> name...etc.

That would be nice to have as a separate patch. It has much higher chances
to be accepted and committed first.

> For MOVD, we simplified the backward copy code since the previous code is too
> long and did not gain obvious performance,

And this is what I'm worried about. First you proposed this part of code
(backwards copy) without providing any initial explanation or benchmark
numbers. And now you abandon it, also without providing any benchmark
numbers or description of your test. But because you still have x86
instructions there instead of MOVD, store forwarding aliasing surely may
affect performance, as demonstrated by my simple test program posted
earlier [3] as part of the discussion.

Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-11-19 Thread Soeren Sandmann
"Xu, Samuel"  writes:

> Appreciate comments on this patch

I'll be away for the next three weeks, so I won't be able to review
anything until then. 


Soren


Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-11-18 Thread Xu, Samuel
Appreciate comments on this patch

-Original Message-
From: Xu, Samuel 
Sent: Wednesday, November 17, 2010 1:48 PM
To: sandm...@daimi.au.dk; Siarhei Siamashka; Ma, Ling; Liu, Xinyun; Zhao, 
Yakui; pixman@lists.freedesktop.org
Subject: RE: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Hi, Soeren Sandmann and Siarhei Siamashka:
Glad to send out this refreshed patch to address points we discussed 
for this SSSE3 patch. 
In this new patch, we merged the 32 bit and 64 bit asm code into unified
code (macros are moved to a header file, to make them more reusable and easier
to read). We also fixed issues like CPU detection/git log/makefile
checking/file name, etc. For MOVD, we simplified the backward copy code since
the previous code is too long and did not gain obvious performance.

Thanks!
Samuel

-Original Message-
From: Xu, Samuel
Sent: Thursday, September 09, 2010 12:56 PM
To: 'sandm...@daimi.au.dk'; Siarhei Siamashka; Ma, Ling; Liu, Xinyun
Subject: RE: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Hi, Soeren Sandmann and Siarhei Siamashka:
It is really a long discussion, and thanks for your patience! I believe all of
us have the same target: to make pixman faster and better. Let's continue...

Merging the 32 bit and 64 bit asm code into a unified asm file is possible but
has pros and cons. E.g. I just tried Sun Studio (my first experience with SS)
and found that SS 12 can't expand macros correctly, so it failed to build some
asm files containing macros. To make SS work, we originally wanted to remove
the existing macros, while it seems we should now use more macros.

Anyway, we will go ahead following your 2 further suggestions: 1) MOVD adoption,
2) unifying 32 bit and 64 bit asm code via macros.
So the new patch will be shaped as:
 1. Fix 64 bit CPU detection issue for MMX and SSE2
 2. Add more comments for git commit log
 3. change SSSE3 intrinsics check to SSSE3 asm check in makefile
 4. remove #include "pixman-combine32.h" and composite_over_8888_n_8888 in
    pixman-ssse3.c
 5. ASM files changes:
    1) Unify 32 bit and 64 bit asm code via macro
    2) change asm file name to pixman-ssse3-x86-32-asm.S and
       pixman-ssse3-x86-64-asm.S
    3) change the asm function name to "composite_line_src_x888_8888_ssse3"
    4) remove "defined(_M_AMD64))"
    5) Comments in the assembly about the store forwarding
    6) Use MOVD as per Siarhei Siamashka's idea

Please raise a flag if those changes are incomplete or if Sun Studio's
priority is higher than unifying the 32 bit and 64 bit code.

Thanks!
Samuel


-Original Message-
From: sandm...@daimi.au.dk [mailto:sandm...@daimi.au.dk]
Sent: Thursday, September 09, 2010 10:16 AM
To: Xu, Samuel
Cc: Siarhei Siamashka; pixman@lists.freedesktop.org; Liu, Xinyun; Ma, Ling
Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

"Xu, Samuel"  writes:

> Hi, Soeren Sandmann and Siarhei Siamashka:
> 
> As a wrap-up of the current discussion, combining both of your comments, can
> we assume this new patch of SSSE3 is OK?
> New patch might contain:
> 1. Fix 64 bit CPU detection issue for MMX and SSE2
> 2. Add more comments for git commit log
> 3. change SSSE3 intrinsics check to SSSE3 asm check in makefile
> 4. remove #include "pixman-combine32.h" and composite_over_8888_n_8888 in
>    pixman-ssse3.c
> 5. ASM files changes:
>   1) change asm file name to pixman-ssse3-x86-32-asm.S and
>      pixman-ssse3-x86-64-asm.S
>   2) change the asm function name to "composite_line_src_x888_8888_ssse3"
>   3) remove "defined(_M_AMD64))"
>   4) Comments in the assembly about the store forwarding

Such changes will certainly improve the patch.

> We know there are some discussions on the asm code, e.g. MOVD, unifying 32 bit
> and 64 bit code. But it won't introduce real defects. We can still
> put further changes in the next wave, to avoid the current sticking point.

No, we are not going to apply any patches that we know will need to be fixed 
later, because I don't believe anyone is going to do that work.

It's fine to send a patch series if you think it's easier that way.

I am also interested in the answer to the question of whether this code was 
generated by passing intrinsics through the Intel compiler.

> I am not sure about the issue with Sun Studio. Sun Studio declares GNU asm
> compatibility, but I am not sure whether it is 100% compatible. If the issue
> is caused by Sun Studio itself, can we add an #ifdef to exclude the
> SSSE3 patch for Sun Studio? In that case, how do we detect Sun Studio?

There are various approaches you can take with respect to unusual compilers 
such as Sun Studio. The most common approach is to only enable the new 
optimizations for the compilers that you are personally interested in, and 
leave other compilers to people who are interested.

However, *you* introduced all this Sun Studio handling for SSSE3, which means
*you* need to be able to explain what it does, and justify why it is there.

Or to put it even more bluntly: Do not cut and paste code you don't understand
and can't test, and then ask me to rewrite the code for you until it is good
enough to go in.

Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-11-16 Thread Xu, Samuel
Hi, Soeren Sandmann and Siarhei Siamashka:
Glad to send out this refreshed patch to address points we discussed 
for this SSSE3 patch. 
In this new patch, we merged the 32 bit and 64 bit asm code into unified
code (macros are moved to a header file, to make them more reusable and easier
to read). We also fixed issues like CPU detection/git log/makefile
checking/file name, etc. For MOVD, we simplified the backward copy code since
the previous code is too long and did not gain obvious performance.

Thanks!
Samuel

-Original Message-
From: Xu, Samuel
Sent: Thursday, September 09, 2010 12:56 PM
To: 'sandm...@daimi.au.dk'; Siarhei Siamashka; Ma, Ling; Liu, Xinyun
Subject: RE: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Hi, Soeren Sandmann and Siarhei Siamashka:
It is really a long discussion, and thanks for your patience! I believe all of
us have the same target: to make pixman faster and better. Let's continue...

Merging the 32 bit and 64 bit asm code into a unified asm file is possible but
has pros and cons. E.g. I just tried Sun Studio (my first experience with SS)
and found that SS 12 can't expand macros correctly, so it failed to build some
asm files containing macros. To make SS work, we originally wanted to remove
the existing macros, while it seems we should now use more macros.

Anyway, we will go ahead following your 2 further suggestions: 1) MOVD adoption,
2) unifying 32 bit and 64 bit asm code via macros.
So the new patch will be shaped as:
 1. Fix 64 bit CPU detection issue for MMX and SSE2
 2. Add more comments for git commit log
 3. change SSSE3 intrinsics check to SSSE3 asm check in makefile
 4. remove #include "pixman-combine32.h" and composite_over_8888_n_8888 in
    pixman-ssse3.c
 5. ASM files changes:
    1) Unify 32 bit and 64 bit asm code via macro
    2) change asm file name to pixman-ssse3-x86-32-asm.S and
       pixman-ssse3-x86-64-asm.S
    3) change the asm function name to "composite_line_src_x888_8888_ssse3"
    4) remove "defined(_M_AMD64))"
    5) Comments in the assembly about the store forwarding
    6) Use MOVD as per Siarhei Siamashka's idea

Please raise a flag if those changes are incomplete or if Sun Studio's
priority is higher than unifying the 32 bit and 64 bit code.

Thanks!
Samuel


-Original Message-
From: sandm...@daimi.au.dk [mailto:sandm...@daimi.au.dk]
Sent: Thursday, September 09, 2010 10:16 AM
To: Xu, Samuel
Cc: Siarhei Siamashka; pixman@lists.freedesktop.org; Liu, Xinyun; Ma, Ling
Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

"Xu, Samuel"  writes:

> Hi, Soeren Sandmann and Siarhei Siamashka:
> 
> As a wrap-up of the current discussion, combining both of your comments, can
> we assume this new patch of SSSE3 is OK?
> New patch might contain:
> 1. Fix 64 bit CPU detection issue for MMX and SSE2
> 2. Add more comments for git commit log
> 3. change SSSE3 intrinsics check to SSSE3 asm check in makefile
> 4. remove #include "pixman-combine32.h" and composite_over_8888_n_8888 in
>    pixman-ssse3.c
> 5. ASM files changes:
>   1) change asm file name to pixman-ssse3-x86-32-asm.S and
>      pixman-ssse3-x86-64-asm.S
>   2) change the asm function name to "composite_line_src_x888_8888_ssse3"
>   3) remove "defined(_M_AMD64))"
>   4) Comments in the assembly about the store forwarding

Such changes will certainly improve the patch.

> We know there are some discussions on the asm code, e.g. MOVD, unifying 32 bit
> and 64 bit code. But it won't introduce real defects. We can still
> put further changes in the next wave, to avoid the current sticking point.

No, we are not going to apply any patches that we know will need to be fixed 
later, because I don't believe anyone is going to do that work.

It's fine to send a patch series if you think it's easier that way.

I am also interested in the answer to the question of whether this code was 
generated by passing intrinsics through the Intel compiler.

> I am not sure about the issue with Sun Studio. Sun Studio declares GNU asm
> compatibility, but I am not sure whether it is 100% compatible. If the issue
> is caused by Sun Studio itself, can we add an #ifdef to exclude the
> SSSE3 patch for Sun Studio? In that case, how do we detect Sun Studio?

There are various approaches you can take with respect to unusual compilers 
such as Sun Studio. The most common approach is to only enable the new 
optimizations for the compilers that you are personally interested in, and 
leave other compilers to people who are interested.

However, *you* introduced all this Sun Studio handling for SSSE3, which means 
*you* need to be able to explain what it does, and justify why it is there. 

Or to put it even more bluntly: Do not cut and paste code you don't understand 
and can't test, and then ask me to rewrite the code for you until it is good
enough to go in.

Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-09-08 Thread Ma, Ling
Although the code appends an "or" operation, after much tuning work
I don't see any regression compared with the original gcc memcpy assembly code.

Thanks
Ling

> -Original Message-
> From: Ma, Ling
> Sent: Thursday, September 09, 2010 1:55 PM
> To: 'sandm...@daimi.au.dk'; Xu, Samuel
> Cc: Siarhei Siamashka; pixman@lists.freedesktop.org; Liu, Xinyun
> Subject: RE: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
> 
> Hi Soren
> > I am also interested in the answer to the question of whether this
> > code was generated by passing intrinsics through the Intel compiler.
> 
> I'm the original author of this code; the code is based on our original memcpy
> function, which has been pushed into glibc.
> 
> Thanks
> Ling


Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-09-08 Thread Ma, Ling
Hi Soren
> I am also interested in the answer to the question of whether this
> code was generated by passing intrinsics through the Intel compiler.

I'm the original author of this code; the code is based on our original memcpy
function, which has been pushed into glibc.

Thanks
Ling


Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-09-08 Thread Soeren Sandmann
"Xu, Samuel"  writes:

> Hi, Soeren Sandmann and Siarhei Siamashka:
> 
> As a wrap-up of the current discussion, combining both of your comments, can
> we assume this new patch of SSSE3 is OK?
> New patch might contain:
> 1. Fix 64 bit CPU detection issue for MMX and SSE2
> 2. Add more comments for git commit log
> 3. change SSSE3 intrinsics check to SSSE3 asm check in makefile
> 4. remove #include "pixman-combine32.h" and composite_over_8888_n_8888 in
> pixman-ssse3.c
> 5. ASM files changes:
>   1) change asm file name to pixman-ssse3-x86-32-asm.S and 
> pixman-ssse3-x86-64-asm.S
>   2) change the asm function name to "composite_line_src_x888_8888_ssse3"
>   3) remove "defined(_M_AMD64))"
>   4) Comments in the assembly about the store forwarding

Such changes will certainly improve the patch.

> We know there are some discussions on the asm code, e.g. MOVD, unifying 32
> bit and 64 bit code. But it won't introduce real defects. We can
> still put further changes in the next wave, to avoid the current
> sticking point.

No, we are not going to apply any patches that we know will need to be
fixed later, because I don't believe anyone is going to do that work.

It's fine to send a patch series if you think it's easier that way.

I am also interested in the answer to the question of whether this
code was generated by passing intrinsics through the Intel compiler.

> I am not sure about the issue with Sun Studio. Sun Studio declares GNU
> asm compatibility, but I am not sure whether it is 100% compatible. If the
> issue is caused by Sun Studio itself, can we add an #ifdef to exclude the
> SSSE3 patch for Sun Studio? In that case, how do we detect Sun
> Studio?

There are various approaches you can take with respect to unusual
compilers such as Sun Studio. The most common approach is to only
enable the new optimizations for the compilers that you are personally
interested in, and leave other compilers to people who are interested.

However, *you* introduced all this Sun Studio handling for SSSE3,
which means *you* need to be able to explain what it does, and justify
why it is there. 

Or to put it even more bluntly: Do not cut and paste code you don't
understand and can't test, and then ask me to rewrite the code for you
until it is good enough to go in.


Soren


Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-09-07 Thread Xu, Samuel
Hi, Soeren Sandmann and Siarhei Siamashka:

As a wrap-up of the current discussion, combining both of your comments, can we
assume this new patch of SSSE3 is OK?
New patch might contain:
1. Fix 64 bit CPU detection issue for MMX and SSE2
2. Add more comments for git commit log
3. change SSSE3 intrinsics check to SSSE3 asm check in makefile
4. remove #include "pixman-combine32.h" and composite_over_8888_n_8888 in
pixman-ssse3.c
5. ASM files changes:
1) change asm file name to pixman-ssse3-x86-32-asm.S and 
pixman-ssse3-x86-64-asm.S
2) change the asm function name to "composite_line_src_x888_8888_ssse3"
3) remove "defined(_M_AMD64))"
4) Comments in the assembly about the store forwarding
We know there are some discussions on the asm code, e.g. MOVD, unifying 32 bit
and 64 bit code. But it won't introduce real defects. We can still put further
changes in the next wave, to avoid the current sticking point.

I am not sure about the issue with Sun Studio. Sun Studio declares GNU asm
compatibility, but I am not sure whether it is 100% compatible. If the issue is
caused by Sun Studio itself, can we add an #ifdef to exclude the SSSE3 patch
for Sun Studio? In that case, how do we detect Sun Studio?

Thanks!
Samuel

-Original Message-
From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com] 
Sent: Wednesday, September 08, 2010 3:39 AM
To: Ma, Ling
Cc: Xu, Samuel; Soeren Sandmann; pixman@lists.freedesktop.org; Liu, Xinyun
Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

On Tuesday 07 September 2010 14:03:52 Ma, Ling wrote:
> > > > Wouldn't just the use of MOVD/MOVSS instructions here also solve
> > > > this problem?  Store forwarding does not seem to be used for SIMD
> > > > according to the manual. I haven't benchmarked anything yet
> > > > though.

> [Ma Ling] MOVD is not good for Atom because it will use the AGU: " * 
> Integer-FP/SIMD transfer: Instructions that transfer integer data to 
> the FP/SIMD side of the machine also uses AGU. Examples of these 
> instructions include MOVD, PINSRW. If one of the source register of 
> these instructions depends on the result of an execution unit, this 
> dependency will also cause a delay of 3 cycles." 12.3.2.2 Address 
> Generation

Well, a simple benchmark shows that MOVD performs at least as well as MOV if we 
compare the following loops:

 80483e0:   66 0f 6e 06 movd   (%esi),%xmm0
 80483e4:   66 0f eb c1 por%xmm1,%xmm0
 80483e8:   66 0f 7e 07 movd   %xmm0,(%edi)
 80483ec:   49  dec%ecx
 80483ed:   75 f1   jne80483e0 

vs.

 80483e0:   8b 06   mov(%esi),%eax
 80483e2:   09 d8   or %ebx,%eax
 80483e4:   89 07   mov%eax,(%edi)
 80483e6:   49  dec%ecx
 80483e7:   75 f7   jne80483e0 

Both of these loops use 3 cycles per iteration as can be easily measured by a 
simple modification of the benchmark program that I posted earlier.
The only advantage of non-SSE code here is that it is smaller. But you are 
anyway killing this code size advantage by keeping both forward and backwards 
copy variants.

The part from the optimization manual which you quoted is related to moving 
data between x86 and SSE registers, which is indeed a bad idea. But there is no 
need doing anything like this. You can just load data directly from memory to 
SSE registers, do some operations with it (using SSE instructions) and store 
the result to memory.

> > Your code still can be simplified a lot. I'm just not quite sure 
> > whether it would be more practical to commit something first and 
> > then refactor it with the follow up commits. Or attempt to make a 
> > "perfect" patch before committing.

> [Ma Ling] Yes, I agree with you, let us commit it first, then strengthen
> it, such as appending non-temporal instructions for large data copies
> which are over the L1 cache size.

Actually it's not up to me to decide. Much more important is whether Søren 
Sandmann agrees to this or not. I'm just afraid that trying to reach 
perfection, we may get stuck with no further progress at some point. Because 
for example some of the things, which are clear and simple for me, may be too 
unusual or unfamiliar for you. Or possibly the other way around.

But in any case, the other issues pointed by Søren still can and need to be 
addressed.

Just one last question to clarify things. Did you write this SSSE3 assembly 
code yourself or was the output of some C compiler (at least partially) used 
for it?

--
Best regards,
Siarhei Siamashka


Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-09-07 Thread Siarhei Siamashka
On Friday 03 September 2010 01:39:54 Soeren Sandmann wrote:
> Siarhei Siamashka  writes:
> > Apparently software prefetch also disables or interferes with the hardware
> > prefetcher on Intel Atom, hurting performance a lot. More advanced
> > processors can cope with it.
> > 
> > But increased prefetch distance is less effective (or can even decrease
> > performance) when dealing with small images, so it is not always good.
> > 
> > Are there any SSE2 capable x86 processors without hardware prefetch
> > capability? Maybe it's really a good idea to remove software prefetch
> > from SSE2 fast path code?
> 
> Yeah, it seems so. All data so far suggests that software prefetching
> is somewhere between 'slight slowdown' and 'no real effect'. The SSE2
> fast paths have very predictable memory access patterns, so a hardware
> prefetcher should be able to do a good job.
> 
> (It might be worth investigating whether software prefetch would be
> beneficial in the tiled rotation fast paths, since the access patterns
> there could be much harder to predict for the hardware).

So I guess it makes sense for Liu, Xinyun to resubmit a clean patch with 
software prefetch removal for sse2 code instead of commenting it out?

As suggested in:
http://lists.freedesktop.org/archives/pixman/2010-June/000220.html

-- 
Best regards,
Siarhei Siamashka




Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-09-07 Thread Siarhei Siamashka
On Tuesday 07 September 2010 14:03:52 Ma, Ling wrote:
> > > > Wouldn't just the use of MOVD/MOVSS instructions here also solve
> > > > this problem?  Store forwarding does not seem to be used for SIMD
> > > > according to the manual. I haven't benchmarked anything yet
> > > > though.

> [Ma Ling] MOVD is not good for Atom because it will use the AGU:
> " * Integer-FP/SIMD transfer: Instructions that transfer integer data to
> the FP/SIMD side of the machine also uses AGU. Examples of these
> instructions include MOVD, PINSRW. If one of the source register of these
> instructions depends on the result of an execution unit, this dependency
> will also cause a delay of 3 cycles." 12.3.2.2 Address Generation

Well, a simple benchmark shows that MOVD performs at least as well as MOV if we
compare the following loops:

 80483e0:   66 0f 6e 06 movd   (%esi),%xmm0
 80483e4:   66 0f eb c1 por%xmm1,%xmm0
 80483e8:   66 0f 7e 07 movd   %xmm0,(%edi)
 80483ec:   49  dec%ecx
 80483ed:   75 f1   jne80483e0 

vs.

 80483e0:   8b 06   mov(%esi),%eax
 80483e2:   09 d8   or %ebx,%eax
 80483e4:   89 07   mov%eax,(%edi)
 80483e6:   49  dec%ecx
 80483e7:   75 f7   jne80483e0 

Both of these loops use 3 cycles per iteration as can be easily measured by
a simple modification of the benchmark program that I posted earlier.
The only advantage of non-SSE code here is that it is smaller. But you are
anyway killing this code size advantage by keeping both forward and backwards
copy variants.

The part from the optimization manual which you quoted is related to moving
data between x86 and SSE registers, which is indeed a bad idea. But there is no
need doing anything like this. You can just load data directly from memory to
SSE registers, do some operations with it (using SSE instructions) and store
the result to memory.

> > Your code still can be simplified a lot. I'm just not quite sure whether it
> > would be more practical to commit something first and then refactor it with
> > the follow up commits. Or attempt to make a "perfect" patch before
> > committing.

> [Ma Ling] Yes, I agree with you, let us commit it first, then strengthen it,
> such as appending non-temporal instructions for large data copies which are
> over the L1 cache size.

Actually it's not up to me to decide. Much more important is whether
Søren Sandmann agrees to this or not. I'm just afraid that trying to reach
perfection, we may get stuck with no further progress at some point. Because
for example some of the things, which are clear and simple for me, may be too
unusual or unfamiliar for you. Or possibly the other way around.

But in any case, the other issues pointed by Søren still can and need to be
addressed.

Just one last question to clarify things. Did you write this SSSE3 assembly
code yourself or was the output of some C compiler (at least partially) used
for it?

-- 
Best regards,
Siarhei Siamashka




Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-09-07 Thread Ma, Ling
> Your code still can be simplified a lot. I'm just not quite sure whether it would
> be more practical to commit something first and then refactor it with the follow
> up commits. Or attempt to make a "perfect" patch before committing.
[Ma Ling] Yes, I agree with you, let us commit it first, then strengthen it,
such as appending non-temporal instructions for large data copies which are over
the L1 cache size.

Best Regards
Ling
> 
> 
>  pixman/pixman-access-ssse3_x86-64.S |   96 --
>  1 files changed, 12 insertions(+), 84 deletions(-)
> 
> diff --git a/pixman/pixman-access-ssse3_x86-64.S b/pixman/pixman-access-ssse3_x86-64.S
> index e7cf21f..0946d20 100755
> --- a/pixman/pixman-access-ssse3_x86-64.S
> +++ b/pixman/pixman-access-ssse3_x86-64.S
> @@ -248,116 +248,44 @@ L(shl_0_cache_less_16bytes):
> add %rdx, %rdi
> BRANCH_TO_JMPTBL_ENTRY (L(table_48bytes_fwd), %rdx, 4)
> 
> -L(shl_4):
> +.irp shift, 4, 8, 12
> +L(shl_\shift):
> lea -32(%rdx), %rdx
> ALIGN (4)
> -L(shl_4_loop):
> +L(shl_\shift\()_loop):
> movaps  16(%rsi), %xmm2
> sub $32, %rdx
> movaps  32(%rsi), %xmm3
> lea 32(%rsi), %rsi
> movdqa  %xmm3, %xmm4
> -   palignr $4, %xmm2, %xmm3
> +   palignr $\shift, %xmm2, %xmm3
> lea 32(%rdi), %rdi
> -   palignr $4, %xmm1, %xmm2
> +   palignr $\shift, %xmm1, %xmm2
> por %xmm6, %xmm2
> movaps  %xmm2, -32(%rdi)
> por %xmm6, %xmm3
> movaps  %xmm3, -16(%rdi)
> -   jb  L(shl_4_end)
> +   jb  L(shl_\shift\()_end)
> 
> movaps  16(%rsi), %xmm2
> sub $32, %rdx
> movaps  32(%rsi), %xmm3
> lea 32(%rsi), %rsi
> movdqa  %xmm3, %xmm1
> -   palignr $4, %xmm2, %xmm3
> +   palignr $\shift, %xmm2, %xmm3
> lea 32(%rdi), %rdi
> -   palignr $4, %xmm4, %xmm2
> +   palignr $\shift, %xmm4, %xmm2
> por %xmm6, %xmm2
> movaps  %xmm2, -32(%rdi)
> por %xmm6, %xmm3
> movaps  %xmm3, -16(%rdi)
> -   jae L(shl_4_loop)
> -L(shl_4_end):
> +   jae L(shl_\shift\()_loop)
> +L(shl_\shift\()_end):
> lea 32(%rdx), %rdx
> -   lea 4(%rsi, %rdx), %rsi
> -   add %rdx, %rdi
> -   BRANCH_TO_JMPTBL_ENTRY(L(table_48bytes_fwd), %rdx, 4)
> -
> -L(shl_8):
> -   lea -32(%rdx), %rdx
> -   ALIGN (4)
> -L(shl_8_loop):
> -   movaps  16(%rsi), %xmm2
> -   sub $32, %rdx
> -   movaps  32(%rsi), %xmm3
> -   lea 32(%rsi), %rsi
> -   movdqa  %xmm3, %xmm4
> -   palignr $8, %xmm2, %xmm3
> -   lea 32(%rdi), %rdi
> -   palignr $8, %xmm1, %xmm2
> -   por %xmm6, %xmm2
> -   movaps  %xmm2, -32(%rdi)
> -   por %xmm6, %xmm3
> -   movaps  %xmm3, -16(%rdi)
> -   jb  L(shl_8_end)
> -
> -   movaps  16(%rsi), %xmm2
> -   sub $32, %rdx
> -   movaps  32(%rsi), %xmm3
> -   lea 32(%rsi), %rsi
> -   movdqa  %xmm3, %xmm1
> -   palignr $8, %xmm2, %xmm3
> -   lea 32(%rdi), %rdi
> -   palignr $8, %xmm4, %xmm2
> -   por %xmm6, %xmm2
> -   movaps  %xmm2, -32(%rdi)
> -   por %xmm6, %xmm3
> -   movaps  %xmm3, -16(%rdi)
> -   jae L(shl_8_loop)
> -L(shl_8_end):
> -   lea 32(%rdx), %rdx
> -   lea 8(%rsi, %rdx), %rsi
> -   add %rdx, %rdi
> -   BRANCH_TO_JMPTBL_ENTRY(L(table_48bytes_fwd), %rdx, 4)
> -
> -L(shl_12):
> -   lea -32(%rdx), %rdx
> -   ALIGN (4)
> -L(shl_12_loop):
> -   movaps  16(%rsi), %xmm2
> -   sub $32, %rdx
> -   movaps  32(%rsi), %xmm3
> -   lea 32(%rsi), %rsi
> -   movdqa  %xmm3, %xmm4
> -   palignr $12, %xmm2, %xmm3
> -   lea 32(%rdi), %rdi
> -   palignr $12, %xmm1, %xmm2
> -   por %xmm6, %xmm2
> -   movaps  %xmm2, -32(%rdi)
> -   por %xmm6, %xmm3
> -   movaps  %xmm3, -16(%rdi)
> -   jb  L(shl_12_end)
> -
> -   movaps  16(%rsi), %xmm2
> -   sub $32, %rdx
> -   movaps  32(%rsi), %xmm3
> -   lea 32(%rsi), %rsi
> -   movdqa  %xmm3, %xmm1
> -   palignr $12, %xmm2, %xmm3
> -   lea 32(%rdi), %rdi
> -   palignr $12, %xmm4, %xmm2
> -   por %xmm6, %xmm2
> -   movaps  %xmm2, -32(%rdi)
> -   por %xmm6, %xmm3
> -   movaps  %xmm3, -16(%rdi)
> -   jae L(shl_12_loop)
> -L(shl_12_end):
> -   lea 32(%rdx), %rdx
> -   lea 12(%rsi, %rdx), %rsi
> +   lea \shift\()(%rsi, %rdx), %rsi
> add %rdx, %rdi
> BRANCH_TO_JMPTBL_ENTRY(L(table_48bytes_fwd), %rdx, 4)
> +.endr
> 
> ALIGN (4)
>  L(fwd_write_44bytes):


Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-09-07 Thread Ma, Ling
Hi Siarhei Siamashka

> Could you elaborate? Preferably with a reference to the relevant section of the
> optimization manual. Because it looks like exactly the store forwarding address
> aliasing issue to me.
[Ma Ling]: Currently it is not described in our optimization manual; it will
soon be published in a new version.

> My understanding is that Intel Atom processor has some special logic for fast
> handling of a quite common read-after-write memory access pattern (some
> operations with the local variables on stack for example). Unfortunately only
> the lowest 12 bits of the address are used for the initial selection of this
> fast path. And in case it turns out to be a false positive later in the
> pipeline, there is some performance penalty to handle this situation correctly.

[Ma Ling]: Intel does have this HW optimization, store forwarding, as you
mentioned, but there are some constraints on it; the feature doesn't cover the
cases we described before.

> Here is a simple test program which can be used for benchmarking:
> 
> /***/
> .intel_syntax noprefix
> .global main
> 
> .data
> 
> .balign 16
> buffer:
> .rept 8192
> .byte 0
> .endr
> 
> .text
> 
> main:
>         pusha
> 
>         lea     edi, buffer
> #ifdef ALIASING
>         lea     esi, [edi+4096]
> #else
>         lea     esi, [edi+4096+64]
> #endif
>         mov     ecx, 1660000000  /* 1.66GHz */
>         jmp     1f
> 
> /* main loop */
> .balign 16
> 1:
> #ifdef SSE
>         movd    dword ptr [edi], xmm0
>         movd    dword ptr [edi+4], xmm1
>         movd    xmm0, dword ptr [esi]
>         movd    xmm1, dword ptr [esi+4]
> #else
>         mov     dword ptr [edi], eax
>         mov     dword ptr [edi+4], ebx
>         mov     eax, dword ptr [esi]
>         mov     ebx, dword ptr [esi+4]
> #endif
>         dec     ecx
>         jnz     1b
> 
>         popa
>         ret
> /***/
> 
> $ gcc -m32 -o bench-x86-noaliasing bench-storefw.S
> $ gcc -m32 -DALIASING -o bench-x86-aliasing bench-storefw.S
> $ gcc -m32 -DSSE -o bench-sse-noaliasing bench-storefw.S
> $ gcc -m32 -DSSE -DALIASING -o bench-sse-aliasing bench-storefw.S
> 
> $ time ./bench-x86-noaliasing
> 
> real0m4.057s
> user0m4.032s
> sys 0m0.000s
> 
> $ time ./bench-x86-aliasing
> 
> real0m11.059s
> user0m11.041s
> sys 0m0.008s
> 
> $ time ./bench-sse-noaliasing
> 
> real0m4.048s
> user0m4.036s
> sys 0m0.004s
> 
> $ time ./bench-sse-aliasing
> 
> real0m4.046s
> user0m4.036s
> sys 0m0.004s
> 
> So each loop iteration always takes 4 cycles except when using standard x86
> mov instruction with the aliasing of the lowest 12 bits in the address. SSE
> instructions movd/movss do not have any aliasing problems.
> 
> > > Wouldn't just the use of MOVD/MOVSS instructions here also solve
> > > this problem?  Store forwarding does not seem to be used for SIMD
> > > according to the manual. I haven't benchmarked anything yet
> > > though.
> > >
> > >
> > > http://lists.cairographics.org/archives/pixman/2010-August/000425.html
> 
> I still think this needs to be clarified. Using movd or movss instructions there
> has an additional benefit that you can use the same operation on pixel data in
> all cases.

[Ma Ling] MOVD is not good for Atom because it will use the AGU:
" * Integer-FP/SIMD transfer: Instructions that transfer integer data to the
FP/SIMD side of the machine also uses AGU. Examples of these instructions
include MOVD, PINSRW. If one of the source register of these instructions
depends on the result of an execution unit, this dependency will also cause a
delay of 3 cycles." 12.3.2.2 Address Generation


Best Regards
Ling




Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-09-03 Thread Siarhei Siamashka
On Friday 03 September 2010 11:53:47 Xu, Samuel wrote:
> >* Siarhei asked whether it would be possible to unify the 32 and 64
> >
> >  bit assembly sources. I don't think you commented on that.
> 
> I think it is very difficult to unify the 32 and 64 bit assembly sources.

It's not so difficult if you change the explicit register names to some macros. 
Then you can have something like SRC_PTR in the code which can expand to %rsi 
for x86-64 and to %esi for x86.

There are other things which can be improved too.  For example the use of 
".irp/.endr" directive can reduce source code size by collapsing the repeatable 
blocks.

Your code still can be simplified a lot. I'm just not quite sure whether it
would be more practical to commit something first and then refactor it with
the follow up commits. Or attempt to make a "perfect" patch before committing.


 pixman/pixman-access-ssse3_x86-64.S |   96 --
 1 files changed, 12 insertions(+), 84 deletions(-)

diff --git a/pixman/pixman-access-ssse3_x86-64.S b/pixman/pixman-access-ssse3_x86-64.S
index e7cf21f..0946d20 100755
--- a/pixman/pixman-access-ssse3_x86-64.S
+++ b/pixman/pixman-access-ssse3_x86-64.S
@@ -248,116 +248,44 @@ L(shl_0_cache_less_16bytes):
add %rdx, %rdi
BRANCH_TO_JMPTBL_ENTRY (L(table_48bytes_fwd), %rdx, 4)
 
-L(shl_4):
+.irp shift, 4, 8, 12
+L(shl_\shift):
lea -32(%rdx), %rdx
ALIGN (4)
-L(shl_4_loop):
+L(shl_\shift\()_loop):
movaps  16(%rsi), %xmm2
sub $32, %rdx
movaps  32(%rsi), %xmm3
lea 32(%rsi), %rsi
movdqa  %xmm3, %xmm4
-   palignr $4, %xmm2, %xmm3
+   palignr $\shift, %xmm2, %xmm3
lea 32(%rdi), %rdi
-   palignr $4, %xmm1, %xmm2
+   palignr $\shift, %xmm1, %xmm2
por %xmm6, %xmm2
movaps  %xmm2, -32(%rdi)
por %xmm6, %xmm3
movaps  %xmm3, -16(%rdi)
-   jb  L(shl_4_end)
+   jb  L(shl_\shift\()_end)
 
movaps  16(%rsi), %xmm2
sub $32, %rdx
movaps  32(%rsi), %xmm3
lea 32(%rsi), %rsi
movdqa  %xmm3, %xmm1
-   palignr $4, %xmm2, %xmm3
+   palignr $\shift, %xmm2, %xmm3
lea 32(%rdi), %rdi
-   palignr $4, %xmm4, %xmm2
+   palignr $\shift, %xmm4, %xmm2
por %xmm6, %xmm2
movaps  %xmm2, -32(%rdi)
por %xmm6, %xmm3
movaps  %xmm3, -16(%rdi)
-   jae L(shl_4_loop)
-L(shl_4_end):
+   jae L(shl_\shift\()_loop)
+L(shl_\shift\()_end):
lea 32(%rdx), %rdx
-   lea 4(%rsi, %rdx), %rsi
-   add %rdx, %rdi
-   BRANCH_TO_JMPTBL_ENTRY(L(table_48bytes_fwd), %rdx, 4)
-
-L(shl_8):
-   lea -32(%rdx), %rdx
-   ALIGN (4)
-L(shl_8_loop):
-   movaps  16(%rsi), %xmm2
-   sub $32, %rdx
-   movaps  32(%rsi), %xmm3
-   lea 32(%rsi), %rsi
-   movdqa  %xmm3, %xmm4
-   palignr $8, %xmm2, %xmm3
-   lea 32(%rdi), %rdi
-   palignr $8, %xmm1, %xmm2
-   por %xmm6, %xmm2
-   movaps  %xmm2, -32(%rdi)
-   por %xmm6, %xmm3
-   movaps  %xmm3, -16(%rdi)
-   jb  L(shl_8_end)
-
-   movaps  16(%rsi), %xmm2
-   sub $32, %rdx
-   movaps  32(%rsi), %xmm3
-   lea 32(%rsi), %rsi
-   movdqa  %xmm3, %xmm1
-   palignr $8, %xmm2, %xmm3
-   lea 32(%rdi), %rdi
-   palignr $8, %xmm4, %xmm2
-   por %xmm6, %xmm2
-   movaps  %xmm2, -32(%rdi)
-   por %xmm6, %xmm3
-   movaps  %xmm3, -16(%rdi)
-   jae L(shl_8_loop)
-L(shl_8_end):
-   lea 32(%rdx), %rdx
-   lea 8(%rsi, %rdx), %rsi
-   add %rdx, %rdi
-   BRANCH_TO_JMPTBL_ENTRY(L(table_48bytes_fwd), %rdx, 4)
-
-L(shl_12):
-   lea -32(%rdx), %rdx
-   ALIGN (4)
-L(shl_12_loop):
-   movaps  16(%rsi), %xmm2
-   sub $32, %rdx
-   movaps  32(%rsi), %xmm3
-   lea 32(%rsi), %rsi
-   movdqa  %xmm3, %xmm4
-   palignr $12, %xmm2, %xmm3
-   lea 32(%rdi), %rdi
-   palignr $12, %xmm1, %xmm2
-   por %xmm6, %xmm2
-   movaps  %xmm2, -32(%rdi)
-   por %xmm6, %xmm3
-   movaps  %xmm3, -16(%rdi)
-   jb  L(shl_12_end)
-
-   movaps  16(%rsi), %xmm2
-   sub $32, %rdx
-   movaps  32(%rsi), %xmm3
-   lea 32(%rsi), %rsi
-   movdqa  %xmm3, %xmm1
-   palignr $12, %xmm2, %xmm3
-   lea 32(%rdi), %rdi
-   palignr $12, %xmm4, %xmm2
-   por %xmm6, %xmm2
-   movaps  %xmm2, -32(%rdi)
-   por %xmm6, %xmm3
-   movaps  %xmm3, -16(%rdi)
-   jae L(shl_12_loop)
-L(shl_12_end):
-   lea 32(%rdx), %rdx
-   lea 12(%rsi, %rdx), %rsi
+   lea \shift\()(%rsi, %rdx), %rsi
add %rdx, %rdi
BRANCH_TO_JMPTBL_ENTRY(L(table_48bytes_fwd), %rdx, 4)
+.endr
 
ALIGN (4)
 L(fwd_write_44bytes):



Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-09-03 Thread Siarhei Siamashka
On Friday 03 September 2010 11:53:47 Xu, Samuel wrote:
> >* Store forwarding
> >
> >  - We need some comments in the assembly about the store forwarding
> >  
> >that Ma Ling described.
> 
> How about this comment added to the asm code:
> "The CPU doesn't check every address bit for src and dest, so a read can fail
> to recognize a different address even when they are in different pages; the
> read operation then has to force the previous write operation to commit data
> from the store buffer to the cache, which impacts performance seriously. This
> is a workaround to avoid this CPU limitation."
> 
> >  - Siarhei's questions about it should be answered, preferably in
> >  
> >those comments:
> > So it is basically a store forwarding aliasing problem which is
> > described in "12.3.3.1 Store Forwarding" section from "Intel(r) 64
> > and IA-32 Architectures Optimization Reference Manual", right?
> 
> The behavior doesn't belong to store forwarding as described in the
> "12.3.3.1 Store Forwarding" section of the "Intel(r) 64 and IA-32
> Architectures Optimization Reference Manual". That is another HW
> optimization.

Could you elaborate? Preferably with a reference to the relevant section of the
optimization manual. Because it looks like exactly the store forwarding
address aliasing issue to me.

My understanding is that Intel Atom processor has some special logic for fast 
handling of a quite common read-after-write memory access pattern (some 
operations with the local variables on stack for example). Unfortunately only 
the lowest 12 bits of the address are used for the initial selection of this 
fast path. And in case it turns out to be a false positive later in the 
pipeline, there is some performance penalty to handle this situation correctly.
(In the test below, the ALIASING variant places the read address exactly 4096
bytes past the write address, so the two share their lowest 12 bits.)

Here is a simple test program which can be used for benchmarking:

/***/
.intel_syntax noprefix
.global main

.data

.balign 16
buffer:
.rept 8192
.byte 0
.endr

.text

main:
        pusha

        lea     edi, buffer
#ifdef ALIASING
        lea     esi, [edi+4096]
#else
        lea     esi, [edi+4096+64]
#endif
        mov     ecx, 1660000000  /* 1.66GHz */
        jmp     1f

/* main loop */
.balign 16
1:
#ifdef SSE
        movd    dword ptr [edi], xmm0
        movd    dword ptr [edi+4], xmm1
        movd    xmm0, dword ptr [esi]
        movd    xmm1, dword ptr [esi+4]
#else
        mov     dword ptr [edi], eax
        mov     dword ptr [edi+4], ebx
        mov     eax, dword ptr [esi]
        mov     ebx, dword ptr [esi+4]
#endif
        dec     ecx
        jnz     1b

        popa
        ret
/***/

$ gcc -m32 -o bench-x86-noaliasing bench-storefw.S 
$ gcc -m32 -DALIASING -o bench-x86-aliasing bench-storefw.S 
$ gcc -m32 -DSSE -o bench-sse-noaliasing bench-storefw.S 
$ gcc -m32 -DSSE -DALIASING -o bench-sse-aliasing bench-storefw.S 

$ time ./bench-x86-noaliasing 

real0m4.057s
user0m4.032s
sys 0m0.000s

$ time ./bench-x86-aliasing 

real0m11.059s
user0m11.041s
sys 0m0.008s

$ time ./bench-sse-noaliasing 

real0m4.048s
user0m4.036s
sys 0m0.004s

$ time ./bench-sse-aliasing 

real0m4.046s
user0m4.036s
sys 0m0.004s

So each loop iteration always takes 4 cycles except when using standard x86 mov 
instruction with the aliasing of the lowest 12 bits in the address. SSE 
instructions movd/movss do not have any aliasing problems.

> > Wouldn't just the use of MOVD/MOVSS instructions here also solve
> > this problem?  Store forwarding does not seem to be used for SIMD
> > according to the manual. I haven't benchmarked anything yet
> > though.
> > 
> >http://lists.cairographics.org/archives/pixman/2010-August/000425.html

I still think this needs to be clarified. Using movd or movss instructions
there has an additional benefit that you can use the same operation on pixel
data in all cases.

Currently you use "por %xmm6, %xmm0" in the main loop and "or 40(%rsi), %ecx" 
for the trailing pixels. You could easily use just "por" instruction
everywhere. It's not a big deal here, just inconsistent. With your approach and 
more complex compositing operations, you would need to maintain both sse2
and x86 implementations separately for the main loop and for the trailing
pixels. This can and IMHO should be avoided.

-- 
Best regards,
Siarhei Siamashka




Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-09-03 Thread Xu, Samuel
Thanks Soren!
Following are my comments.
I'd appreciate your explicit code guidance on the Sun Studio check; I have
never had any experience with Sun Studio.
After confirming the Sun Studio related items, we can send the next patch to
address the rest of the issues.

Samuel

> * The 64 bit CPU detection is broken. It doesn't use SSE2 and MMX when
>  SSSE3 is not detected.
Confirmed and will fix in the next patch as follows:
...
#endif /* end dt_cpu64() */

#ifdef USE_MMX
#if (!defined(__amd64__) && !defined(__x86_64__) && !defined(_M_AMD64)) 
//32 bit MMX
static pixman_bool_t
pixman_have_mmx (void)
{
..
return mmx_present;
}
#else 
//64 bit MMX
#define pixman_have_mmx() TRUE
#endif
#endif

#ifdef USE_SSE2
#if (!defined(__amd64__) && !defined(__x86_64__) && !defined(_M_AMD64)) 
//32 bit SSE2
static pixman_bool_t
pixman_have_sse2 (void)
{
...
return sse2_present;
}
#else 
//64 bit SSE2
#define pixman_have_sse2() TRUE
#endif
#endif


>* Siarhei asked whether it would be possible to unify the 32 and 64
>  bit assembly sources. I don't think you commented on that.
I think it is very difficult to unify the 32 and 64 bit assembly sources.

>* We need a more detailed description in the git commit log
Will add more detail in next patch

>* The check for the Sun Studio compiler is wrong. If SSSE3 is only
>  supported in the latest version of Sun Studio, then configure should
>  check for that version.
Could you show us the error information from Sun Studio? I have no experience
with Sun Studio. I'd appreciate your specific code guidance on the Sun Studio
check.

>* Will Sun Studio actually accept GNU assembly files?
http://developers.sun.com/sunstudio/support/Ccompare.html mentioned that " 
Support for GNU-style inline assembly ... "

> * Why check for SSSE3 intrinsics at all? You don't use them for
>   anything. All use of SSSE3 is in assembly files. The thing that
>   needs to be checked for is whether the assembly files will pass
>   through the compiler.
Yes, it should be an SSSE3 asm check. Confirmed and will fix in the next patch.

>* Store forwarding

>  - We need some comments in the assembly about the store forwarding
>that Ma Ling described.
How about this comment added to the asm code:
"The CPU doesn't check every address bit for src and dest, so a read can fail
to recognize a different address even when they are in different pages; the
read operation then has to force the previous write operation to commit data
from the store buffer to the cache, which impacts performance seriously. This
is a workaround to avoid this CPU limitation."

>  - Siarhei's questions about it should be answered, preferably in
>those comments:

> So it is basically a store forwarding aliasing problem which is
> described in "12.3.3.1 Store Forwarding" section from "Intel(r) 64
> and IA-32 Architectures Optimization Reference Manual", right?
The behavior doesn't belong to store forwarding as described in the "12.3.3.1
Store Forwarding" section of the "Intel(r) 64 and IA-32 Architectures
Optimization Reference Manual". That is another HW optimization.

> Wouldn't just the use of MOVD/MOVSS instructions here also solve
> this problem?  Store forwarding does not seem to be used for SIMD
> according to the manual. I haven't benchmarked anything yet
> though.

> http://lists.cairographics.org/archives/pixman/2010-August/000425.html

>* About this line in the 64 bit assembly and the corresponding one in
>  the 32 bit assembly:

>   #if (defined(__amd64__) || defined(__x86_64__) ||defined(_M_AMD64))

>   Makoto Kato asked:

>Why _M_AMD64?  _M_AMD64 macro is for MSVC++ x64 version.  This
>assembler is for GNU as.

>   You didn't comment on this.
Will remove _M_AMD64 inside asm file in next patch

>* We might as well be consistent with the naming used in the ARM NEON
>  assembly files:

>  - The assembly file should be called "pixman-ssse3-asm.S", or
>"pixman-ssse3-x86-32-asm.S and pixman-ssse3-x86-64-asm.S" if there
>is a good reason they can't be unified.
Will change file name in next patch
>  - The function in the assembly should be called something like

>composite_line_src_x888_8888_ssse3
Will change function name in next patch
>"line" to indicate the fact that it doesn't do full 2D, and the
>"fast_path" in the current name is confusing because it suggest a
>connection to the plain C 'fast path' implementation (there is no
>such connection).

>* Cut-and-paste issues in pixman-ssse3:

>  - #include "pixman-combine32.h"
Will remove it in next patch
>  This is not used as far as I can tell


>  - /* PIXMAN_OP_OVER */
  
>  The two fast paths are both SRC fast paths, not OVER.


>  - /*-----------------------------------------
>     * composite_over_8888_n_8888
>     */
>
>    The function is src_x888_8888(), not over_8888_n_8888(). But the
>    comment is unnecessary since there is only one function.

Will remove it in next patch



Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-09-02 Thread Soeren Sandmann
Siarhei Siamashka  writes:

> Looks like this data has been posted already:
> http://lists.freedesktop.org/archives/pixman/2010-June/000231.html
> 
> Checking a few more things with microbenchmarks shows that the prefetch 
> distance of just 64 bytes ahead is way too small. 

We should probably get these microbenchmarks into the tree.

Microbenchmarks can be a little dangerous since they sometimes amplify
irrelevant details and make them seem important, but they are clearly
useful in many cases.

> It has to be increased up to something like 256-320 to get good
> memory performance. Apparently software prefetch also disables or
> interferes with the hardware prefetcher on Intel Atom, hurting
> performance a lot. More advanced processors can cope with it.
> 
> But increased prefetch distance is less effective (or can even decrease 
> performance) when dealing with small images, so it is not always good.
> 
> Are there any SSE2 capable x86 processors without hardware prefetch 
> capability? 
> Maybe it's really a good idea to remove software prefetch from SSE2 fast path 
> code?

Yeah, it seems so. All data so far suggests that software prefetching
is somewhere between 'slight slowdown' and 'no real effect'. The SSE2
fast paths have very predictable memory access patterns, so a hardware
prefetcher should be able to do a good job.

(It might be worth investigating whether software prefetch would be
beneficial in the tiled rotation fast paths, since the access patterns
there could be much harder to predict for the hardware).
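
For reference, the kind of loop being discussed looks roughly like the sketch
below. This is not pixman code; the helper name and the PREFETCH_DISTANCE value
are made up for illustration, with 256 taken from the distances quoted above.

#include <stdint.h>

#define PREFETCH_DISTANCE 256 /* bytes ahead; 64 was reported as way too small */

/* Sketch of a scanline copy with software prefetch. Whether this helps,
 * does nothing, or interferes with the hardware prefetcher is exactly
 * what the benchmarks in this thread are probing. */
static void
copy_scanline_prefetch (uint32_t *dst, const uint32_t *src, int w)
{
    int i;

    for (i = 0; i < w; i++)
    {
        if ((i & 15) == 0) /* once per 64-byte cache line of 32bpp pixels */
            __builtin_prefetch (src + i + PREFETCH_DISTANCE / 4);
        dst[i] = src[i];
    }
}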


Soren


Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-09-02 Thread Soeren Sandmann
"Xu, Samuel"  writes:

> Hi, Siarhei Siamashka:
>   The attached patch has an updated copyright part (only the copyright changed). We 
> referred to http://cgit.freedesktop.org/pixman/tree/COPYING.
>   Yes, As you assumed, we tested on multiple 32/64 bit boxes w/o
> and w/ SSSE3. 

Here are some more comments:

* The 64 bit CPU detection is broken. It doesn't use SSE2 and MMX when
  SSSE3 is not detected.

* Siarhei asked whether it would be possible to unify the 32 and 64
  bit assembly sources. I don't think you commented on that.

* We need a more detailed description in the git commit log

* The check for the Sun Studio compiler is wrong. If SSSE3 is only
  supported in the latest version of Sun Studio, then configure should
  check for that version.

* Will Sun Studio actually accept GNU assembly files?

* Why check for SSSE3 intrinsics at all? You don't use them for
  anything. All use of SSSE3 is in assembly files. The thing that
  needs to be checked for is whether the assembly files will pass
  through the compiler.

* Store forwarding

  - We need some comments in the assembly about the store forwarding
that Ma Ling described.

  - Siarhei's questions about it should be answered, preferably in
those comments:

 So it is basically a store forwarding aliasing problem which is
 described in "12.3.3.1 Store Forwarding" section from "Intel® 64
 and IA-32 Architectures Optimization Reference Manual", right?

 Wouldn't just the use of MOVD/MOVSS instructions here also solve
 this problem?  Store forwarding does not seem to be used for SIMD
 according to the manual. I haven't benchmarked anything yet
 though.

 http://lists.cairographics.org/archives/pixman/2010-August/000425.html

* About this line in the 64 bit assembly and the corresponding one in
  the 32 bit assembly:

   #if (defined(__amd64__) || defined(__x86_64__) ||defined(_M_AMD64))

   Makoto Kato asked:

Why _M_AMD64?  _M_AMD64 macro is for MSVC++ x64 version.  This
assembler is for GNU as.

   You didn't comment on this.

* We might as well be consistent with the naming used in the ARM NEON
  assembly files:

  - The assembly file should be called "pixman-ssse3-asm.S", or
"pixman-ssse3-x86-32-asm.S and pixman-ssse3-x86-64-asm.S" if there
is a good reason they can't be unified.

  - The function in the assembly should be called something like

composite_line_src_x888_8888_ssse3

"line" to indicate the fact that it doesn't do full 2D, and the
"fast_path" in the current name is confusing because it suggest a
connection to the plain C 'fast path' implementation (there is no
such connection).

* Cut-and-paste issues in pixman-ssse3:

  - #include "pixman-combine32.h"

  This is not used as far as I can tell


  - /* PIXMAN_OP_OVER */
  
  The two fast paths are both SRC fast paths, not OVER.


  - /*-
* composite_over_8888_n_8888
*/
  
The function is src_x888_8888(), not over_8888_n_8888(). But the
comment is unnecessary since there is only one function.


Soren


Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-09-02 Thread Siarhei Siamashka
On Monday 30 August 2010 23:31:35 Siarhei Siamashka wrote:
> And Intel Atom does not like software prefetch very much. This reminds me an
> older report:
> http://lists.freedesktop.org/archives/pixman/2010-June/000218.html
> 
> I can try to run a full set of cairo-perf-trace benchmarks to get more
> relevant numbers regarding the effect of software prefetch on Intel Atom.

Looks like this data has been posted already:
http://lists.freedesktop.org/archives/pixman/2010-June/000231.html

Checking a few more things with microbenchmarks shows that the prefetch 
distance of just 64 bytes ahead is way too small. It has to be increased up to 
something like 256-320 to get good memory performance. Apparently software 
prefetch also disables or interferes with the hardware prefetcher on Intel 
Atom, hurting performance a lot. More advanced processors can cope with it.

But increased prefetch distance is less effective (or can even decrease 
performance) when dealing with small images, so it is not always good.

Are there any SSE2 capable x86 processors without hardware prefetch capability? 
Maybe it's really a good idea to remove software prefetch from SSE2 fast path 
code?

-- 
Best regards,
Siarhei Siamashka




Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-08-30 Thread Siarhei Siamashka
On Monday 30 August 2010 16:06:45 Siarhei Siamashka wrote:
> Also I tried to run my simple microbenchmarking program on Intel Atom N450
> netbook, x86_64 system. The results are the following:
> 
> --
> All results are presented in millions of pixels per second
> L1  - small Xx1 rectangle (fitting L1 cache), always blitted at the same
>   memory location with small drift in horizontal direction
> L2  - small XxY rectangle (fitting L2 cache), always blitted at the same
>   memory location with small drift in horizontal direction
> M   - large 1856x1080 rectangle, always blitted at the same
>   memory location with small drift in horizontal direction
> HT  - random rectangles with 32x32 average size are copied from
>   one 1920x1080 buffer to another, traversing from left to right
>   and from top to bottom
> VT  - random rectangles with 32x32 average size are copied from
>   one 1920x1080 buffer to another, traversing from top to bottom
>   and from left to right
> R   - random rectangles with 32x32 average size are copied from
>   random locations of one 1920x1080 buffer to another
> ---
> reference memcpy speed = 1007.2MB/s (251.8MP/s for 32bpp pixels)
> 
> --- C ---
>  src_x888_8888 = L1: 265.13 L2: 223.61 M:216.12 HT:109.69 VT: 80.23 R: 75.42
> 
> --- SSE2 ---
>  src_x888_8888 = L1: 611.50 L2: 494.79 M:120.17 HT: 83.57 VT: 79.48 R: 60.30
> 
> --- SSE2 (prefetch removed) ---
>  src_x888_8888 = L1: 683.39 L2: 539.57 M:260.07 HT:128.22 VT: 85.23 R: 81.40
> 
> --- SSSE3 ---
>  src_x888_8888 = L1:1559.35 L2: 798.95 M:254.31 HT:129.29 VT: 84.67 R: 75.00

And now a benchmark from Intel Core i7 860 too (with the buffer sizes
increased to be sure that they are much larger than 8MB cache):

---
reference memcpy speed = 6491.5MB/s (1622.9MP/s for 32bpp pixels)

--- C ---
 src_x888_8888 = L1:1145.72 L2:1124.48 M:902.11 HT:438.52 VT:351.58 R:316.74

--- SSE2 ---
 src_x888_8888 = L1:5628.96 L2:2766.65 M:1060.89 HT:576.04 VT:488.81 R:418.52

--- SSE2 (prefetch removed) ---
 src_x888_8888 = L1:6140.86 L2:2730.58 M:1066.52 HT:594.88 VT:484.12 R:423.24

--- SSSE3 ---
 src_x888_8888 = L1:5182.91 L2:2617.74 M:1064.30 HT:673.20 VT:568.15 R:441.97


So looks like on Intel Core i7 the differences between all the SSE2/SSSE3
implementations are mostly irrelevant. Standard memcpy shows better
memory bandwidth because it uses nontemporal writes for copying very large
buffers. If using save_128_write_combining() instead of save_128_aligned() in
sse2_composite_src_x888_(), then src_x888_ operation becomes just as
fast as memcpy.

For Intel Atom, SSSE3 seems to improve performance a lot when working with
L1/L2 cache, but has no advantage when working with large memory buffers. Still
I think it may be useful for hyperthreading (the other virtual thread may have
more resources to do something while src_x888_8888 is waiting for memory). On
the other hand, nontemporal writes can increase performance noticeably if the
buffer is known to be much bigger than L2 cache.
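
To make that last point concrete, a write-combining variant could look like the
sketch below (an assumed helper, not the actual pixman function; it requires a
16-byte aligned dst and handles only a multiple of 4 pixels):

#include <stdint.h>
#include <emmintrin.h>

/* Sketch: x888 -> 8888 copy using nontemporal (streaming) stores, which
 * bypass the cache and should pay off only when the destination buffer
 * is known to be much bigger than the L2 cache. */
static void
src_x888_8888_stream (uint32_t *dst, const uint32_t *src, int w)
{
    const __m128i alpha = _mm_set1_epi32 ((int) 0xff000000);

    for (; w >= 4; w -= 4, src += 4, dst += 4)
    {
        __m128i s = _mm_loadu_si128 ((const __m128i *) src);
        _mm_stream_si128 ((__m128i *) dst, _mm_or_si128 (s, alpha));
    }
    _mm_sfence (); /* order the streaming stores before later accesses */
}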

And Intel Atom does not like software prefetch very much. This reminds me an
older report:
http://lists.freedesktop.org/archives/pixman/2010-June/000218.html

I can try to run a full set of cairo-perf-trace benchmarks to get more relevant
numbers regarding the effect of software prefetch on Intel Atom.

-- 
Best regards,
Siarhei Siamashka




Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-08-30 Thread Siarhei Siamashka
On Monday 30 August 2010 12:26:27 Xu, Samuel wrote:
> Hi, Siarhei Siamashka:
>   Sorry for a typo, fixed version is attached. Pls ignore previous mail.
> 
>   New patch, which contains
>   1) Simplified 64 bit detect_cpu_features(), which only checks the SSSE3 bit. 
> The _MSC_VER path switched to __cpuid to avoid inline asm. The _MSC_VER CPUID 
> code snippet was extracted and tested in a standalone C file on a 64 bit Windows
> system, using VS2010. 2) Removed "merging"; pixman-ssse3.c is shaped as per
> our discussion.

+/*
+ * Copyright 2010 Intel Corporation
+ *
+ * Permission to use, copy, modify, distribute, and sell this software and its
+ * documentation for any purpose is hereby granted without fee, provided that
+ * the above copyright notice appear in all copies and that both that
+ * copyright notice and this permission notice appear in supporting
+ * documentation, and that the name of Mozilla Corporation not be used in
^^^

Mozilla Corporation? Looks like this is yet another case of careless copy/paste.

And I would like to remind again that the following text of copyright notice is
preferred for new code. It is not strictly necessary, but I just want to be
sure that you checked this link:
http://cgit.freedesktop.org/pixman/tree/COPYING

Other than that, and assuming that the patch was properly tested on 32-bit and 
64-bit machines both with and without SSSE3 support, I don't see any remaining
real blocker issues. Or more like I'm giving up and would like to pass the 
baton to somebody else.


Also I tried to run my simple microbenchmarking program on Intel Atom N450 
netbook, x86_64 system. The results are the following:

--
All results are presented in millions of pixels per second
L1  - small Xx1 rectangle (fitting L1 cache), always blitted at the same
  memory location with small drift in horizontal direction
L2  - small XxY rectangle (fitting L2 cache), always blitted at the same
  memory location with small drift in horizontal direction
M   - large 1856x1080 rectangle, always blitted at the same
  memory location with small drift in horizontal direction
HT  - random rectangles with 32x32 average size are copied from
  one 1920x1080 buffer to another, traversing from left to right
  and from top to bottom
VT  - random rectangles with 32x32 average size are copied from
  one 1920x1080 buffer to another, traversing from top to bottom
  and from left to right
R   - random rectangles with 32x32 average size are copied from
  random locations of one 1920x1080 buffer to another
---
reference memcpy speed = 1007.2MB/s (251.8MP/s for 32bpp pixels)

--- C ---
 src_x888_8888 = L1: 265.13 L2: 223.61 M:216.12 HT:109.69 VT: 80.23 R: 75.42

--- SSE2 ---
 src_x888_8888 = L1: 611.50 L2: 494.79 M:120.17 HT: 83.57 VT: 79.48 R: 60.30

--- SSE2 (prefetch removed) ---
 src_x888_8888 = L1: 683.39 L2: 539.57 M:260.07 HT:128.22 VT: 85.23 R: 81.40

--- SSSE3 ---
 src_x888_8888 = L1:1559.35 L2: 798.95 M:254.31 HT:129.29 VT: 84.67 R: 75.00


-- 
Best regards,
Siarhei Siamashka




Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-08-29 Thread Siarhei Siamashka
On Sunday 29 August 2010 12:25:47 Xu, Samuel wrote:
> Hi, Siarhei Siamashka:
> Q:---" What problems do you have without "merge" mechanism?"
> A: Of course there isn't a correctness issue w/o "merge".
> Currently, sse2_fast_paths/mmx_fast_paths/c_fast_paths... exclude each
> other, although the checks form a delegate chain. Once the delegate
> chain is formed, only one fast path table takes effect. Is my understanding correct? So
> after ssse3_fast_paths is newly added, only the SSSE3 table will be in effect
> once the SSSE3 CPUID is detected. Of course we can just keep the 2 new entries
> with SSSE3 asm optimizations; the issue is that we lose the optimizations
> which already exist in the current SSE2 table. So, w/o merging or w/o total
> duplication from the SSE2 file, it will be performance unfriendly.

No, all the fast path tables are tried one after another, so there should be no 
problem using SSE2 and MMX fast paths as fallbacks. That's the whole point of
having delegates. If you see any issue here, that's likely a bug.

Still merging fast path tables may make the search for the needed fast path 
function a bit faster in the case of a fast path cache miss. And there might be 
a bit more motivation to pursue such optimizations once SSE2 is not at the top of 
the delegate chain anymore ;) This is especially important for the combiner 
functions which are called for each scanline, so walking the delegate chain is 
highly undesirable there. The simple solution at least for the combiner 
functions is to just copy all the pointers for unimplemented functions from the 
previous implementation at setup time. And thus make sure that 
'delegate_combine_32' is never ever called. I tried to propose something like 
this for 'pixman_blt' delegates earlier (for example, see "imp->blt = fast->blt;" 
for the changes in pixman-vmx.c):
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=no-delegates
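
A minimal sketch of that setup-time copying, with hypothetical SSSE3 names
(ssse3_fast_paths, ssse3_combine_src_u) and pixman internals written from
memory, so the exact fields and helpers may differ from the current tree:

pixman_implementation_t *
_pixman_implementation_create_ssse3 (pixman_implementation_t *fallback)
{
    pixman_implementation_t *imp =
        _pixman_implementation_create (fallback, ssse3_fast_paths);
    int op;

    /* Inherit every combiner from the fallback at setup time, so that a
     * missing SSSE3 combiner never walks the delegate chain per scanline. */
    for (op = 0; op < PIXMAN_N_OPERATORS; op++)
        imp->combine_32[op] = fallback->combine_32[op];

    /* Then override only what SSSE3 actually implements. */
    imp->combine_32[PIXMAN_OP_SRC] = ssse3_combine_src_u;

    return imp;
}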

But as I mentioned before, any merging of fast path tables or improvement for
the delegates chain handling is an additional optimization. And it belongs to a
separate patch.

> Sure, it is OK for us to follow this "correctness first, then performance in
> the next wave" philosophy. A new SSSE3 file with only 2 fast path entries is
> OK?

Yes, this looks right.

> For simplifying the CPU detection, I think the same "correctness first, then
> performance in the next wave" philosophy might be followed. A full CPU
> detection for 64 bit can be added first as a baseline, and those who can
> test win32 and Solaris can help us to make it shorter.

What "correctness" you are talking about? As already noticed by  
Makoto Kato earlier [1], MSVC does not support inline assembly in 64-bit
mode  So your patch was clearly broken and untested in this configuration.

IMHO it is always easier to add missing functionality in the future (support
for SSSE3 optimizations on the systems other than linux) than removing broken
untested system-dependent garbage from the source code. Especially considering
that these other systems still need to do something about the GNU assembler
part of your patch. But again, a comment from somebody else would be very much
welcome here. I don't have a strong opinion about this part.


1. http://lists.freedesktop.org/archives/pixman/2010-August/000424.html

-- 
Best regards,
Siarhei Siamashka




Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-08-29 Thread Xu, Samuel
Hi, Siarhei Siamashka:
Q:---" What problems do you have without "merge" mechanism?"
A: Of course there isn't a correctness issue w/o "merge".
Currently, sse2_fast_paths/mmx_fast_paths/c_fast_paths... exclude each 
other, although the checks form a delegate chain. Once the delegate chain is 
formed, only one fast path table takes effect. Is my understanding correct?
So after ssse3_fast_paths is newly added, only the SSSE3 table will be in effect 
once the SSSE3 CPUID is detected. Of course we can just keep the 2 new entries 
with SSSE3 asm optimizations; the issue is that we lose the optimizations which 
already exist in the current SSE2 table. So, w/o merging or w/o total 
duplication from the SSE2 file, it will be performance unfriendly. 

Sure, it is OK for us to follow this "correctness first, then performance in the 
next wave" philosophy. A new SSSE3 file with only 2 fast path entries is OK?

For simplifying the CPU detection, I think the same "correctness first, then 
performance in the next wave" philosophy might be followed. A full CPU detection 
for 64 bit can be added first as a baseline, and those who can test win32 and 
Solaris can help us to make it shorter. 

To wrap up, the new patch will:
1) keep most of the 64 bit CPU detection as in the current patch (the GNU C part 
can drop some edx checking)
2) add a new SSSE3 file with only 2 fast path entries for the newly added ASM 
optimization

Comments?

Samuel


-Original Message-
From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com] 
Sent: Friday, August 27, 2010 10:57 PM
To: Xu, Samuel
Cc: pixman@lists.freedesktop.org; Ma, Ling; Liu, Xinyun
Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

On Friday 27 August 2010 15:00:49 Xu, Samuel wrote:
> Hi, Siarhei Siamashka:
>   Thanks for quick response!
>   For the 64 bit detect_cpu_features(), if we ignore HAVE_GETISAX and _MSC_VER,
>   it is ok for us to simplify it as in your example in the next update.

If you can ensure MSVC compatibility and make it work with your optimizations, 
then it would be really great. But if it is totally untested, I don't feel 
comfortable about having it just blindly replicated from 32 to 64 bits with the 
hope that it will work.

It's just my opinion, the others may disagree. And the others may also try to 
test your patch on win32 or solaris systems, providing a lot more useful 
feedback than me.

> For pixman-ssse3.c, maybe we have 2 options:
>  1) duplicate 6562 lines from pixman-sse2.c to a new pixman-
> ssse3.c in a 1st patch (of course replacing 2 entries with the newly added
> SSSE3 asm optimization), and then add the "merge" mechanism in a later patch.

No, there is no need to duplicate anything.

>  2) firstly add the "merge" mechanism patch, and then add the new pixman-ssse3.c in a
> later patch, which might be shorter (111 lines). Does it mean 
> that option 1) is preferred?

What problems do you have without "merge" mechanism? The pixman-sse2.c works 
fine without it, and it does properly fallback to MMX code if SSE2 does not 
support some operations. Similarly, SSSE3 can fallback to SSE2 in the very same 
way.

--
Best regards,
Siarhei Siamashka


Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-08-27 Thread Siarhei Siamashka
On Friday 27 August 2010 15:00:49 Xu, Samuel wrote:
> Hi, Siarhei Siamashka:
>   Thanks for quick response!
>   For the 64 bit detect_cpu_features(), if we ignore HAVE_GETISAX and _MSC_VER,
>   it is ok for us to simplify it as in your example in the next update.

If you can ensure MSVC compatibility and make it work with your optimizations, 
then it would be really great. But if it is totally untested, I don't feel
comfortable about having it just blindly replicated from 32 to 64 bits with the 
hope that it will work.

It's just my opinion, the others may disagree. And the others may also try to
test your patch on win32 or solaris systems, providing a lot more useful
feedback than me.

> For pixman-ssse3.c, maybe we have 2 options:
>  1) duplicate 6562 lines from pixman-sse2.c to a new pixman-
> ssse3.c in a 1st patch (of course replacing 2 entries with the newly added
> SSSE3 asm optimization), and then add the "merge" mechanism in a later patch.

No, there is no need to duplicate anything.

>  2) firstly add the "merge" mechanism patch, and then add the new pixman-ssse3.c in a
> later patch, which might be shorter (111 lines). Does it mean 
> that option 1) is preferred?

What problems do you have without "merge" mechanism? The pixman-sse2.c works
fine without it, and it does properly fallback to MMX code if SSE2 does not
support some operations. Similarly, SSSE3 can fallback to SSE2 in the very same 
way.

-- 
Best regards,
Siarhei Siamashka




Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-08-27 Thread Xu, Samuel
I guess detect_cpu_features() is trying to support both MSVC and GNU. For 
CPUID, an intrinsic is not used; I am not sure of the real reason. Also, I didn't 
find a CPUID intrinsic in the intrinsics header for GNU.

-Original Message-
From: Makoto Kato [mailto:m_k...@ga2.so-net.ne.jp] 
Sent: Friday, August 27, 2010 2:15 PM
To: Xu, Samuel
Cc: pixman@lists.freedesktop.org
Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

  Hi, Xu.

 > --- /dev/null
 > +++ b/pixman/pixman-access-ssse3_x86-64.S
..
 > +#if (defined(__amd64__) || defined(__x86_64__) ||defined(_M_AMD64))

why _M_AMD64?  _M_AMD64 macro is for MSVC++ x64 version.  This assembler 
is for GNU as.

Also, if MSVC++ 8.0 (14.00) or later, we should use __cpuid for x64 and 
x86 compatibility.
http://msdn.microsoft.com/ja-jp/library/hskdteyh.aspx


-- Makoto

On 2010/08/27 11:59, Xu, Samuel wrote:
> Hi Siarhei Siamashka,
>
> Here is a new patch, can you review it? Thank you!
> It addresses the following suggestions:
> 1: The SSSE3 file is split into a new file. Compared with duplicating every 
> piece of content from the SSE2 file, I added a way to merge the new fast path 
> function pointer table at runtime by CPUID check, to avoid huge code expansion 
> and a bug you pointed out. (Without runtime merging, I would have to duplicate 
> every line from the SSE2 file to the SSSE3 file, which is really not graceful, 
> and redundant.)
> 2: Copyright information added
> 3: Makefile fix
> 4: author name fix
>
> For detect_cpu_features(), we possibly can remove the TWO lines of asm code to 
> copy edx to "result", to avoid the SSE2 and MMX checking, since most of the 
> preparing code is still needed. If we keep these 2 lines, they still give us some 
> possibility to check other edx information from CPUID in the future. It is only 
> executed once, when pixman is loaded.
>
> For the ASM code of 'bk_write' and 'table_48_bytes_bwd', here is the reason 
> "why we use backward copy mode in the patch", from Ling: 
> All read operations after the allocation stage can run speculatively; all write 
> operations run in program order, and if the addresses are different a read may 
> run before an older write operation, otherwise it waits until the write commits. 
> However, the CPU doesn't check every address bit, so a read can fail to 
> recognize a different address even when they are in different pages.
> For example, if rsi is 0xf004 and rdi is 0xe008, the following operations will 
> generate big performance latency.
> 1. movq (%rsi),   %rax
> 2. movq %rax, (%rdi)
> 3. movq 8(%rsi),  %rax
> 4. movq %rax, 8(%rdi)
>
> If %rsi and %rdi really were in the same memory page, there would be a TRUE 
> read-after-write dependence, because instruction 2 writes 0x008 and instruction 
> 3 reads 0x00c; the two addresses overlap partially. Actually they are in 
> different pages and there are no issues, but without checking every address bit 
> the CPU could think they are in the same page, and instruction 3 has to wait for 
> instruction 2 to write its data into the cache from the write buffer, then load 
> the data from the cache; the time the read spends is equal to an mfence instruction.
> We may avoid it by tuning the operation sequence as follows.
> 1. movq 8(%rsi), %rax
> 2. movq %rax, 8(%rdi)
> 3. movq (%rsi),   %rax
> 4. movq %rax, (%rdi)
> Instruction 3 reads 0x004 and instruction 2 writes address 0x010: no 
> dependence. In this patch we first handle small sizes (less than 48 bytes) 
> this way, and the approach helps us improve up to 2X on Atom.
>
> Regards,
> Liu, Xinyun
>
> -Original Message-
> From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com]
> Sent: Sunday, August 22, 2010 1:49 AM
> To: Liu, Xinyun
> Cc: pixman@lists.freedesktop.org; Ma, Ling; Xu, Samuel
> Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8
>
> On Friday 20 August 2010 18:39:47 Liu, Xinyun wrote:
>> Hi Siarhei Siamashka,
>>
>> Here is a new patch, can you review it? Thank you!
> Sure, thanks for the updated patch. Some comments follow.
>
>>  From 9783651899a2763d7fcca2960fc354bd1f541980 Mon Sep 17 00:00:00 2001
>> From: root
> A minor nitpick here. The 'From:' header seems to be wrong, it does not look 
> like a real patch author name.
>
>> +dnl Check for SSSE3
>> +
>> +if test "x$SSSE3_CFLAGS" = "x" ; then
>> +   if test "x$SUNCC" = "xyes"; then
>> +  # SSSE3 is enabled by default in the Sun Studio 64-bit
>> +environment
> Is it a valid statement with the regards to Sun Studio? Looks like it is a 
> direct copy/paste from SSE2. Can we be sure that there is nothing redundant 
> or just plain wrong here?
>
>> +#if defined(__GNUC__) && (__GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 3))

Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-08-27 Thread Xu, Samuel
Hi, Siarhei Siamashka: 
Thanks for quick response!
For the 64 bit detect_cpu_features(), if we ignore HAVE_GETISAX and _MSC_VER, 
it is ok for us to simplify it as in your example in the next update.
For pixman-ssse3.c, maybe we have 2 options:
1) duplicate 6562 lines from pixman-sse2.c to a new 
pixman-ssse3.c in a 1st patch (of course replacing 2 entries with the newly added 
SSSE3 asm optimization), and then add the "merge" mechanism in a later patch.
2) firstly add the "merge" mechanism patch, and then add the new 
pixman-ssse3.c in a later patch, which might be shorter (111 lines)
Does it mean that option 1) is preferred?

We will respond soon after your guidance on how to split the SSSE3 C file.

Samuel


-Original Message-
From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com] 
Sent: Friday, August 27, 2010 6:00 PM
To: Xu, Samuel
Cc: pixman@lists.freedesktop.org; Ma, Ling; Liu, Xinyun
Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

On Friday 27 August 2010 05:59:00 Xu, Samuel wrote:
> Hi Siarhei Siamashka,
> 
> Here is a new patch, can you review it? Thank you!
> It addresses the following suggestions:
> 1: SSSE3 file is split to a new file.

Thanks.

> Compared with duplicating every piece of
> content from the SSE2 file, I added a way to merge the new fast path function 
> pointer table at runtime by CPUID check, to avoid huge code expansion 
> and a bug you pointed out. (Without runtime merging, I would have to 
> duplicate every line from the SSE2 file to the SSSE3 file, which is really not 
> graceful, and redundant.)

Just implementing everything in the same way as done in SSE2 code, you would 
get a proper delegate chain trying SSSE3 fast paths first, then trying SSE2, 
MMX, C fast paths (compositing in one pass) and finally fallback to a slow 
general path code (fetch, combine and store steps done in separate passes).

The runtime merging is not strictly needed, even though it could probably 
improve performance somewhat. But this clearly belongs to a separate patch.

> 2: Copyright information added
> 3: Makefile fix
> 4: author name fix
> 
> For detect_cpu_features(), we possibly can remove the TWO lines of asm 
> code to copy edx to "result", to avoid the SSE2 and MMX checking, since 
> most of the preparing code is still needed. If we keep these 2 lines, they still 
> give us some possibility to check other edx information from CPUID in the 
> future. It is only executed once, when pixman is loaded.

Wouldn't it be just as simple as doing the following in the block of code under 
__GNUC__ and __amd64__ or __x86_64__ ifdefs?

int have_ssse3_support ()
{
uint32_t cpuid_ecx;
__asm__ (
"mov $1, %%eax\n"
"cpuid\n"
"mov %%ecx, %0\n"
: "=r" (cpuid_ecx)
:
: "%rax", "%rbx", "%rcx", "%rdx"
);
return cpuid_ecx & (1 << 9);
}

And we already know that we have MMX and SSE2 on all x86_64 hardware even 
without checking.

MSVC and any compilers other than GNU C are probably not so interesting 
initially. They just should not fail when building pixman.

> For the ASM code of 'bk_write' and 'table_48_bytes_bwd', here is the 
> reason "why we use backward copy mode in the patch", from Ling: 
> All read operations after the allocation stage can run speculatively; all 
> write operations run in program order, and if the addresses are 
> different a read may run before an older write operation, otherwise it waits 
> until the write commits.
> However, the CPU doesn't check every address bit, so a read can fail to 
> recognize a different address even when they are in different pages. For 
> example, if rsi is 0xf004 and rdi is 0xe008, the following operations will 
> generate big
> performance latency. 1. movq (%rsi),  %rax
> 2. movq %rax, (%rdi)
> 3. movq 8(%rsi),  %rax
> 4. movq %rax, 8(%rdi)
> 
> If %rsi and %rdi really were in the same memory page, there would be a TRUE 
> read-after-write dependence, because instruction 2 writes 0x008 and 
> instruction 3 reads 0x00c; the two addresses overlap partially. 
> Actually they are in different pages and there are no issues, but without 
> checking every address bit the CPU could think they are in the same page, 
> and instruction 3 has to wait for instruction 2 to write its data into the 
> cache from the write buffer, then load the data from the cache; the time the 
> read spends is equal to an mfence instruction. We may avoid it by tuning the 
> operation sequence as follows.
> 1. movq 8(%rsi), %rax
> 2. movq %rax, 8(%rdi)
> 3. movq (%rsi),   %rax
> 4. movq %rax, (%rdi)
> Instruction 3 reads 0x004 and instruction 2 writes address 0x010: no 
> dependence. In this patch we first handle small sizes (less than 48 bytes) 
> this way, and the approach helps us improve up to 2X on Atom.

Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-08-27 Thread Siarhei Siamashka
On Friday 27 August 2010 05:59:00 Xu, Samuel wrote:
> Hi Siarhei Siamashka,
> 
> Here is a new patch, can you review it? Thank you!
> It addresses the following suggestions:
> 1: SSSE3 file is split to a new file.

Thanks.

> Compared with duplicating every piece of
> content from the SSE2 file, I added a way to merge the new fast path function
> pointer table at runtime by CPUID check, to avoid huge code expansion and a
> bug you pointed out. (Without runtime merging, I would have to duplicate every
> line from the SSE2 file to the SSSE3 file, which is really not graceful, and
> redundant.)

Just implementing everything in the same way as done in SSE2 code, you would 
get a proper delegate chain trying SSSE3 fast paths first, then trying SSE2, 
MMX, C fast paths (compositing in one pass) and finally fallback to a slow 
general path code (fetch, combine and store steps done in separate passes).

The runtime merging is not strictly needed, even though it could probably 
improve performance somewhat. But this clearly belongs to a separate patch.

> 2: Copyright information added
> 3: Makefile fix
> 4: author name fix
> 
> For detect_cpu_features(), we possibly can remove the TWO lines of asm code to
> copy edx to "result", to avoid the SSE2 and MMX checking, since most of the
> preparing code is still needed. If we keep these 2 lines, they still give us some
> possibility to check other edx information from CPUID in the future. It is
> only executed once, when pixman is loaded.

Wouldn't it be just as simple as doing the following in the block of code under 
__GNUC__ and __amd64__ or __x86_64__ ifdefs?

int have_ssse3_support ()
{
uint32_t cpuid_ecx;
__asm__ (
"mov $1, %%eax\n"
"cpuid\n"
"mov %%ecx, %0\n"
: "=r" (cpuid_ecx)
:
: "%rax", "%rbx", "%rcx", "%rdx"
);
return cpuid_ecx & (1 << 9);
}

And we already know that we have MMX and SSE2 on all x86_64 hardware even 
without checking.

MSVC and any compilers other than GNU C are probably not so interesting 
initially. They just should not fail when building pixman.
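
For what it's worth, GCC >= 4.3 also ships a <cpuid.h> helper, so the same
check can be written without hand-coded asm; a sketch (bit_SSSE3 is defined
there as ECX bit 9 of leaf 1):

#include <cpuid.h>

int
have_ssse3_support (void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if CPUID or the requested leaf is unavailable */
    if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx))
        return 0;

    return (ecx & bit_SSSE3) != 0;
}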

> For the ASM code of 'bk_write' and 'table_48_bytes_bwd', here is the reason
> "why we use backward copy mode in the patch", from Ling: All read
> operations after the allocation stage can run speculatively; all write
> operations run in program order, and if the addresses are different a read
> may run before an older write operation, otherwise it waits until the write
> commits. However, the CPU doesn't check every address bit, so a read can fail
> to recognize a different address even when they are in different pages. For
> example, if rsi is 0xf004 and rdi is 0xe008, the following operations will
> generate big performance latency. 1. movq (%rsi),  %rax
> 2. movq %rax, (%rdi)
> 3. movq 8(%rsi),  %rax
> 4. movq %rax, 8(%rdi)
> 
> If %rsi and %rdi really were in the same memory page, there would be a TRUE
> read-after-write dependence, because instruction 2 writes 0x008 and
> instruction 3 reads 0x00c; the two addresses overlap partially. Actually
> they are in different pages and there are no issues, but without checking every
> address bit the CPU could think they are in the same page, and instruction 3
> has to wait for instruction 2 to write its data into the cache from the write
> buffer, then load the data from the cache; the time the read spends is equal
> to an mfence instruction. We may avoid it by tuning the operation sequence as
> follows.
> 1. movq 8(%rsi), %rax
> 2. movq %rax, 8(%rdi)
> 3. movq (%rsi),   %rax
> 4. movq %rax, (%rdi)
> Instruction 3 reads 0x004 and instruction 2 writes address 0x010: no
> dependence. In this patch we first handle small sizes (less than 48 bytes)
> this way, and the approach helps us improve up to 2X on Atom.

Thanks for the explanation. Some comments describing why it was done this way 
in the code for such non-obvious cases would be generally nice. 

So it is basically a store forwarding aliasing problem which is described in 
"12.3.3.1 Store Forwarding" section from "Intel® 64 and IA-32 Architectures 
Optimization Reference Manual", right?

Wouldn't just the use of MOVD/MOVSS instructions here also solve this problem? 
Store forwarding does not seem to be used for SIMD according to the manual. I 
haven't benchmarked anything yet though.
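
To make the MOVD idea concrete, a sketch with SSE2 intrinsics (illustrative
only and not benchmarked; the helper name is made up, and whether the compiler
really emits MOVD directly to/from memory needs checking in the generated code):

#include <stdint.h>
#include <emmintrin.h>

/* Copy the leading/trailing pixels one at a time through an XMM register
 * with MOVD (_mm_cvtsi32_si128 / _mm_cvtsi128_si32) instead of GPR moves,
 * so the stores should not hit the GPR store forwarding aliasing case
 * described above. */
static void
copy_pixels_movd (uint32_t *dst, const uint32_t *src, int n)
{
    while (n--)
    {
        __m128i px = _mm_cvtsi32_si128 ((int) *src++);
        *dst++ = (uint32_t) _mm_cvtsi128_si32 (px);
    }
}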

Moreover, if you have plans to generalize and extend SSSE3 optimizations to 
accelerate more stuff later, then something like 'src_8888_8888' and 
'add_8888_8888' operations would be the really easy ones. But it would be very 
nice to avoid dumb copy/paste and 3x increase of assembly sources size. For 
this to go smoothly, preferably you would need to separate the loop code itself 
from the pixel processing logic. And in order to make pixel processing more 
simple and consistent, it's better to do all the pixel processing using SIMD
instructions (both inner loop and also leading/trailing pixels). Leading and 
trailing unaligned pixels may be probably handled by using MOVD/MOVSS/MOVHPS or 
other similar instructions to do partial SSE register load/store operations. 
It's not like this 

Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-08-27 Thread Makoto Kato

 Hi, Xu.

> --- /dev/null
> +++ b/pixman/pixman-access-ssse3_x86-64.S
..
> +#if (defined(__amd64__) || defined(__x86_64__) ||defined(_M_AMD64))

why _M_AMD64?  _M_AMD64 macro is for MSVC++ x64 version.  This assembler 
is for GNU as.


Also, if MSVC++ 8.0 (14.00) or later, we should use __cpuid for x64 and 
x86 compatibility.

http://msdn.microsoft.com/ja-jp/library/hskdteyh.aspx


-- Makoto

On 2010/08/27 11:59, Xu, Samuel wrote:

Hi Siarhei Siamashka,

Here is a new patch, can you review it? Thank you!
It addresses the following suggestions:
1: The SSSE3 file is split into a new file. Compared with duplicating every piece 
of content from the SSE2 file, I added a way to merge the new fast path function 
pointer table at runtime by CPUID check, to avoid huge code expansion and a bug 
you pointed out. (Without runtime merging, I would have to duplicate every line 
from the SSE2 file to the SSSE3 file, which is really not graceful, and 
redundant.)
2: Copyright information added
3: Makefile fix
4: author name fix

For detect_cpu_features(), we possibly can remove the TWO lines of asm code to copy edx to 
"result", to avoid the SSE2 and MMX checking, since most of the preparing code is still needed. 
If we keep these 2 lines, they still give us some possibility to check other edx information 
from CPUID in the future. It is only executed once, when pixman is loaded.

For the ASM code of 'bk_write' and 'table_48_bytes_bwd', here is the reason "why we 
use backward copy mode in the patch", from Ling:
All read operations after the allocation stage can run speculatively; all write 
operations run in program order, and if the addresses are different a read may 
run before an older write operation, otherwise it waits until the write commits. 
However, the CPU doesn't check every address bit, so a read can fail to recognize 
a different address even when they are in different pages.
For example, if rsi is 0xf004 and rdi is 0xe008, the following operations will 
generate big performance latency.
1. movq (%rsi), %rax
2. movq %rax,   (%rdi)
3. movq 8(%rsi),%rax
4. movq %rax,   8(%rdi)

If %rsi and %rdi really were in the same memory page, there would be a TRUE 
read-after-write dependence, because instruction 2 writes 0x008 and instruction 3 
reads 0x00c; the two addresses overlap partially. Actually they are in 
different pages and there are no issues, but without checking every address bit the 
CPU could think they are in the same page, and instruction 3 has to wait for 
instruction 2 to write its data into the cache from the write buffer, then load the 
data from the cache; the time the read spends is equal to an mfence instruction.
We may avoid it by tuning the operation sequence as follows.
1. movq 8(%rsi), %rax
2. movq %rax,   8(%rdi)
3. movq (%rsi), %rax
4. movq %rax,   (%rdi)
Instruction 3 reads 0x004 and instruction 2 writes address 0x010: no dependence.
In this patch we first handle small sizes (less than 48 bytes) this way, and the 
approach helps us improve up to 2X on Atom.

Regards,
Liu, Xinyun

-Original Message-
From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com]
Sent: Sunday, August 22, 2010 1:49 AM
To: Liu, Xinyun
Cc: pixman@lists.freedesktop.org; Ma, Ling; Xu, Samuel
Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

On Friday 20 August 2010 18:39:47 Liu, Xinyun wrote:

Hi Siarhei Siamashka,

Here is a new patch, can you review it? Thank you!

Sure, thanks for the updated patch. Some comments follow.


 From 9783651899a2763d7fcca2960fc354bd1f541980 Mon Sep 17 00:00:00 2001
From: root

A minor nitpick here. The 'From:' header seems to be wrong, it does not look 
like a real patch author name.


+dnl Check for SSSE3
+
+if test "x$SSSE3_CFLAGS" = "x" ; then
+   if test "x$SUNCC" = "xyes"; then
+  # SSSE3 is enabled by default in the Sun Studio 64-bit
+environment

Is it a valid statement with regard to Sun Studio? Looks like it is a 
direct copy/paste from SSE2. Can we be sure that there is nothing redundant or 
just plain wrong here?


+#if defined(__GNUC__) && (__GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 3))
+#   if !defined(__amd64__) && !defined(__x86_64__)

Is this 'defined(__amd64__)' check needed here? Looks like it is again a 
copy/paste from SSE2 code, but SSE2 is guaranteed to be supported for x86-64 
systems.


+#  error "Need GCC>= 4.3 for SSSE3 intrinsics on x86"
+#   endif
+#endif



diff --git a/pixman/pixman-access-ssse3_x86-64.S b/pixman/pixman-access-ssse3_x86-64.S
new file mode 100644
index 000..fec93df
--- /dev/null
+++ b/pixman/pixman-access-ssse3_x86-64.S

Copyright notice seems to be still missing in the newly added source files.


+   cmp %dil, %sil
+   jb  L(bk_write)
+   add %rdx, %rsi
+   add %rdx, %rdi
+   BRANCH_TO_JMPTBL_ENTRY (L(table_48bytes_fwd), %rdx, 4)
+L(bk_write):
+   BRANCH_TO_JMPTBL_ENTRY (L(table_48_bytes_bwd), %rdx, 4)

Would it make sense to also purge this 'bk_write' and 'table_48_bytes_bwd' stuff?

Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-08-21 Thread Siarhei Siamashka
On Friday 20 August 2010 18:39:47 Liu, Xinyun wrote:
> Hi Siarhei Siamashka,
> 
> Here is a new patch, can you review it? Thank you!

Sure, thanks for the updated patch. Some comments follow.

> From 9783651899a2763d7fcca2960fc354bd1f541980 Mon Sep 17 00:00:00 2001
> From: root 

A minor nitpick here. The 'From:' header seems to be wrong, it does not look 
like a real patch author name.

> +dnl Check for SSSE3
> +
> +if test "x$SSSE3_CFLAGS" = "x" ; then
> +   if test "x$SUNCC" = "xyes"; then
> +  # SSSE3 is enabled by default in the Sun Studio 64-bit environment

Is it a valid statement with regard to Sun Studio? Looks like it is a 
direct copy/paste from SSE2. Can we be sure that there is nothing redundant or 
just plain wrong here? 

> +#if defined(__GNUC__) && (__GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 3))
> +#   if !defined(__amd64__) && !defined(__x86_64__)

Is this 'defined(__amd64__)' check needed here? Looks like it is again a 
copy/paste from SSE2 code, but SSE2 is guaranteed to be supported for x86-64 
systems.

> +#  error "Need GCC >= 4.3 for SSSE3 intrinsics on x86"
> +#   endif
> +#endif


> diff --git a/pixman/pixman-access-ssse3_x86-64.S b/pixman/pixman-access-ssse3_x86-64.S
> new file mode 100644
> index 000..fec93df
> --- /dev/null
> +++ b/pixman/pixman-access-ssse3_x86-64.S

Copyright notice seems to be still missing in the newly added source files.

> + cmp %dil, %sil
> + jb  L(bk_write)
> + add %rdx, %rsi
> + add %rdx, %rdi
> + BRANCH_TO_JMPTBL_ENTRY (L(table_48bytes_fwd), %rdx, 4)
> +L(bk_write):
> + BRANCH_TO_JMPTBL_ENTRY (L(table_48_bytes_bwd), %rdx, 4)

Would it make sense to also purge this 'bk_write' and 'table_48_bytes_bwd' 
stuff?

> diff --git a/pixman/pixman-cpu.c b/pixman/pixman-cpu.c
> index e4fb1e4..ab2d69a 100644

[...]
 
> +#else /* end dt_cpu32() */
> +
> +static unsigned int
> +detect_cpu_features (void)
> +{
> +unsigned int features = 0;
> +unsigned int result = 0;
> +unsigned int result_c = 0;

This looks like a duplication of the code to make a new 64-bit version of 
'detect_cpu_features' function. Could we assume that all 64-bit processors 
support CPUID instruction (so that the check for CPUID support can be omitted), 
and also support MMX and SSE2? In this case SSSE3 check will be just a tiny 
block of inline assembly using CPUID instruction.

> diff --git a/pixman/pixman-sse2.c b/pixman/pixman-sse2.c
> index 3dd7967..0cfd554 100644
> --- a/pixman/pixman-sse2.c
> +++ b/pixman/pixman-sse2.c
> @@ -3520,6 +3520,56 @@ sse2_composite_over_8888_n_8888 (pixman_implementation_t *imp,
>  /*-
>   * composite_over_8888_n_8888
>   */
> +#if defined(USE_SSSE3) && !defined(_MSC_VER)
> +
> +extern void *composite_src_x888_8888_ssse3_fast_path (uint32_t *dest, uint32_t *src, int32_t count);
> +extern pixman_bool_t choose_ssse3_fast_patch;
> +static void
> +ssse3_composite_src_x888_8888 (pixman_implementation_t *imp,

The addition of 'ssse3_composite_src_x888_8888' into 'pixman-sse2.c' does not 
look like a good idea. Adding a new 'pixman-ssse3.c' file for SSSE3 
optimizations would have been a better choice.

Especially considering the following part of code:

> +#if defined(USE_SSSE3) && !defined(_MSC_VER)
> +PIXMAN_STD_FAST_PATH (SRC, x8r8g8b8, null, a8r8g8b8, ssse3_composite_src_x888_8888),
> +PIXMAN_STD_FAST_PATH (SRC, x8b8g8r8, null, a8b8g8r8, ssse3_composite_src_x888_8888),
> +#else
>  PIXMAN_STD_FAST_PATH (SRC, x8r8g8b8, null, a8r8g8b8, sse2_composite_src_x888_8888),
>  PIXMAN_STD_FAST_PATH (SRC, x8b8g8r8, null, a8b8g8r8, sse2_composite_src_x888_8888),
> +#endif

By using "#if defined(USE_SSSE3)" here, you effectively disable 
'sse2_composite_src_x888_' fast path for the processors which do not have 
SSSE3 support. If the runtime detection of SSSE3 fails, the code fallbacks to C 
implementation and we have no chance to benefit from SSE2 optimizations.

-- 
Best regards,
Siarhei Siamashka




Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-08-21 Thread Siarhei Siamashka
On Friday 20 August 2010 19:36:07 Xu, Samuel wrote:
> We measured performance, and compared with original SSE2 intrinsic enabled
> version(0.19.4), on ATOM, and get following findings using 480P flash
> H.264 video playing workload:
> 1) sse2_composite_src_x888_8888()'s cycle reduced 67%. This function's total
> cycle ratio over whole system reduced from 5.6% to 1.9%

This is not directly related to your pixman patch, but it looks like the right 
place to fix the performance problem in your flash use case is the YUV->RGB 
conversion. Setting the alpha channel to 0xFF there would be the most efficient, and 
the src_x888_8888 operation could be totally eliminated.
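
The idea in concrete form might look like the sketch below (hypothetical
converter names, plain integer BT.601 coefficients; this is neither pixman nor
flash code):

#include <stdint.h>

static uint32_t
yuv_to_xrgb (int y, int u, int v) /* BT.601, integer form */
{
    int c = y - 16, d = u - 128, e = v - 128;
    int r = (298 * c + 409 * e + 128) >> 8;
    int g = (298 * c - 100 * d - 208 * e + 128) >> 8;
    int b = (298 * c + 516 * d + 128) >> 8;

    r = r < 0 ? 0 : r > 255 ? 255 : r;
    g = g < 0 ? 0 : g > 255 ? 255 : g;
    b = b < 0 ? 0 : b > 255 ? 255 : b;
    return ((uint32_t) r << 16) | ((uint32_t) g << 8) | (uint32_t) b;
}

/* Write a8r8g8b8 directly: OR in the alpha byte during conversion, so no
 * separate src_x888_8888 pass over the frame is needed afterwards. */
static void
yuv420_row_to_a8r8g8b8 (uint32_t *dst, const uint8_t *y,
                        const uint8_t *u, const uint8_t *v, int w)
{
    int i;

    for (i = 0; i < w; i++)
        dst[i] = 0xff000000u | yuv_to_xrgb (y[i], u[i / 2], v[i / 2]);
}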

That said, improving pixman performance in general is still welcome and is a 
nice thing to have (for the other common use cases at least).

I see that you dropped the ssse3 x8r8g8b8 fetcher optimization in your last patch 
and added an ssse3 src_x888_8888 fast path instead. It's a bit sad, because other 
operations like over_8888_x888 with bilinear scaling could also benefit from 
the optimized fetcher for example. On the other hand, the lack of clarity 
regarding how to add SIMD optimized fetchers is avoided this way ;)

> 2) whole system's C0 percentage reduced from 68.0% to 62.6%
> Maybe it is not " dramatically", while we are glad to see those gain on
> both perf and power.

A performance gain in the 4-5% ballpark looks like a major improvement to me. 

-- 
Best regards,
Siarhei Siamashka




Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-08-20 Thread Xu, Samuel
Hi, Soeren Sandmann and Siarhei Siamashka:
This patch is still for #20709. 
We are already aware that some SSE2 intrinsics are trying to address a 
similar performance issue in a recent new commit. But this intrinsic 
optimization for x8r8g8b8 ops is not good enough, and sometimes even worse, on 
ATOM. 
In this patch, we provide SSSE3 optimized code, which co-exists with the 
current SSE2 intrinsics and takes effect when the expected SSSE3 CPUID is detected.
Considering Siarhei Siamashka's suggestions, here are some enhancements 
compared with the previous patch:
1) 32 bit and 64 bit assembly code of a highly optimized memory move + 
logical AND, using SSSE3
2) SSSE3 runtime detection
3) The executable stack issue is avoided
4) shorter asm code (almost half of the previous patch)

We measured performance, and compared with original SSE2 intrinsic enabled 
version(0.19.4), on ATOM, and get following findings using 480P flash H.264 
video playing workload:
1) sse2_composite_src_x888_8888()'s cycle reduced 67%. This function's 
total cycle ratio over whole system reduced from 5.6% to 1.9%
2) whole system's C0 percentage reduced from 68.0% to 62.6%
Maybe it is not "dramatic", but we are glad to see those gains on both 
perf and power.

BTW, we built and ran make check on the following 3 systems:
1: 32 bit ATOM system, with SSSE3
2: 64 bit ATOM system, with SSSE3
3: 64 bit system, without SSSE3

Thanks!
Samuel

-Original Message-
From: Liu, Xinyun [mailto:xinyun...@gmail.com] 
Sent: Friday, August 20, 2010 11:40 PM
To: Siarhei Siamashka; pixman@lists.freedesktop.org
Cc: Ma, Ling; Xu, Samuel
Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Hi Siarhei Siamashka,

Here is a new patch, can you review it? Thank you!

With this patch, oprofile says that the performance is increased dramatically for 
Atom.

Samuel and Ling will provide detailed data.

Regards,
Liu, Xinyun


Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-08-20 Thread Liu, Xinyun
Hi Siarhei Siamashka,

Here is a new patch, can you review it? Thank you!

With this patch, oprofile says that the performance is increased
dramatically for Atom.

Samuel and Ling will provide detailed data.

Regards,
Liu, Xinyun


0001-Add-ssse3_composite_src_x888_8888.patch
Description: Binary data


Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-08-19 Thread Siarhei Siamashka
On Tuesday 17 August 2010 10:52:43 Xu, Samuel wrote:
> We'd like to provide a new patch with the following enhancements soon:
> 1) Add 64 bit asm code specifically for 64 bit, which will co-exist with 32
> bit version 
> 2) CPUID dynamic check in pixman-cpu.c and pixman-access.c
> 3) Makefile fixing
> 4) Stack executable fixing in both 32/64 asm files.
> 5) follow your comments to make asm code shorter

Sounds good. The way these optimizations get hooked into 'pixman-access.c' 
may need some extra tweaks in the end (they need to be approved by Soeren 
Sandmann), but the rest of the changes seem quite reasonable to me.

I still wonder, would a common unified 32/64 bit SSSE3 assembly source be 
possible? Calling conventions are different, but the rest of code should be 
more or less the same.

> Any missing?

Also a proper copyright notice in the beginning of new files is needed.
The text of the current recommended canonical pixman license can be found here: 
 
http://cgit.freedesktop.org/pixman/tree/COPYING

> Two more questions:
> "Maybe the intention was good, but the end result just fails to build
> on my 64-bit Intel Atom netbook. And pixman "make check" dies with
> "Illegal instruction" on a 32-bit Intel Pentium-M laptop. So it's a clear
> indication that you did something wrong and the patch is unacceptable as
> is."
> 
> Q1: Could you past the 64-bit ATOM build failure? Is it caused by
> ./configure to generate make file, or build error, or link error?

It's just a normal x86-64 build failure related to pixman-access-ssse3.S:

$ git clone git://anongit.freedesktop.org/pixman
$ cd pixman
$ patch -p1 < first_revision_of_ssse3_patch.diff

$ ./autogen.sh
[...]
checking whether to use SSSE3 intrinsics... yes
[...]

$ make
[...]
  CPPAS  pixman-access-ssse3.lo
pixman-access-ssse3.S: Assembler messages:
pixman-access-ssse3.S:118: Error: suffix or operands invalid for `push'
pixman-access-ssse3.S:118: Error: bad register expression
pixman-access-ssse3.S:118: Error: register save offset not a multiple of 8
pixman-access-ssse3.S:130: Error: suffix or operands invalid for `jmp'
pixman-access-ssse3.S:132: Error: suffix or operands invalid for `jmp'
pixman-access-ssse3.S:138: Error: suffix or operands invalid for `push'
pixman-access-ssse3.S:138: Error: bad register expression
pixman-access-ssse3.S:143: Error: suffix or operands invalid for `push'
pixman-access-ssse3.S:143: Error: bad register expression
[...]

> Q2:
> Could you show us how to do "make check"?

Same as above, but also run 'make check' after 'make'.

To be sure that the tests are working properly and the new code is actually 
used, you may deliberately add a bug into your new code for testing purposes 
and verify that 'make check' detects it.

It may be also convenient to configure pixman with '--disable-shared' option to 
make debugging easier.

-- 
Best regards,
Siarhei Siamashka




Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-08-17 Thread Xu, Samuel
Hi Siarhei Siamashka:
We'd like to provide a new patch with the following enhancements soon:
1) Add 64 bit asm code specifically for 64 bit, which will co-exist with 32 bit 
version
2) CPUID dynamic check in pixman-cpu.c and pixman-access.c
3) Makefile fixing
4) Stack executable fixing in both 32/64 asm files.
5) follow your comments to make asm code shorter
Any missing?

Two more questions:
"Maybe the intention was good, but the end result just fails to build on my 
64-bit Intel Atom netbook. And pixman "make check" dies with "Illegal 
instruction" on a 32-bit Intel Pentium-M laptop. So it's a clear indication 
that you did something wrong and the patch is unacceptable as is."

Q1: Could you past the 64-bit ATOM build failure? Is it caused by ./configure 
to generate make file, or build error, or link error?
Q2: Could you show us how to do "make check"?

Samuel


-Original Message-
From: Ma, Ling 
Sent: Tuesday, August 17, 2010 3:22 PM
To: Siarhei Siamashka; Xu, Samuel
Cc: pixman@lists.freedesktop.org; Liu, Xinyun
Subject: RE: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Hi Siarhei Siamashka

> I did not question the fact that PALIGNR instruction must be useful 
> for something. There must have been some reason why it was introduced 
> :) But real benchmark numbers are always interesting, like the ones 
> you provided for
> SSSE3 against original C code comparison. Just to be sure that these 
> more than 1000 lines of code are really justified (at least for this 
> particular use).
We attached data report about movups + movaps Vs movaps + palignr. 

The report shows movaps + palignr is better because cache-line-crossing loads 
cause big latency on Atom and Core2.

> As I already mentioned, the code in 'pixman-access-ssse3.S' is highly 
> redundant just for 'fetch_scanline_x8r8g8b8' implementation. Due to 
> processing 32-bit data, there are only 4 possible cases of relative 
> source/destination alignment, but the whole set of 16 alignment cases 
> is implemented. In addition to other benefits (like improved sources 
> readability), reducing code size is good for saving space in the CPU 
> instruction cache and this typically improves performance.

The original patch can tolerate any type of offset for src and dest. 
Yes, I agree: if the offsets of src and dest must be 4-byte aligned, your 
suggestion will reduce the code size significantly.

For the other questions, Samuel or Xinyun will provide further answers.

Best Regards
Ling


> > 2) For #ifdefs vs. dynamically checking SSSE3, the patch exactly 
> > mimics the "SSE2" support in the current pixman code (#ifdefs), which 
> > checks the build system when generating the makefile and applies
> > USE_SSSE3 to the later build.
> 
> Maybe the intention was good, but the end result just fails to build 
> on my 64-bit Intel Atom netbook. And pixman "make check" dies with 
> "Illegal instruction" on a 32-bit Intel Pentium-M laptop. So it's a 
> clear indication that you did something wrong and the patch is unacceptable 
> as is.
> 
> > Of course prefix of CPUID can achieve better compatibility. It is ok 
> > for us to add this kind of CPUID prefix. I will appreciate directly 
> > comments on shape of this patch, using SSSE3, e.g. some code sample 
> > of expected pixman-cpu.c and
> pixman-access.c.
> 
> SSSE3 is not much different from MMX and SSE2 and can be added in a 
> similar way to pixman-cpu.c
> 
> Regarding SIMD optimized fetchers. Pixman has none at the moment, so 
> there is no code to be used as a reference to see how it should be 
> done "right" (and that's partially the reason why bug#20709 is still 
> unresolved).
> There was an older discussion about SIMD fetchers:
> http://comments.gmane.org/gmane.comp.lib.cairo/18342
> 
> > 3) For the executable stack caused by the assembly file, it should be 
> > possible to solve by adding ".section .note.GNU-stack .previous", according to 
> > https://www.redhat.com/archives/fedora-devel-list/2005-March/msg00460.
> > html . In fact, the same solution already exists in the current 
> > pixman-arm-neon-asm.S inside the pixman code.
> 
> Yes sure. I did not say that it can't be solved :) It's just better to 
> address this particular problem in the next revision of ssse3 patch.
> 
> --
> Best regards,
> Siarhei Siamashka


Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-08-16 Thread Siarhei Siamashka
On Monday 16 August 2010 11:24:44 Xu, Samuel wrote:
>   Thanks for the kind comments! It is very nice that bug #20709 also
> emphasizes a similar performance issue.

Well, having bugs unresolved for such a long time is not so nice. Also see some 
more comments below.

> So, let's discuss how to make this patch
> better. Here are some answers from me, and Ma Ling will update comments on
> the 32/64 bit question and assembly details:
> 
> 1) For MOVAPS+PALIGNR from SSSE3, referring to chapter 12 of the <Intel 64 and IA-32 Architectures Optimization Reference Manual> at
> http://www.intel.com/Assets/PDF/manual/248966.pdf, we can know
> MOVAPS+PALIGNR is preferred in ATOM: "Assembly/Compiler Coding Rule 13.
> (MH impact, M generality) For Intel Atom processors, prefer a sequence
> MOVAPS+PALIGN over MOVUPS. Similarly, MOVDQA+PALIGNR is preferred over
> MOVDQU." As far as I know, this code pattern is friendly for ATOM and
> won't harm Core-i7.

I did not question the fact that the PALIGNR instruction must be useful for
something. There must have been some reason why it was introduced :) But real
benchmark numbers are always interesting, like the ones you provided for the
SSSE3 versus original C code comparison. Just to be sure that these more than
1000 lines of code are really justified (at least for this particular use).
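For reference, the pattern in question combines two aligned loads with
PALIGNR instead of issuing a single unaligned load. A minimal sketch (GNU
assembler; assumes a 16-byte aligned base address in %eax and a fixed 4-byte
misalignment, with the register choice being illustrative, not taken from
the patch):

    movdqa    (%eax), %xmm1     /* aligned load of bytes 0..15  */
    movdqa  16(%eax), %xmm2     /* aligned load of bytes 16..31 */
    palignr $4, %xmm1, %xmm2    /* xmm2 = bytes 4..19, i.e. what a
                                   MOVDQU load from 4(%eax) would fetch */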

As I already mentioned, the code in 'pixman-access-ssse3.S' is highly redundant
just for the 'fetch_scanline_x8r8g8b8' implementation. Because it processes
32-bit data, there are only 4 possible cases of relative source/destination
alignment, but the whole set of 16 alignment cases is implemented. In addition
to other benefits (like improved source readability), reducing code size is
good for saving space in the CPU instruction cache, and this typically
improves performance.
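To spell out the arithmetic: both pointers are 4-byte aligned, so their
relative misalignment within a 16-byte block can only take four values. A
small C sketch of the idea (names are illustrative, not pixman's):

    #include <stdint.h>

    /* The PALIGNR shift needed is the relative misalignment of the two
       buffers within a 16-byte block. For uint32_t pointers it is always
       a multiple of 4, so only the values 0, 4, 8 and 12 can occur. */
    static unsigned int
    relative_shift (const uint32_t *src, const uint32_t *dst)
    {
        return (unsigned int) (((uintptr_t) src - (uintptr_t) dst) & 15);
    }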

> 2) For #ifdefs vs. dynamically checking SSSE3, the
> patch exactly mimics the "SSE2" support in the current pixman code
> (#ifdefs), which checks the build system when the makefiles are
> generated and applies USE_SSSE3 to the later build.

Maybe the intention was good, but the end result just fails to build on my
64-bit Intel Atom netbook. And pixman "make check" dies with "Illegal 
instruction" on a 32-bit Intel Pentium-M laptop. So it's a clear indication 
that you did something wrong and the patch is unacceptable as is.

> Of course a CPUID check can achieve
> better compatibility, and it is OK for us to add such a check. I would
> appreciate direct comments on the shape of this patch using SSSE3,
> e.g. some code samples of the expected pixman-cpu.c and pixman-access.c.

SSSE3 is not much different from MMX and SSE2 and can be added in a similar way 
to pixman-cpu.c
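As a rough illustration of what the runtime check could look like (a sketch
using GCC's <cpuid.h>; the function name is hypothetical, and pixman-cpu.c
structures its detection differently):

    #include <cpuid.h>

    /* SSSE3 support is reported in CPUID leaf 1, ECX bit 9. */
    static int
    detect_ssse3 (void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx))
            return 0;

        return (ecx & (1 << 9)) != 0;
    }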

Regarding SIMD optimized fetchers. Pixman has none at the moment, so there is 
no code to be used as a reference to see how it should be done "right" (and 
that's partially the reason why bug#20709 is still unresolved).
There was an older discussion about SIMD fetchers: 
http://comments.gmane.org/gmane.comp.lib.cairo/18342 

> 3) For the executable stack caused by the assembly file, it should be
> solvable by adding ".section .note.GNU-stack .previous", according to
> https://www.redhat.com/archives/fedora-devel-list/2005-March/msg00460.html.
> In fact, the same solution already exists in the current
> pixman-arm-neon-asm.S inside the pixman code.

Yes sure. I did not say that it can't be solved :) It's just better to address
this particular problem in the next revision of the SSSE3 patch.
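For the record, the fix is an empty .note.GNU-stack section that tells the
GNU linker the object needs no executable stack; on x86 it would look roughly
like this at the end of the .S file (a sketch, not the exact line used by
pixman-arm-neon-asm.S):

    /* No "x" flag: this object does not require an executable stack. */
    .section .note.GNU-stack, "", @progbits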

-- 
Best regards,
Siarhei Siamashka


___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-08-16 Thread Xu, Samuel
Hi, Siamashka:
Thanks for the kind comments! It is very nice that bug#20709 also
emphasizes a similar performance issue.
So, let's discuss how to make this patch better. Here are some answers
from me, and Ma Ling will follow up with comments on the 32/64-bit question
and the assembly details:

1) For MOVAPS+PALIGNR from SSSE3, referring to chapter 12 of the <Intel 64
and IA-32 Architectures Optimization Reference Manual> at
http://www.intel.com/Assets/PDF/manual/248966.pdf, we can see that
MOVAPS+PALIGNR is preferred on Atom: "Assembly/Compiler Coding Rule 13. (MH
impact, M generality) For Intel Atom processors, prefer a sequence
MOVAPS+PALIGN over MOVUPS. Similarly, MOVDQA+PALIGNR is preferred over
MOVDQU." As far as I know, this code pattern is friendly for Atom and won't
harm Core i7.
2) For #ifdefs vs. dynamically checking SSSE3, the patch exactly mimics the
"SSE2" support in the current pixman code (#ifdefs), which checks the build
system when the makefiles are generated and applies USE_SSSE3 to the later
build. Of course a CPUID check can achieve better compatibility, and it is OK
for us to add such a check. I would appreciate direct comments on the shape
of this patch using SSSE3, e.g. some code samples of the expected
pixman-cpu.c and pixman-access.c.
3) For the executable stack caused by the assembly file, it should be
solvable by adding ".section .note.GNU-stack .previous", according to
https://www.redhat.com/archives/fedora-devel-list/2005-March/msg00460.html.
In fact, the same solution already exists in the current
pixman-arm-neon-asm.S inside the pixman code.

Thanks!
Samuel

-Original Message-
From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com] 
Sent: Saturday, August 14, 2010 1:54 AM
To: pixman@lists.freedesktop.org
Cc: Liu, Xinyun; Ma, Ling; Xu, Samuel
Subject: Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

On Wednesday 11 August 2010 09:00:54 Liu Xinyun wrote:
> Hi,
> 
> pixman-access.c: fetch_scanline_x8r8g8b8() is mainly memcpy and "or"
> operations. With ssse3_memcpy, the performance is increased a little.
> 
> Reference: http://bugs.meego.com/show_bug.cgi?id=5012
> 
> Quote:
> > After the optimization, we can see the following speedups:
> > 1. fetch_scanline_x8r8g8b8()'s cycle count is reduced by 84% during
> > the clips run. It now consumes 0.8% of the whole system, compared
> > with 2.3% without the optimization.
> > 2. System C0 residency is reduced by around 1.5%.
> > 3. Global CPU utilization is reduced by around 1.5% of the whole system.
> 
> Here is patch, please review it, thanks!

Thanks for the patch. This performance problem has been known for some time 
already and there is a bug about it (with some useful comments):
https://bugs.freedesktop.org/show_bug.cgi?id=20709
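(For context: in plain C the operation being optimized boils down to copying
each pixel while forcing the unused byte to 0xff, roughly as in the sketch
below; the function name is illustrative, not pixman's internal one.)

    #include <stdint.h>

    /* x8r8g8b8 -> a8r8g8b8 fetch: copy each 32-bit pixel and set the
       undefined high byte to 0xff, i.e. fully opaque alpha. */
    static void
    fetch_x8r8g8b8 (uint32_t *dst, const uint32_t *src, int width)
    {
        int i;

        for (i = 0; i < width; i++)
            dst[i] = src[i] | 0xff000000u;
    }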

Out of curiosity, how much of the performance gain is provided by using
MOVAPS+PALIGNR from SSSE3 over just a plain SSE2 implementation via MOVUPS
on Intel Atom? Is it also beneficial on other processors supporting SSSE3
such as Intel Core i7?

That said, I surely appreciate the intent of getting the absolutely best 
performance whenever it is possible. I also tried to find the best possible 
memory access pattern myself for ARM NEON on OMAP3430 SoC, and then generalize 
the code to support many types of different compositing operations (in 
'pixman-arm-neon-asm.*').

Now back to your patch. It is not a complete review; I just wanted to comment
on a few things which stick out at the moment.

As I see, the biggest problems in it are:

1. This patch introduces pixman build failure on 64-bit systems, because 
'pixman-access-ssse3.S' is not 64-bit aware and makes a lot of 32-bit 
assumptions.

2. Runtime detection of SSSE3 should be used instead of #ifdefs in the code. 
Right now, applying this patch will just make default pixman build unusable on 
the processors which do not support SSSE3.

3. Linking 'pixman-access-ssse3.o' object file with the others enforces the use 
of executable stack, which is a bad thing. More details can be found here for 
example:
http://www.gentoo.org/proj/en/hardened/gnu-stack.xml

Less important things spotted by reading the code are below.

> From 30514afe4d0af3a476ec8dcea494f7e216f0fc9d Mon Sep 17 00:00:00 2001
> From: Liu Xinyun 
> Date: Wed, 11 Aug 2010 16:31:47 +0800
> Subject: [PATCH] Optimize fetch_scanline_x8r8g8b8 with SSSE3 
> instruction
> 
> Signed-off-by: Liu Xinyun 
> Signed-off-by: Xu, Samuel 
> Signed-off-by: Ma, ling 
> ---
>  configure.ac                 |   56 +++
>  pixman/Makefile.am           |   14 +
>  pixman/pixman-access-ssse3.S | 1119 ++++++++++++++++++++++++++++++++++
>  pixman/pixman-access.c       |    6 +
>  4 files changed, 1195 insertions(+), 0 deletions(-)
>  create mode 100644 pixman/pixman-access-ssse3.S

[...]
 
> + .section .text.ssse3,"ax",@progbits
> +ENTRY (MEMCPY_OR)
> + ENTRANCE
> + movl    LEN(%esp), %ecx

Re: [Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

2010-08-13 Thread Siarhei Siamashka
On Wednesday 11 August 2010 09:00:54 Liu Xinyun wrote:
> Hi,
> 
> pixman-access.c: fetch_scanline_x8r8g8b8() is mainly memcpy and "or"
> operations. With ssse3_memcpy, the performance is increased a little.
> 
> Reference: http://bugs.meego.com/show_bug.cgi?id=5012
> 
> Quote:
> > After the optimization, we can see the following speedups:
> > 1. fetch_scanline_x8r8g8b8()'s cycle count is reduced by 84% during
> > the clips run. It now consumes 0.8% of the whole system, compared
> > with 2.3% without the optimization.
> > 2. System C0 residency is reduced by around 1.5%.
> > 3. Global CPU utilization is reduced by around 1.5% of the whole system.
> 
> Here is patch, please review it, thanks!

Thanks for the patch. This performance problem has been known for some time
already and there is a bug about it (with some useful comments):
https://bugs.freedesktop.org/show_bug.cgi?id=20709

Out of curiosity, how much of the performance gain is provided by using 
MOVAPS+PALIGNR from SSSE3 over just a plain SSE2 implementation via MOVUPS
on Intel Atom? Is it also beneficial on other processors supporting SSSE3
such as Intel Core i7?

That said, I surely appreciate the intent of getting the absolutely best
performance whenever it is possible. I also tried to find the best possible
memory access pattern myself for ARM NEON on OMAP3430 SoC, and then generalize
the code to support many types of different compositing operations (in
'pixman-arm-neon-asm.*').

Now back to your patch. It is not a complete review; I just wanted to comment
on a few things which stick out at the moment.

As I see, the biggest problems in it are:

1. This patch introduces pixman build failure on 64-bit systems, because
'pixman-access-ssse3.S' is not 64-bit aware and makes a lot of 32-bit 
assumptions.

2. Runtime detection of SSSE3 should be used instead of #ifdefs in the code. 
Right now, applying this patch will just make default pixman build unusable on 
the processors which do not support SSSE3.

3. Linking 'pixman-access-ssse3.o' object file with the others enforces the
use of executable stack, which is a bad thing. More details can be found here 
for example:
http://www.gentoo.org/proj/en/hardened/gnu-stack.xml

Less important things spotted by reading the code are below.

> From 30514afe4d0af3a476ec8dcea494f7e216f0fc9d Mon Sep 17 00:00:00 2001
> From: Liu Xinyun 
> Date: Wed, 11 Aug 2010 16:31:47 +0800
> Subject: [PATCH] Optimize fetch_scanline_x8r8g8b8 with SSSE3 instruction
> 
> Signed-off-by: Liu Xinyun 
> Signed-off-by: Xu, Samuel 
> Signed-off-by: Ma, ling 
> ---
>  configure.ac                 |   56 +++
>  pixman/Makefile.am           |   14 +
>  pixman/pixman-access-ssse3.S | 1119 ++++++++++++++++++++++++++++++++++
>  pixman/pixman-access.c       |    6 +
>  4 files changed, 1195 insertions(+), 0 deletions(-)
>  create mode 100644 pixman/pixman-access-ssse3.S

[...]
 
> + .section .text.ssse3,"ax",@progbits
> +ENTRY (MEMCPY_OR)
> + ENTRANCE
> + movl    LEN(%esp), %ecx
> + movl    SRC(%esp), %eax
> + movl    DEST(%esp), %edx
> +
> + cmp $48, %ecx
> + jae L(48bytesormore)
> +
> + cmp %dl, %al
> + jb  L(bk_write)
> + add %ecx, %edx
> + add %ecx, %eax
> + BRANCH_TO_JMPTBL_ENTRY (L(table_48bytes_fwd), %ecx, 4)
> +L(bk_write):
> + BRANCH_TO_JMPTBL_ENTRY (L(table_48_bytes_bwd), %ecx, 4)

Backwards copy is not strictly needed for 'fetch_scanline_x8r8g8b8' at the 
moment. And probably will never be because scanlines get fetched into a 
temporary buffer.

> +
> + ALIGN (4)
> +/* ECX > 32 and EDX is 4 byte aligned.  */
> +L(48bytesormore):
> + movdqu  (%eax), %xmm0
> + PUSH (%edi)
> + mov $0xff000000, %edi
> + movd    %edi, %xmm6
> + movl    %edx, %edi
> + and $-16, %edx
> + PUSH (%esi)
> + add $16, %edx
> + movl    %edi, %esi
> + sub %edx, %edi
> + add %edi, %ecx
> + sub %edi, %eax
> +
> + mov %esi, %edi
> + pshufd  $0, %xmm6, %xmm6

> + and $3, %edi
> + por %xmm6, %xmm0
> + jz  L(aligned4bytes)
> + cmp $3, %edi
> + psrldq  $1, %xmm6
> + jz  L(aligned4bytes)
> + cmp $2, %edi
> + psrldq  $1, %xmm6
> + jz  L(aligned4bytes)
> + psrldq  $1, %xmm6

Is there any way for the value in the EDI register to be anything but a
multiple of 4 here? Both source and destination are guaranteed to be 4-byte
aligned because they are pointers to uint32_t.

This seems to be a bit redundant here (it looks like a leftover from the
memcpy/memmove implementation which was used as a template for this code). The
negative impact is a slight runtime overhead (especially for short scanlines)
and bigger code size.

[...]

> L(shl_1):

[...]

> L(shl_1_loop):

[...]

> L(shl_2):

[...]

> L(shl_2_loop):

[...]

The rest of the code contains a lot of repeated patterns (the main loop is
replicated 16 times for different alignments). IMHO they could be simplified
a lot by using macros, making the code significantly smaller.
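A rough sketch of that kind of macro (GNU assembler .macro syntax; the
register roles are assumed from the code quoted above, and the setup and
tail handling are left out, so treat it as an illustration only):

    .macro SHL_LOOP shift
    .Lshl_\shift\()_loop:
        movdqa  16(%eax), %xmm2        /* next aligned source block */
        movdqa  %xmm2, %xmm3           /* keep a copy for the next round */
        palignr $\shift, %xmm1, %xmm2  /* recombine the unaligned bytes */
        por     %xmm6, %xmm2           /* force the alpha byte to 0xff */
        movdqa  %xmm2, (%edx)          /* aligned 16-byte store */
        movdqa  %xmm3, %xmm1
        add     $16, %eax
        add     $16, %edx
        sub     $16, %ecx
        jnz     .Lshl_\shift\()_loop
    .endm

        SHL_LOOP 4
        SHL_LOOP 8
        SHL_LOOP 12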