[Freerdp-devel] Update on SSE2 for RemoteFX

2011-06-15 Thread S. Erisman
I finished adding SSE2 optimizations for the Inverse DWT decoding 
routines this evening.


Here are the current performance numbers from my Atom D510 test system:

Without SSE:
                                             |-------------------|
                  PROFILER                   |  elapsed seconds  |
|-------------------------------|------------|--------|----------|
| code section                  | iterations |  total |     avg. |
|-------------------------------|------------|--------|----------|
| rfx_decode_rgb                |      57385 |  54.53 | 0.000950 |
| rfx_decode_component          |     172155 |  42.12 | 0.000245 |
| rfx_rlgr_decode               |     172155 |  10.56 | 0.000061 |
| rfx_differential_decode       |     172155 |   0.24 | 0.000001 |
| rfx_quantization_decode       |     172155 |   3.98 | 0.000023 |
| rfx_dwt_2d_decode             |     172155 |  26.25 | 0.000152 |
| rfx_decode_YCbCr_to_RGB       |      57385 |  10.26 | 0.000179 |
|-------------------------------|------------|--------|----------|

With SSE:
                                             |-------------------|
                  PROFILER                   |  elapsed seconds  |
|-------------------------------|------------|--------|----------|
| code section                  | iterations |  total |     avg. |
|-------------------------------|------------|--------|----------|
| rfx_decode_rgb                |      47871 |  20.00 | 0.000418 |
| rfx_decode_component          |     143613 |  17.01 | 0.000118 |
| rfx_rlgr_decode               |     143613 |  12.23 | 0.000085 |
| rfx_differential_decode       |     143613 |   0.15 | 0.000001 |
| rfx_quantization_decode_SSE2  |     143613 |   0.73 | 0.000005 |
| rfx_dwt_2d_decode_SSE2        |     143613 |   3.06 | 0.000021 |
| rfx_decode_YCbCr_to_RGB_SSE2  |      47871 |   1.02 | 0.000021 |
|-------------------------------|------------|--------|----------|

As you can see, we are currently getting a little more than 100%
performance gain overall by using SSE.  It is noticeably faster and more
responsive in use as well.  Looking at just the methods that were
converted (SSE vs. non-SSE), we are getting a 500% improvement.
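
As a rough check on these figures, the per-call speedups can be recomputed
from the "iterations" and "total" columns of the two profiler runs.  The
snippet below is only an illustrative sketch: the function names
(avg_seconds, speedup, print_speedups) are made up for this example and are
not part of FreeRDP.

```c
#include <stdio.h>

/* Average seconds per call for one profiler row. */
static double avg_seconds(double total_seconds, long iterations)
{
    return total_seconds / (double)iterations;
}

/* How many times faster the SSE run is per call (2.0 means a 100% gain). */
static double speedup(double base_total, long base_iters,
                      double sse_total, long sse_iters)
{
    return avg_seconds(base_total, base_iters) /
           avg_seconds(sse_total, sse_iters);
}

/* Print the per-call speedup for the whole decode and each converted routine. */
static void print_speedups(void)
{
    printf("rfx_decode_rgb           %.2fx\n",
           speedup(54.53, 57385, 20.00, 47871));
    printf("rfx_quantization_decode  %.2fx\n",
           speedup(3.98, 172155, 0.73, 143613));
    printf("rfx_dwt_2d_decode        %.2fx\n",
           speedup(26.25, 172155, 3.06, 143613));
    printf("rfx_decode_YCbCr_to_RGB  %.2fx\n",
           speedup(10.26, 57385, 1.02, 47871));
}
```

Calling print_speedups() from a main() should give roughly 2.3x for the
whole decode and somewhere between about 4.5x and 8.4x for the three
converted routines.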


Running the numbers through a calculation (accounting for some of these 
methods being called more than others) gives this break-down:

61.00%  rlgr
0.72%   diff
3.59%   quant (sse)
15.07%  dwt (sse)
5.02%   ycbcr (sse)
14.59%  other
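
The break-down above can be reproduced along these lines.  This is a sketch
under two assumptions: that each average is rounded to whole microseconds,
and that the per-component routines run three times per rfx_decode_rgb call
(the iteration counts are exactly 3:1).  The helper names are hypothetical.

```c
#include <stdio.h>

/* Average time per call, rounded to whole microseconds. */
static long avg_us(double total_seconds, long iterations)
{
    return (long)(total_seconds / (double)iterations * 1e6 + 0.5);
}

/* Percentage of one rfx_decode_rgb call spent in a routine that is
 * invoked `calls` times per rfx_decode_rgb call. */
static double share_percent(long routine_us, int calls, long frame_us)
{
    return 100.0 * (double)(routine_us * calls) / (double)frame_us;
}

/* Print the break-down using the "With SSE" profiler rows. */
static void print_breakdown(void)
{
    const long frame = avg_us(20.00, 47871);   /* rfx_decode_rgb */
    const long rlgr  = avg_us(12.23, 143613);  /* rfx_rlgr_decode */
    const long diff  = avg_us(0.15, 143613);   /* rfx_differential_decode */
    const long quant = avg_us(0.73, 143613);   /* rfx_quantization_decode_SSE2 */
    const long dwt   = avg_us(3.06, 143613);   /* rfx_dwt_2d_decode_SSE2 */
    const long ycbcr = avg_us(1.02, 47871);    /* rfx_decode_YCbCr_to_RGB_SSE2 */

    /* Time accounted for by the profiled routines in one call. */
    const long known = 3 * (rlgr + diff + quant + dwt) + ycbcr;

    printf("%6.2f%%  rlgr\n",        share_percent(rlgr, 3, frame));
    printf("%6.2f%%  diff\n",        share_percent(diff, 3, frame));
    printf("%6.2f%%  quant (sse)\n", share_percent(quant, 3, frame));
    printf("%6.2f%%  dwt (sse)\n",   share_percent(dwt, 3, frame));
    printf("%6.2f%%  ycbcr (sse)\n", share_percent(ycbcr, 1, frame));
    printf("%6.2f%%  other\n",       share_percent(frame - known, 1, frame));
}
```

The 'other' line is simply whatever is left of the rfx_decode_rgb average
after subtracting the profiled routines.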


So, the one large remaining non-SSE method (rfx_rlgr_decode) currently
accounts for about 61% (85*3 / 418) of the total RemoteFX processing
time.  This method might be hard to optimize using SSE, as it appears
to be more stream/logic based than loop/calculation based.  It is
definitely worth taking a further look at, however, to see if there are
other optimizations that can be made.
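
To illustrate the distinction, here is the kind of "loop/calculation based"
routine that maps well onto SSE2: the same arithmetic applied independently
to every 16-bit coefficient.  This is a hypothetical sketch, not the actual
FreeRDP code; scale_coefficients and its parameters are names invented for
the example.  A stream decoder, by contrast, typically consumes a variable
number of bits per symbol, with each step depending on the previous one, so
it does not split into independent lanes this way.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical example: shift every 16-bit coefficient left by `shift`
 * bits (0..15), eight coefficients per SSE2 instruction where available. */
static void scale_coefficients(int16_t *buf, size_t count, int shift)
{
    size_t i = 0;

#if defined(__SSE2__)
    /* The shift count lives in the low 64 bits of an XMM register. */
    const __m128i s = _mm_cvtsi32_si128(shift);

    for (; i + 8 <= count; i += 8) {
        /* Unaligned load/store so the caller need not align the buffer. */
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        v = _mm_sll_epi16(v, s);
        _mm_storeu_si128((__m128i *)(buf + i), v);
    }
#endif

    /* Scalar tail (and the whole buffer when SSE2 is not enabled). */
    for (; i < count; i++)
        buf[i] = (int16_t)((uint16_t)buf[i] << shift);
}
```

The scalar loop doubles as a fallback, so the same function can be built on
targets where the compiler does not define __SSE2__.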


It might also be worth taking a look at the 'other' category.  I assume
this includes the final assembly of the RGB data into its output
format.  This might still be a candidate for SSE optimization.


FYI... I probably won't be able to push updates quite as fast over the
next 2 weeks, as we are at the end of a large project at work that is
requiring extra effort to get across the finish line.  I would still
like to see if there is any more performance we can get out of this code
though.  If someone on the list has SSE optimization experience, I would
love a code review... particularly around order of operations and cache
usage.  We might be able to get another couple of percent improvement
with some very minor changes.


Lastly... I should get my new AMD Zacate-based board tomorrow.  Over the
next couple of weeks, I want to take a stab at an alternate
OpenCL-accelerated version of this RemoteFX code as well.  Does anyone
else have interest or experience in this type of acceleration?


Thanks,
 Steve
--
Freerdp-devel mailing list
Freerdp-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freerdp-devel

