I finished adding SSE2 optimizations for the Inverse DWT decoding routines this evening.

Here are the current performance numbers from my Atom D510 test system:

Without SSE:
                                             |-----------------------|
                PROFILER                     |    elapsed seconds    |
|--------------------------------------------|-----------------------|
| code section                  | iterations |     total |      avg. |
|-------------------------------|------------|-----------|-----------|
| rfx_decode_rgb                |      57385 | 54.530000 |  0.000950 |
| rfx_decode_component          |     172155 | 42.120000 |  0.000245 |
| rfx_rlgr_decode               |     172155 | 10.560000 |  0.000061 |
| rfx_differential_decode       |     172155 |  0.240000 |  0.000001 |
| rfx_quantization_decode       |     172155 |  3.980000 |  0.000023 |
| rfx_dwt_2d_decode             |     172155 | 26.250000 |  0.000152 |
| rfx_decode_YCbCr_to_RGB       |      57385 | 10.260000 |  0.000179 |
|--------------------------------------------------------------------|

With SSE:
                                             |-----------------------|
                PROFILER                     |    elapsed seconds    |
|--------------------------------------------|-----------------------|
| code section                  | iterations |     total |      avg. |
|-------------------------------|------------|-----------|-----------|
| rfx_decode_rgb                |      47871 | 20.000000 |  0.000418 |
| rfx_decode_component          |     143613 | 17.010000 |  0.000118 |
| rfx_rlgr_decode               |     143613 | 12.230000 |  0.000085 |
| rfx_differential_decode       |     143613 |  0.150000 |  0.000001 |
| rfx_quantization_decode_SSE2  |     143613 |  0.730000 |  0.000005 |
| rfx_dwt_2d_decode_SSE2        |     143613 |  3.060000 |  0.000021 |
| rfx_decode_YCbCr_to_RGB_SSE2  |      47871 |  1.020000 |  0.000021 |
|--------------------------------------------------------------------|

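As you can see, SSE currently gives us a little more than a 100% performance gain overall: rfx_decode_rgb now averages 0.000418 s per call versus 0.000950 s before, which is about 2.3x faster. It is noticeably faster and more responsive in use as well. Looking at just the three methods that were converted to SSE, the improvement is more than 500%.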
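For anyone who wants to check the arithmetic, here is a minimal sketch in C. The function names are made up for illustration and are not part of the RemoteFX code; the inputs are the "avg." columns from the two tables above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: how many times faster the new code is. */
double speedup(double old_avg, double new_avg)
{
    assert(new_avg > 0.0);
    return old_avg / new_avg;
}

/* Hypothetical helper: gain in percent, so a 2.0x speedup is a 100% gain. */
double gain_percent(double old_avg, double new_avg)
{
    return (speedup(old_avg, new_avg) - 1.0) * 100.0;
}

/* Prints the speedup of each row that changed, using the averages
 * (seconds per call) copied from the profiler tables. */
void print_speedups(void)
{
    printf("rfx_decode_rgb:          %.2fx (%.0f%% gain)\n",
           speedup(0.000950, 0.000418), gain_percent(0.000950, 0.000418));
    printf("rfx_quantization_decode: %.2fx\n", speedup(0.000023, 0.000005));
    printf("rfx_dwt_2d_decode:       %.2fx\n", speedup(0.000152, 0.000021));
    printf("rfx_decode_YCbCr_to_RGB: %.2fx\n", speedup(0.000179, 0.000021));
}
```

Calling print_speedups() from a main() prints one line per method. Keep in mind that the averages are rounded to the microsecond, so the ratios for the very short methods are only approximate.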

Weighting each method's average time by how often it is called (the per-component methods run three times for every rfx_decode_rgb call) gives this breakdown of the SSE run:
61.00%  rlgr
0.72%   diff
3.59%   quant (sse)
15.07%  dwt (sse)
5.02%   ycbcr (sse)
14.59%  other
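The breakdown above can be reproduced with a few lines of C. This is only a sketch of the calculation: share_percent and print_breakdown are names invented for this example, and the constants are the rounded "avg." values from the "With SSE" table.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: percentage of one rfx_decode_rgb call spent in a
 * stage, given the stage's average time and its calls per rfx_decode_rgb. */
double share_percent(double stage_avg, int calls, double total_avg)
{
    assert(total_avg > 0.0);
    return stage_avg * calls / total_avg * 100.0;
}

void print_breakdown(void)
{
    const double total = 0.000418;  /* rfx_decode_rgb avg. (seconds) */

    /* Per-component stages run 3 times per rfx_decode_rgb call
     * (143613 iterations vs. 47871); the colour conversion runs once. */
    double rlgr  = share_percent(0.000085, 3, total);
    double diff  = share_percent(0.000001, 3, total);
    double quant = share_percent(0.000005, 3, total);
    double dwt   = share_percent(0.000021, 3, total);
    double ycbcr = share_percent(0.000021, 1, total);

    /* Whatever the profiled stages do not account for. */
    double other = 100.0 - (rlgr + diff + quant + dwt + ycbcr);

    printf("%6.2f%%  rlgr\n", rlgr);
    printf("%6.2f%%  diff\n", diff);
    printf("%6.2f%%  quant (sse)\n", quant);
    printf("%6.2f%%  dwt (sse)\n", dwt);
    printf("%6.2f%%  ycbcr (sse)\n", ycbcr);
    printf("%6.2f%%  other\n", other);
}
```

Because the inputs are rounded to the microsecond, the last digit of each share can differ slightly from the list above.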


So, the one large remaining non-SSE method (rfx_rlgr_decode) currently accounts for about 61% (85*3 / 418) of the total RemoteFX processing time. This method might be hard to optimize using SSE, as it appears to be more stream/logic based than loop/calculation based. It is definitely worth a further look, though, to see if there are other optimizations that can be made.

It might also be worth taking a look at the 'other' category. I assume this includes the final assembly of the RGB data into its output format. That step might still be a candidate for SSE optimization.

FYI... I probably won't be able to push updates quite as fast over the next 2 weeks, as we are at the end of a large project at work that is requiring extra effort to get across the finish line. I would still like to see if there is any more performance we can get out of this code though. If someone on the list has SSE optimization experience, I would love a code review... particularly around order of operations and cache usage. We might be able to get another couple of percent improvement with some very minor changes.

Lastly... I should get my new AMD Zacate-based board tomorrow. Over the next couple of weeks, I want to take a stab at an alternate OpenCL-accelerated version of this RemoteFX code as well. Does anyone else have interest or experience in this type of acceleration?

Thanks,
 Steve
_______________________________________________
Freerdp-devel mailing list
Freerdp-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freerdp-devel