[Freerdp-devel] Update on SSE2 for RemoteFX
I finished adding SSE2 optimizations for the Inverse DWT decoding routines this evening. Here are the current performance numbers from my Atom D510 test system:

Without SSE:

                                            |-----------------------|
                  PROFILER                  |    elapsed seconds    |
|-------------------------------------------|-----------------------|
| code section                 | iterations |     total |      avg. |
|------------------------------|------------|-----------|-----------|
| rfx_decode_rgb               |      57385 |     54.53 |  0.000950 |
| rfx_decode_component         |     172155 |     42.12 |  0.000245 |
| rfx_rlgr_decode              |     172155 |     10.56 |  0.000061 |
| rfx_differential_decode      |     172155 |      0.24 |  0.000001 |
| rfx_quantization_decode      |     172155 |      3.98 |  0.000023 |
| rfx_dwt_2d_decode            |     172155 |     26.25 |  0.000152 |
| rfx_decode_YCbCr_to_RGB      |      57385 |     10.26 |  0.000179 |
|------------------------------|------------|-----------|-----------|

With SSE:

                                            |-----------------------|
                  PROFILER                  |    elapsed seconds    |
|-------------------------------------------|-----------------------|
| code section                 | iterations |     total |      avg. |
|------------------------------|------------|-----------|-----------|
| rfx_decode_rgb               |      47871 |     20.00 |  0.000418 |
| rfx_decode_component         |     143613 |     17.01 |  0.000118 |
| rfx_rlgr_decode              |     143613 |     12.23 |  0.000085 |
| rfx_differential_decode      |     143613 |      0.15 |  0.000001 |
| rfx_quantization_decode_SSE2 |     143613 |      0.73 |  0.000005 |
| rfx_dwt_2d_decode_SSE2       |     143613 |      3.06 |  0.000021 |
| rfx_decode_YCbCr_to_RGB_SSE2 |      47871 |      1.02 |  0.000021 |
|------------------------------|------------|-----------|-----------|

As you can see, we are currently getting a little more than 100% performance gain by using SSE. It is noticeably faster and more responsive as well. Looking at just the SSE vs. non-SSE methods, we are getting a 500% improvement.

Running the numbers through a calculation (accounting for the per-component methods being called three times per tile, while the others are called once) gives this break-down of the SSE run:

  61.00%  rlgr
   0.72%  diff
   3.59%  quant (sse)
  15.07%  dwt (sse)
   5.02%  ycbcr (sse)
  14.59%  other

So, the one large remaining non-SSE method (rfx_rlgr_decode) currently accounts for about 61% (85*3 / 418, using the per-call averages in microseconds) of the total RemoteFX processing time. This method might be hard to optimize using SSE, however, as it appears to be more stream/logic based than loop/calculation based. It is definitely worth a further look to see if there are other optimizations that can be made.
It might also be worth taking a look at the 'other' category. I assume this includes the final assembly of the RGB data into its output format. This might still be a candidate for SSE optimization.

FYI... I probably won't be able to push updates quite as fast over the next 2 weeks, as we are at the end of a large project at work that is requiring extra effort to get across the finish line. I would still like to see if there is any more performance we can get out of this code though. If someone on the list has SSE optimization experience, I would love a code review... particularly around order of operations and cache usage. We might be able to get another couple of percent improvement with some very minor changes.

Lastly... I should get my new AMD Zacate based board tomorrow. Over the next couple of weeks, I want to take a stab at an alternate OpenCL accelerated version of this RemoteFX code as well. Does anyone else have interest or experience in this type of acceleration?

Thanks,
Steve