Martin,

On 6/9/2011 7:09 AM, Martin Fleisz wrote:
> One thing that will definitely hurt performance is if our memory is not 16-byte aligned. We should also have a possibility to overload the memory allocation in rfx_pool to use _mm_malloc/_mm_free to have correctly aligned buffers.

We should already be 16-byte aligned. I already modified the buffers to be aligned (see rfx_context_init), and GCC automatically aligns local __m128 variables. Looking at the disassembly, GCC is emitting the aligned variants of the instructions. In fact, if we weren't aligned (and still used the aligned instructions), we would crash with a segfault or other exception (I have seen this in testing).
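For illustration, here is a minimal sketch of what aligned allocation plus aligned loads/stores look like with SSE2 intrinsics. The helper names (alloc_aligned_coeffs, is_aligned_16, add_offset) are hypothetical and are not taken from rfx_context_init or rfx_pool; only _mm_malloc/_mm_free and the intrinsics themselves are real.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 intrinsics; pulls in _mm_malloc/_mm_free on GCC */

/* Hypothetical helper: allocate `count` 32-bit values on a 16-byte boundary.
 * The result must be released with _mm_free(), not free(). */
static uint32_t* alloc_aligned_coeffs(size_t count)
{
    return (uint32_t*) _mm_malloc(count * sizeof(uint32_t), 16);
}

/* Returns non-zero when the pointer sits on a 16-byte boundary. */
static int is_aligned_16(const void* p)
{
    return ((uintptr_t) p & 15) == 0;
}

/* Adds `offset` to every element, 4 values per iteration.
 * Assumes `buf` is 16-byte aligned and `count` is a multiple of 4;
 * the aligned load/store below would fault on a misaligned pointer. */
static void add_offset(uint32_t* buf, size_t count, uint32_t offset)
{
    const __m128i off = _mm_set1_epi32((int) offset);
    size_t i;

    assert(is_aligned_16(buf));
    assert(count % 4 == 0);

    for (i = 0; i < count; i += 4)
    {
        __m128i v = _mm_load_si128((const __m128i*) &buf[i]); /* aligned load */
        v = _mm_add_epi32(v, off);
        _mm_store_si128((__m128i*) &buf[i], v);               /* aligned store */
    }
}
```

The asserts make the alignment assumption explicit in debug builds instead of leaving it to show up as a crash inside the loop.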

> I will make an attempt to implement an integer version of the code ... (I noticed that there seems to be no max/min instructions for 32-bit integers so it might not be that straightforward to get it working)

I actually worked on an integer version of the code last night (err, this morning). It is definitely faster than the floating-point version on my machine, but (so far) it has its own problems. The first problem, as you mentioned, is that there is no 32-bit integer min/max instruction until you get to SSE4, which I feel is too new to rely on (at least for my purposes). The approach I took is to use the 16-bit version of all instructions (available in SSE2). This has the advantage of halving the memory requirement for the buffers and doubling the throughput (because it can process 8 values at a time instead of just 4). It also currently has a big disadvantage, however, in that we have to convert the buffers and supporting decoding routines to be uint16 based (from uint32). I must still have a bug in my attempt at this conversion, as I am now getting some weird color artifacts (with both the original and the SSE version of the code). So either I have a bug in the decoding routines that needs to be found, or 16-bit ints aren't big enough to hold all the information prior to color conversion.
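To make the trade-off concrete, here is a rough sketch of both options using only SSE2 intrinsics. This is not the code I have locally; the function names are made up for illustration. The first function clamps 8 signed 16-bit values per iteration with the native 16-bit min/max. The other two emulate a 32-bit signed min/max with a compare and a bitwise select, which is one possible way around the missing instruction at the cost of a few extra operations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 only */

/* Clamp signed 16-bit values to [lo, hi], 8 values per iteration.
 * Assumes `count` is a multiple of 8. Unaligned loads/stores are used so the
 * sketch works with any pointer; aligned buffers could use
 * _mm_load_si128/_mm_store_si128 instead. */
static void clamp_epi16(int16_t* buf, size_t count, int16_t lo, int16_t hi)
{
    const __m128i vlo = _mm_set1_epi16(lo);
    const __m128i vhi = _mm_set1_epi16(hi);
    size_t i;

    assert(count % 8 == 0);

    for (i = 0; i < count; i += 8)
    {
        __m128i v = _mm_loadu_si128((const __m128i*) &buf[i]);
        v = _mm_max_epi16(v, vlo); /* raise anything below lo */
        v = _mm_min_epi16(v, vhi); /* lower anything above hi */
        _mm_storeu_si128((__m128i*) &buf[i], v);
    }
}

/* Emulated signed 32-bit min: pick b in lanes where a > b, else a. */
static __m128i min_epi32_sse2(__m128i a, __m128i b)
{
    const __m128i a_gt_b = _mm_cmpgt_epi32(a, b); /* all-ones where a > b */
    return _mm_or_si128(_mm_and_si128(a_gt_b, b),
                        _mm_andnot_si128(a_gt_b, a));
}

/* Emulated signed 32-bit max: pick a in lanes where a > b, else b. */
static __m128i max_epi32_sse2(__m128i a, __m128i b)
{
    const __m128i a_gt_b = _mm_cmpgt_epi32(a, b);
    return _mm_or_si128(_mm_and_si128(a_gt_b, a),
                        _mm_andnot_si128(a_gt_b, b));
}
```

Note that the SSE2 16-bit min/max intrinsics used here operate on signed values, so whether the buffers end up signed or unsigned matters for this approach.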

Since Vic wrote the original decoding routines (I think), maybe he can weigh in on whether 16-bit ints should be big enough for our buffers, or whether they actually have to be 32-bit ints?

I will check in my integer version once I can verify that my approach actually works. I probably won't be able to look at it again until later tonight.

Thanks,
 Steve
_______________________________________________
Freerdp-devel mailing list
Freerdp-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freerdp-devel