Martin,
On 6/9/2011 7:09 AM, Martin Fleisz wrote:
> One thing that will definitely hurt performance is if our memory is
> not 16-byte aligned. We should also have a possibility to overload the
> memory allocation in rfx_pool to use _mm_malloc/_mm_free to have
> correctly aligned buffers.
We should already be 16-byte aligned. I already modified the buffers
to be aligned (look in rfx_context_init), and GCC automatically aligns
the local __m128 variables. Looking at the disassembled code, GCC is
emitting the aligned versions of the instructions. In fact, if we
weren't aligned (and still used the aligned instructions), we would be
crashing with a seg fault or other exception (I have seen this in
testing).
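
For reference, here is a minimal sketch of the kind of aligned
allocation and aligned access being discussed. The helper names
(is_aligned16, alloc_aligned_u32, add_const_aligned) are made up for
illustration and are not the actual rfx_pool or rfx_context_init code;
only _mm_malloc/_mm_free and the SSE2 intrinsics are real. It assumes
GCC on x86 with SSE2 enabled.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; on GCC this also declares _mm_malloc/_mm_free */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}

/* Hypothetical helper: allocate `count` 32-bit values on a 16-byte boundary.
 * The result must be released with _mm_free(), not free(). */
static uint32_t *alloc_aligned_u32(size_t count)
{
    return (uint32_t *)_mm_malloc(count * sizeof(uint32_t), 16);
}

/* Add `delta` to every element, 4 values per iteration.
 * Assumes buf is 16-byte aligned and count is a multiple of 4;
 * the aligned load/store would fault on a misaligned pointer. */
static void add_const_aligned(uint32_t *buf, size_t count, uint32_t delta)
{
    const __m128i d = _mm_set1_epi32((int)delta);
    size_t i;

    for (i = 0; i < count; i += 4) {
        __m128i v = _mm_load_si128((const __m128i *)(buf + i));
        _mm_store_si128((__m128i *)(buf + i), _mm_add_epi32(v, d));
    }
}
```

Checking is_aligned16() on a buffer before handing it to the aligned
routines is a cheap way to catch a misaligned pointer in a debug build
instead of finding out through a crash.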
> I will make an attempt to implement an integer version of the code ...
> (I noticed that there seems to be no max/min instructions for 32-bit
> integers so it might not be that straightforward to get it working)
I actually worked on an integer version of the code last night (err,
this morning). It is definitely faster than the floating point version
on my machine, but (so far) it has its own problems. The first problem,
as you mentioned, is that there is no 32-bit integer min/max
instruction until you get to SSE4, which I feel is too new to rely on
(at least for my purposes). The approach I took is to use the 16-bit
version of all instructions (available in SSE2). This has the advantage
of half the memory requirement for the buffers and twice the throughput
(because each instruction processes 8 values at a time instead of just
4). It also currently has a big disadvantage, however, in that we have
to convert the buffers and supporting decoding routines to be uint16
based (from uint32). I must still have a bug in my attempt to do this
conversion, as I am now getting some weird color artifacts (with both
the original and the SSE version of the code). So, I either have a bug
in the decoding routines that needs to be found, or 16-bit ints aren't
big enough to hold all the information prior to color conversion.
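
To make the trade-off concrete, here is a rough sketch of both options
using only SSE2. The function names (clamp_epi16, min_epi32_sse2) are
hypothetical and this is not the code I have locally. Note that the
16-bit min/max instructions in SSE2 work on signed values, so the
sketch uses int16_t rather than uint16. The second function is a
compare-and-mask substitute for the missing 32-bit min, which is a
different technique from the 16-bit approach described above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Clamp each signed 16-bit value to [lo, hi], 8 values per iteration.
 * Assumes buf is 16-byte aligned and count is a multiple of 8. */
static void clamp_epi16(int16_t *buf, size_t count, int16_t lo, int16_t hi)
{
    const __m128i vlo = _mm_set1_epi16(lo);
    const __m128i vhi = _mm_set1_epi16(hi);
    size_t i;

    for (i = 0; i < count; i += 8) {
        __m128i v = _mm_load_si128((const __m128i *)(buf + i));
        v = _mm_max_epi16(v, vlo);   /* raise anything below lo */
        v = _mm_min_epi16(v, vhi);   /* lower anything above hi */
        _mm_store_si128((__m128i *)(buf + i), v);
    }
}

/* Signed 32-bit minimum of 4 lanes without SSE4:
 * build a mask of the lanes where a < b, then select a or b per lane. */
static __m128i min_epi32_sse2(__m128i a, __m128i b)
{
    const __m128i mask = _mm_cmplt_epi32(a, b);      /* all ones where a < b */
    return _mm_or_si128(_mm_and_si128(mask, a),      /* keep a where a < b   */
                        _mm_andnot_si128(mask, b));  /* keep b elsewhere     */
}
```

The 32-bit fallback costs four instructions per min (and the same again
for a max) and still only handles 4 values at a time, which is part of
why the 16-bit route looks attractive if the value range allows it.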
Since Vic wrote the original decoding routines (I think), maybe he can
weigh in on whether 16-bit ints should be big enough for our buffers, or
if they actually have to be 32-bit ints?
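
As an illustration of why the width question matters (not a diagnosis
of the artifacts I am seeing), the sketch below shows what happens when
an intermediate sum does not fit in 16 bits. The function name
add_wrap_vs_saturate and the input values are made up for the example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Add a and b as signed 16-bit lanes in two ways and return lane 0 of each:
 * _mm_add_epi16 wraps around on overflow, _mm_adds_epi16 saturates. */
static void add_wrap_vs_saturate(int16_t a, int16_t b,
                                 int16_t *wrapped, int16_t *saturated)
{
    const __m128i va = _mm_set1_epi16(a);
    const __m128i vb = _mm_set1_epi16(b);

    *wrapped   = (int16_t)_mm_extract_epi16(_mm_add_epi16(va, vb), 0);
    *saturated = (int16_t)_mm_extract_epi16(_mm_adds_epi16(va, vb), 0);
}
```

With a = 30000 and b = 10000 the true sum is 40000, which does not fit
in a signed 16-bit lane: the wrapping add gives -25536 and the
saturating add gives 32767. Either result would be wrong going into
color conversion, so if any intermediate value can leave the 16-bit
range, the buffers would need to stay 32-bit.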
I will check in my integer version once I can verify that my approach
actually works. I probably won't be able to look at it again until
later tonight.
Thanks,
Steve