Sorry it's taken so long to get back to this.

On Wed, May 9, 2012 at 12:57 PM, Søren Sandmann <sandm...@cs.au.dk> wrote:
> Matt Turner <matts...@gmail.com> writes:
>
> I still think MMX has no use on modern systems. The SSE2 implementation
> used to have such MMX loops, but they were removed in
> f88ae14c15040345a12ff0488c7b23d25639e49b because there were issues with
> compilers that would miscompile the emms instruction.
>
> Can't the MMX loop you added be done with SSE registers and instructions
> as well?

The registers -- yes. The 8-byte aligned loads and stores I'm not
sure. Can you do 8-byte aligned loads and stores to/from SSE
registers?

>> Porting the pmadd algorithm to SSE4.1 gave another (very large)
>> improvement.
>>
>> fast:    src_8888_0565 = L1: 655.18  L2: 675.94  M: 642.31 ( 23.44%)
>>          HT:403.00  VT:286.45  R:307.61  RT:150.59 (1675Kops/s)
>> mmx:     src_8888_0565 = L1:2050.45  L2:1988.97  M:1586.16 ( 57.34%)
>>          HT:529.12  VT:374.28  R:412.09  RT:177.35 (1913Kops/s)
>> sse2:    src_8888_0565 = L1:1518.61  L2:1493.10  M:1279.18 ( 46.24%)
>>          HT:433.65  VT:314.48  R:349.14  RT:151.84 (1685Kops/s)
>> sse2mmx: src_8888_0565 = L1:1544.91  L2:1520.83  M:1307.79 ( 47.01%)
>>          HT:447.82  VT:326.81  R:379.60  RT:174.07 (1878Kops/s)
>> sse4:    src_8888_0565 = L1:4654.11  L2:4202.98  M:1885.01 ( 69.35%)
>>          HT:540.65  VT:421.04  R:427.73  RT:161.45 (1773Kops/s)
>> sse4mmx: src_8888_0565 = L1:4786.27  L2:4255.13  M:1920.18 ( 69.93%)
>>          HT:581.42  VT:447.99  R:482.27  RT:193.15 (2049Kops/s)
>>
>> I'd like to isolate exactly what the performance improvement given by
>> the only SSE4.1 instruction (i.e., _mm_packus_epi32) is before declaring
>> SSE4.1 a fantastic improvement. If you can come up with a reasonable way
>> to pack the two xmm registers together in pack_565_2packedx128_128,
>> please tell me. Shuffle only works on hi/lo 8-bytes, so it'd be a pain.
>
> Would it work to subtract 0x8000, then use packssdw, then add 0x8000?

I couldn't make that work, but I asked on StackOverflow and got a nice
solution:

http://stackoverflow.com/questions/11024652/simulating-packusdw-functionality-with-sse2

>> I'd rather not duplicate a bunch of code from pixman-mmx.c, and I'd
>> rather not add #ifdef USE_SSE41 to pixman-sse2.c and make it a
>> compile-time option (or recompile the whole file to get a few
>> improvements from SSE4.1).
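
For reference, here is a minimal sketch of one SSE2-only way to stand in
for _mm_packus_epi32 in the packing step discussed above. It assumes
every 32-bit lane already fits in 16 bits (which should hold for an
assembled r5g6b5 value), so it is not a saturating pack. The names
pack_u32_to_u16_sse2 and pack8 are made up for illustration and are not
pixman functions. The idea is to sign-extend the low 16 bits of each
lane first, so that packssdw has nothing to clamp:

```c
#include <emmintrin.h>  /* SSE2 intrinsics; x86/x86-64 only */
#include <stdint.h>

/* Hypothetical helper: narrow eight 32-bit lanes (four in lo, four in hi)
 * to eight 16-bit lanes using SSE2 only.
 * Assumption: each input lane is in the range 0..0xFFFF. */
static inline __m128i
pack_u32_to_u16_sse2 (__m128i lo, __m128i hi)
{
    /* Shift left then arithmetic-shift right to sign-extend bits 0..15.
     * Every lane then lies in [-32768, 32767], so the signed saturation
     * in packssdw (_mm_packs_epi32) never triggers and the low 16 bits
     * pass through unchanged. */
    lo = _mm_srai_epi32 (_mm_slli_epi32 (lo, 16), 16);
    hi = _mm_srai_epi32 (_mm_slli_epi32 (hi, 16), 16);

    return _mm_packs_epi32 (lo, hi);
}

/* Hypothetical wrapper for trying the helper on plain arrays. */
static void
pack8 (const uint32_t in[8], uint16_t out[8])
{
    __m128i lo = _mm_loadu_si128 ((const __m128i *) in);
    __m128i hi = _mm_loadu_si128 ((const __m128i *) (in + 4));

    _mm_storeu_si128 ((__m128i *) out, pack_u32_to_u16_sse2 (lo, hi));
}
```

A plain packssdw would clamp inputs such as 0x8000 or 0xFFFF to 0x7FFF;
after the sign extension they come through unchanged.
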
>>
>> It seems like we need a generic solution that would say for each
>> compositing function
>> - this is what you do for 1-byte;
>> - this is what you do for 8-bytes if you have MMX;
>> - this is what you do for 16-bytes if you have SSE2;
>> - this is what you do for 16-bytes if you have SSE3;
>> - this is what you do for 16-bytes if you have SSE4.1.
>> and then construct the functions for generic/MMX/SSE2/SSE4 at build time.
>>
>> Does this seem like a reasonable approach? *How* to do it -- suggestions
>> welcome.
>
> I think ideally we would generate this code at runtime. It's just not
> feasible to generate code for all combinations of instruction sets at
> build time and libpixman.so is already rather large. Generating the code
> at runtime has the additional advantages that it is not limited to a
> fixed set of fast paths and that it can make use of more details of the
> operation such as the precise alignment for palignr generation.
>
> There are various ways to go about this, ranging from simple-minded
> stitching-together of pre-written snippets to a full shader compiler. A
> full shader compiler is obviously a big project, but maybe a simple
> stitch-together kind of thing wouldn't actually be that hard using
> something like this:
>
> http://cgit.freedesktop.org/~sandmann/genrender/tree/codex86.h?h=graph
> http://cgit.freedesktop.org/~sandmann/genrender/tree/codex86.c?h=graph
>
> That said, runtime code-generation is still a big project, and it does
> make sense to make use of some of the newer instruction sets.
>
> We do have support for fallbacks, so as Makoto-san says, just adding new
> pixman-ssse3.c and pixman-sse41.c files with duplicated code for the
> particular operations that benefit from ssse3 and sse4.1 might be the
> simplest way to proceed.

Indeed, runtime generation would be great. Something like LLVM or orc
would be interesting options. I'm not sure I'm up to that kind of
project yet/now though.

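To make the fallback idea above concrete, here is a rough sketch of how
a chain of implementations could resolve an operation: each level lists
only the operations it accelerates, and anything it lacks falls through
to the next level. This is not pixman's actual data structure. The types
fast_path_t and impl_t, the lookup function, and the operation names
other than src_8888_0565 are all invented for illustration, and a string
stands in for the real function pointer. Picking the top of the chain
from runtime CPU detection is left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fast-path entry. */
typedef struct
{
    const char *op;           /* operation name, e.g. "src_8888_0565" */
    const char *provided_by;  /* stand-in for a real function pointer */
} fast_path_t;

/* Hypothetical implementation level with a link to its fallback. */
typedef struct impl
{
    const struct impl *fallback;  /* next, less specialised level */
    const fast_path_t *paths;
    size_t             n_paths;
} impl_t;

static const fast_path_t general_paths[] = {
    { "src_8888_0565",  "general" },
    { "over_8888_8888", "general" },
    { "add_8_8",        "general" },
};

static const fast_path_t sse2_paths[] = {
    { "src_8888_0565",  "sse2" },
    { "over_8888_8888", "sse2" },
};

/* An SSE4.1 file would only carry the operations that benefit. */
static const fast_path_t sse41_paths[] = {
    { "src_8888_0565",  "sse4.1" },
};

static const impl_t general_impl = { NULL,          general_paths, 3 };
static const impl_t sse2_impl    = { &general_impl, sse2_paths,    2 };
static const impl_t sse41_impl   = { &sse2_impl,    sse41_paths,   1 };

/* Walk the chain from the most specialised level downwards and return
 * the first provider of `op`, or NULL if no level implements it. */
static const char *
lookup (const impl_t *top, const char *op)
{
    for (const impl_t *i = top; i != NULL; i = i->fallback)
    {
        for (size_t k = 0; k < i->n_paths; k++)
        {
            if (strcmp (i->paths[k].op, op) == 0)
                return i->paths[k].provided_by;
        }
    }
    return NULL;
}
```

With this layout, lookup (&sse41_impl, "over_8888_8888") ends up in the
SSE2 table, so the SSE4.1 level does not have to duplicate it.
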
I think adding pixman-sse*.c files is a reasonable measure for now.
Think it's okay to split the static inline support functions from
pixman-sse2.c out into a header to be shared with the other
pixman-sse*.c files?

Also, are we planning to change the bilinear scaling algorithm for
0.28 so that we can use pmaddubsw?

Thanks,
Matt
_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman