Eterm devs,

PATCHES:

1. Like I suspected, the 15bpp w/ saturation C shading routine flips on
   too many bits in adjacent colors, just like the 16bpp C routine did.
   This patch corrects that behavior.

2. I'm not sure how 15bpp is defined. If the highest bit should always be
   zero then there is a bug in the 15bpp MMX shading routines. I noticed
   it while comparing the output of my new 15bpp shading routine to the
   old one. If the highest bit of each pixel (bit 15) is ALWAYS, and WILL
   ALWAYS BE, ignored then this is not an issue. If not, then the 15bpp
   MMX shading routine leaves overflow from the red color modification in
   the leftmost bit and it should be cleared. I have attached a patch to
   do just that. This is only an issue in the saturation section, since
   the math says that without saturation red will never overflow.

NOTES:

The x86_64 port of the MMX routines is kinda dead. While doing the port it
occurred to me that all x86_64 processors have at least SSE2, with 128-bit
multimedia registers, and that there is no reason I shouldn't take
advantage of that. As a result, the status of the *NEW* SSE2 port of the
shading routines is as follows:

1. The 15bpp SSE2 shading routines are complete and verified to shade
   identically to the 15bpp (patched) C shading routines. They shade 8
   pixels per pass until pixels_remaining_for_line / 8 = 0 and then shade
   one pixel at a time. That is twice as many pixels per pass as the MMX
   routines, so we should see a corresponding speed improvement.

2. The 16bpp SSE2 shading routines are complete and verified to shade
   identically to the 16bpp (patched) C shading routines. They get the
   same performance boost as 15bpp mode.

3. The 32bpp routine is currently working with the 64-bit MMX registers
   and processing one pixel at a time. I hope to convert it to use the
   full 128 bits and process two pixels at a time. That is the max, as
   room for overflow is needed (see note below). This will more than
   double the complexity of this routine but also double its performance.

4. The 24bpp routine is still under investigation. There is no 24bpp MMX
   shading routine, but that isn't the problem. The problem is moving 24
   bits of data into a processor register and zero padding the rest of
   the pixel out to a size of 2^n bytes (where n is a non-negative whole
   number). 24 bits = 3 bytes, and there is no 'n' that works directly.
   The only solution is to read a byte at a time. That's three reads and
   three writes for each pixel, which is actually what the C routine does
   by manipulating the three unsigned chars. Once each pixel is loaded
   the shading is identical to the 32bpp routines, but the overhead of
   unpacking the 24 bits into 32 and then repacking does not look to be
   worth it, especially if after all of that work we can only process two
   pixels at a time. I attempted a workaround that reads the data 32 bits
   at a time and simply writes the topmost 8 bits back out unchanged when
   storing the other 24 bits of the pixel. If anybody has any suggestions
   I've overlooked on this topic then _PLEASE_ speak up.

Things to note (maybe for the Eterm man page under --cmod):

All of the colors of all of the pixels need some room for overflow during
the intermediate steps of the shading. Although no hard errors will occur,
strange behavior will happen when color * modifier exceeds the temporary
storage. For 15 & 16 bpp modes the overflow bits are:

  15bpp    5 bits red        5 bits green      5 bits blue
           3 bits overflow   3 bits overflow   3 bits overflow

  16bpp    5 bits red        6 bits green      5 bits blue
           3 bits overflow   2 bits overflow   3 bits overflow

This is true for all the shading routines: C, MMX, and SSE2.

In 24 & 32 bpp modes each color consumes an entire byte, so a 16-bit word
is used for the intermediate values. Therefore each color of each pixel
has a full 8 bits for overflow. The colors are still condensed back to 8
bits upon completion, though. It is impossible to use a couple of bits
from the alpha channel for overflow, as the working size must be byte
aligned and the first size above 8 bits is 16.
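
For anyone who wants to play with the 24bpp workaround from item 4 (read 32 bits, shade the low 24, write the top 8 back untouched), here is a minimal C sketch of the idea. It is not the code I tried in assembly. It assumes a little-endian CPU (x86/x86_64), 8.8 fixed-point modifiers where 256 means 100% (inferred from the >> 8 in the C routines), and saturation at 255. The names scale8, shade24_via32 and shade24_line are made up for illustration, and m0..m2 simply follow the byte order in memory rather than claiming a particular RGB/BGR layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scale one 8-bit channel by an 8.8 fixed-point modifier (256 = 100%,
 * an assumption) and saturate the result at 255. */
static uint32_t scale8(uint32_t c, uint32_t m)
{
    uint32_t v = (c * m) >> 8;
    return v > 0xff ? 0xff : v;
}

/* Hypothetical helper, not Eterm code: shade the packed 24bpp pixel at p.
 * Assumes little-endian, so p[0..2] land in the low 24 bits of the word
 * and p[3] (the first byte of the next pixel) lands in the top 8 bits.
 * Reads and writes 4 bytes, so p[3] must be valid memory. */
static void shade24_via32(unsigned char *p, uint32_t m0, uint32_t m1, uint32_t m2)
{
    uint32_t v, out;

    assert(p != NULL);
    memcpy(&v, p, 4);                         /* one 32-bit load */
    out  = scale8(v & 0xff, m0);
    out |= scale8((v >> 8) & 0xff, m1) << 8;
    out |= scale8((v >> 16) & 0xff, m2) << 16;
    out |= v & 0xff000000u;                   /* top 8 bits go back untouched */
    memcpy(p, &out, 4);                       /* one 32-bit store */
}

/* Shade a line of n pixels. The last pixel is handled a byte at a time so
 * that nothing past the end of the line is ever read or written. */
static void shade24_line(unsigned char *line, size_t n,
                         uint32_t m0, uint32_t m1, uint32_t m2)
{
    size_t i;

    for (i = 0; i + 1 < n; i++)
        shade24_via32(line + 3 * i, m0, m1, m2);
    if (n > 0) {
        unsigned char *p = line + 3 * (n - 1);
        p[0] = (unsigned char) scale8(p[0], m0);
        p[1] = (unsigned char) scale8(p[1], m1);
        p[2] = (unsigned char) scale8(p[2], m2);
    }
}
```

Note that this still handles only one pixel per pass, and every pixel but the last costs an overlapping 4-byte load and store, so it says nothing about whether the approach is a win over the three byte reads and writes.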

While lurking on the #gentoo-dev channel I noticed some of the devs
bitching about the register allocator in gcc (v. 4, I think). The MMX
routines expect the register allocator to behave a certain way and will
bitch loudly (or segfault) if its behavior changes, because the incoming
parameters will be in unpredictable locations. To avoid any problems with
this I have opted to write the SSE2 routines using inline assembly. Even
if I had written them in pure assembly, combining them with mmx_cmod.S
would have required more #ifdef's than code. Sorry Mej! :-/ I started to
do it that way for you, but if you saw the code you'd flip.

Much more detailed info is in the comments at the top of the new file,
which will be submitted soon.

Is there a way to look at Eterm-0.9.4/src/pixmap.c without getting the
entire CVS tree? A link to a web page with the latest pixmap.c source in
CVS would be awesome! TIA.

Time to sleeeeep,
The River Rat

P.S. I still have the MMX port to 64-bit for the 32 & 16 bpp shading
routines. These routines run on a 64-bit processor but use standard MMX
calls (not SSE2) and only use 64 bits of the multimedia registers. If
anyone is interested in them, speak up now or I'll probably just delete
them at the completion of the SSE2 port. They DO work though! 8-)

-- 
Tres

--- Eterm-0.9.3-orig/src/mmx_cmod.S 2004-01-11 15:13:02.000000000 -0700
+++ Eterm-0.9.3/src/mmx_cmod.S 2005-05-07 07:45:07.000000000 -0600
@@ -198,6 +201,7 @@
         paddusw %mm3, %mm1 /* ff eg */
         paddusw %mm3, %mm2 /* ff eb */
+        psubw %mm3, %mm0 /* 00 0r */
         psubw %mm3, %mm1 /* 00 0g */
         psubw %mm3, %mm2 /* 00 0b */
@@ -234,6 +238,7 @@
         paddusw %mm3, %mm1 /* ff eg */
         paddusw %mm3, %mm2 /* ff eb */
+        psubw %mm3, %mm0 /* 00 0r */
         psubw %mm3, %mm1 /* 00 0g */
         psubw %mm3, %mm2 /* 00 0b */

--- Eterm-0.9.3-orig/src/pixmap.c 2004-07-22 14:12:31.000000000 -0600
+++ Eterm-0.9.3/src/pixmap.c 2005-05-07 07:54:00.000000000 -0600
@@ -1559,16 +1590,13 @@
             for (x = -w; x < 0; x++) {
                 int r, g, b;
-                b = ((DATA16 *) ptr)[x];
-                r = (b & 0x7c00) * rm;
-                g = (b & 0x3e0) * gm;
-                b = (b & 0x1f) * bm;
-                r |= (!(r >> 15) - 1);
-                g |= (!(g >> 10) - 1);
-                b |= (!(b >> 5) - 1);
-                ((DATA16 *) ptr)[x] = ((r >> 8) & 0x7c00)
-                    | ((g >> 8) & 0x3e0)
-                    | ((b >> 8) & 0x1f);
+                r = ( (b >> 10 ) * rm ) >> 8;
+                r = ( r > 0x001f ) ? 0xfc00 : ( r << 10 );
+                g = (((b >> 5 ) & 0x003f ) * gm ) >> 8;
+                g = ( g > 0x001f ) ? 0x03e0 : ( g << 5 );
+                b = (( b & 0x001f ) * bm ) >> 8;
+                b = ( b > 0x001f ) ? 0x001f : b;
+                ((DATA16 *) ptr)[x] = (r|g|b);
             }
             ptr += bpl;
         }
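
For comparison with the pixmap.c patch, here is a minimal standalone sketch of per-pixel 15bpp shading with saturation. It is not the patch itself. It assumes the pixel layout 0RRRRRGGGGGBBBBB with bit 15 always kept at zero (the open question in item 2 of the patch notes), 5-bit green, and 8.8 fixed-point modifiers where 256 means 100% (inferred from the >> 8 in the patch), so its masks and clamp values are chosen for that layout rather than copied from the diff. The names scale5, shade15_pixel and shade15_line are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scale one 5-bit channel by an 8.8 fixed-point modifier (256 = 100%,
 * an assumption) and saturate at the 5-bit maximum. Clamping each channel
 * on its own keeps an overflow from leaking into the neighboring color. */
static unsigned scale5(unsigned c, unsigned m)
{
    unsigned v = (c * m) >> 8;
    return v > 0x1f ? 0x1f : v;
}

/* Hypothetical helper: shade one 15bpp pixel (0RRRRRGGGGGBBBBB assumed). */
static uint16_t shade15_pixel(uint16_t px, unsigned rm, unsigned gm, unsigned bm)
{
    unsigned r = scale5((px >> 10) & 0x1f, rm);
    unsigned g = scale5((px >> 5) & 0x1f, gm);
    unsigned b = scale5(px & 0x1f, bm);

    /* r is at most 0x1f, so r << 10 never reaches bit 15. */
    return (uint16_t) ((r << 10) | (g << 5) | b);
}

/* Shade a line of w pixels in place, one pixel at a time. */
static void shade15_line(uint16_t *line, size_t w,
                         unsigned rm, unsigned gm, unsigned bm)
{
    size_t i;

    assert(line != NULL || w == 0);
    for (i = 0; i < w; i++)
        line[i] = shade15_pixel(line[i], rm, gm, bm);
}
```

With these assumptions a modifier of 256 leaves a pixel unchanged, 128 halves every channel, and anything that would exceed 31 clamps to 31 for that channel only. If bit 15 turns out to carry meaning on some visual, the masking in shade15_pixel would have to preserve it instead of dropping it.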