On 2002.04.10 17:42 Brian Paul wrote:
>
> José,
>
> I've checked in the code after testing with Glean and the OpenGL
> conformance tests.
>

Great.

> Was I supposed to change something in the C code?  It passes the
> conformance tests as-is.
>

I was surprised that the C code passed the conformance tests because, with the signed arithmetic, it doesn't give the same results as before. So I've made a small comparison of the several methods (test program attached):

    // Nathan's method - unsigned 24bit arithmetic
    // NOTE: this was the original Mesa code
    t1 = p*a + q*(255 - a);
    s1 = (t1 + (t1 << 8) + 256) >> 16;

    // Nathan's method - signed 24bit arithmetic (one multiply less)
    // NOTE: this is what I changed it to and what is in the code now
    t2 = (p - q)*a;
    s2 = (t2 + (t2 << 8) + 256) >> 16;
    s2 += q;
    s2 &= 0xff;

    // Blinn's method - unsigned 16bit arithmetic
    // NOTE: is exact
    t3 = p*a + q*(255-a) + 128;
    s3 = (t3 + (t3 >> 8)) >> 8;

    // Blinn's method - signed 16bit arithmetic (one multiply less)
    // NOTE: is exact because the negative sign is taken into account
    t4 = ((p - q)*a + (p > q ? 128 : -128)) & 0xffff;
    s4 = (t4 + (t4 >> 8)) >> 8;
    s4 += q;
    s4 &= 0xff;

When one compares with the exact result

    // exact result - rounded
    s = (unsigned) (((double)p)*(((double)a)/255.0) + ((double)q)*(1.0-((double)a)/255.0) + 0.5);

one gets:

    1: 8164890 differences in 16777216
    2: 8148697 differences in 16777216
    3: 0 differences in 16777216
    4: 0 differences in 16777216

So, in spite of the different results between 1 and 2, method 2 gives better results overall! What happens is that method 1 aims to follow the truncated result, not the rounded one. If one compares with the truncated result

    // truncated result
    s = (unsigned) (((double)p)*(((double)a)/255.0) + ((double)q)*(1.0-((double)a)/255.0));

one gets:

    1: 15467 differences in 16777216
    2: 31660 differences in 16777216
    3: 8180357 differences in 16777216
    4: 8180357 differences in 16777216

Notice that, from this point of view, method 2 is indeed worse, but this really doesn't matter because it is the wrong point of view. This explains why the current C code passes the conformance tests.

At this moment the MMX code implements method 4, which is very fast. There is no point in implementing method 2, in spite of it being a little faster than method 4 (because of the simpler rounding), because it would require 24-bit arithmetic instead of 16-bit, so fewer values could be multiplied at the same time. So, contrary to what I thought, there is no need to switch to method 1. When I implement the double blend trick I will have to use method 4 again, for the same reasons as above.

But since the specs give some tolerance, it would be nice to run the conformance tests with different settings in mmx_blend.S, especially the "single multiply w/o rounding", which would give at least a 5% improvement (it will be a little more, because it would free some registers, allowing some necessary constants to be kept there). For that it is only necessary to change

    #define GMBT_ROUNDOFF 0

leaving the rest as before

    #define GMBT_ALPHA_PLUS_ONE 0
    #define GMBT_GEOMETRIC_SERIES 1
    #define GMBT_SIGNED_ARITHMETIC 1

Using the alpha+1 method and not using the geometric series would be even faster, but it is already marked in the C code as rejected by glean...

> Thanks for your work!
>
> -Brian
>

Regards,

José Fonseca

#include <stdio.h>
#include <stdlib.h>

int main()
{
    unsigned short p, q, a;
    unsigned c1 = 0, c2 = 0, c3 = 0, c4 = 0;

    /* Try every combination of source p, destination q and alpha a. */
    for (p = 0; p <= 255; ++p)
        for (q = 0; q <= 255; ++q)
            for (a = 0; a <= 255; ++a)
            {
                unsigned s;
                unsigned s1, s2, s3, s4;
                unsigned t1, t2, t3, t4;

#if 1
                // exact result - rounded
                s = (unsigned) (((double)p)*(((double)a)/255.0) + ((double)q)*(1.0-((double)a)/255.0) + 0.5);
#else
                // truncated result
                s = (unsigned) (((double)p)*(((double)a)/255.0) + ((double)q)*(1.0-((double)a)/255.0));
#endif

                // Nathan's method - unsigned 24bit arithmetic
                t1 = p*a + q*(255 - a);
                s1 = (t1 + (t1 << 8) + 256) >> 16;

                // Nathan's method - signed 24bit arithmetic
                t2 = (p - q)*a;
                s2 = (t2 + (t2 << 8) + 256) >> 16;
                s2 += q;
                s2 &= 0xff;

                // Blinn's method - unsigned 16bit arithmetic
                // NOTE: is exact
                t3 = p*a + q*(255-a) + 128;
                s3 = (t3 + (t3 >> 8)) >> 8;

                // Blinn's method - signed 16bit arithmetic
                // NOTE: is exact because the negative sign is taken into account
                t4 = ((p - q)*a + (p > q ? 128 : -128)) & 0xffff;
                s4 = (t4 + (t4 >> 8)) >> 8;
                s4 += q;
                s4 &= 0xff;

                /* Count the mismatches of each method against the reference. */
                if (s1 != s) ++c1;
                if (s2 != s) ++c2;
                if (s3 != s) ++c3;
                if (s4 != s) ++c4;

                if (s1 != s || s2 != s || s3 != s || s4 != s)
                {
                    // printf("%3ux%3ux%3u:\t(%3u)\t%3u\t%3u\t%3u\t%3u\n", p, a, q, s, s1, s2, s3, s4);
                }
            }

    printf("1: %u differences in %u\n", c1, 256*256*256);
    printf("2: %u differences in %u\n", c2, 256*256*256);
    printf("3: %u differences in %u\n", c3, 256*256*256);
    printf("4: %u differences in %u\n", c4, 256*256*256);

    return 0;
}