Re: [ft-devel] FT_MulFix assembly
Werner: Miles' version is shorter, is only wrong by one ulp and only when the product overflows and is negative. My variation, called another() above, fixes that slight difference. Which would you prefer, if anything? I tend to prefer the faster one $(Q#|(B IIRC, we never rely on product overflow of 32bit entities. I can write a patch which uses it only when it works (64 bit sizeof(long)). Please do so. Werner ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
James Cloos cl...@jhcloos.com writes: The C version does away-from-zero rounding. MB Do you have test cases that show this? I tried using random inputs, MB but even up to billions of iterations, I can't seem to find a set of MB inputs where my function yields different results from yours. The C version saves the two signs, takes the absolute values, multiplies, scales and then sets the sign. When I tested, I used dd(1) to generate a quarter-gig file from urandom (I used a fixed file so that it would be reproducable), mmap(2)ed that to an int[], and went through two at a time. The C and my initial asm versions produced different results whenever the second int was -1 (ie 0x) and the first matched: (a 0 (a 0x == 0x8000)). In other words, multiplying something like 7.5 by -1/65536. An example of that test's output was: 7AFA8000, , 8505, 8506, 0 In that example, 8505 is what the C version generates. Hm, are you sure that's not backwards? When I tried the git C version[*], as well as your most recent FT_MulFix_x86_64, it returned 0x8506... The following C version: typedef signed int FT_Int; typedef signed long FT_Long; typedef signed long FT_Int64; /* on x86-64 */ FT_Long FT_MulFix_C_new( FT_Long a, FT_Long b ) { return (((FT_Int64)a * (FT_Int64)b) + 0x8000) 16; } ... generates this code: imulq %rsi, %rdi leaq32768(%rdi), %rax sarq$16, %rax It seems to yield exactly the same results as the offical C version[*], both for your test case: $ ./t 0x7AFA8000 0x 0x7afa8000 x 0x = C: 0x7fff8506 C_new: 0x7fff8506 asm: 0x7fff8506 ... and also for billions of random inputs. Is there something I'm missing...? [*] Fetched from: http://git.savannah.gnu.org/cgit/freetype/freetype2.git/tree/src/base/ftcalc.c Thanks, -Miles -- Liberty, n. One of imagination's most precious possessions. ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
MB == Miles Bader mi...@gnu.org writes: MB Hm, are you sure that's not backwards? When I tried the git C version[*], MB as well as your most recent FT_MulFix_x86_64, it returned 0x8506... Odd. Adding your algo to my test app, I get: 7AFA8000, , 8505, 8505, 8506 #a , b ,FT ,JC ,MB I see that I have one small error in the C code in my app. FT has: c = (FT_Long)( ( (FT_Int64)a * b + 0x8000L ) 16 ); whereas I used: c = (int32_t)(((int64_t)a*b + 0x8000L) 16); But changing the int32_t to long does not change the results. Yours still is always +1 compared to the C, whenever the first arg represents a positive value with fractional part == 1/2. Oddly, though, gcc now refuses to compile my asm, even though it did do so before, complaining that I cannot guess what arg size to use for the imul Wierd. (The existing executables prove that it used to.) A simple way around that is to specify D and S as the contraints for a and b. (The rdi and rsi regesters are where the x86_64 abi puts the first two args which are passed to a function.) The disassembly of the final version is: 004006c0 mf: 4006c0: 48 89 f8mov%rdi,%rax 4006c3: 48 f7 eeimul %rsi 4006c6: 48 01 d0add%rdx,%rax 4006c9: 48 05 00 80 00 00 add$0x8000,%rax 4006cf: 48 c1 f8 10 sar$0x10,%rax 4006d3: c3 retq And I get this disassembly of yours: 00400840 miles: 400840: 48 63 c6movslq %esi,%rax 400843: 48 63 ffmovslq %edi,%rdi 400846: 48 0f af c7 imul %rdi,%rax 40084a: 48 05 00 80 00 00 add$0x8000,%rax 400850: 48 c1 f8 10 sar$0x10,%rax 400854: c3 retq I also just added this version to my test app: int another (int32_t a, int32_t b) { long r = (long)a * (long)b; long s = r 31; return (r + s + 0x8000) 16; } That results in: 00400760 another: 400760: 48 63 ffmovslq %edi,%rdi 400763: 48 63 f6movslq %esi,%rsi 400766: 48 0f af f7 imul %rdi,%rsi 40076a: 48 89 f0mov%rsi,%rax 40076d: 48 c1 f8 1f sar$0x1f,%rax 400771: 48 8d 84 06 00 80 00lea0x8000(%rsi,%rax,1),%rax 400778: 00 400779: 48 c1 f8 10 sar$0x10,%rax 40077d: c3 retq Since FT's C version uses longs, though, this: int another (long a, long b) { long r = (long)a * (long)b; long s = r 31; return (r + s + 0x8000) 16; } gives: 00400760 another: 400760: 48 0f af f7 imul %rdi,%rsi 400764: 48 89 f0mov%rsi,%rax 400767: 48 c1 f8 1f sar$0x1f,%rax 40076b: 48 8d 84 06 00 80 00lea0x8000(%rsi,%rax,1),%rax 400772: 00 400773: 48 c1 f8 10 sar$0x10,%rax 400777: c3 retq So it would seem that when compiling for any processor where FT_Long is the same as int64_t and where that fits into a single register, then that last bit of C might be optimal, yes? -JimC -- James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6 ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
James Cloos cl...@jhcloos.com writes: Since FT's C version uses longs, though, this: int another (long a, long b) { long r = (long)a * (long)b; long s = r 31; return (r + s + 0x8000) 16; } That's not correct though, is it? The variable s should be the all sign portion of the multiplication, but since the two inputs have 32 significant bits (never mind the types), the product will have 64 significant bits. So r 31 won't be all-sign, it'll be a bunch of ... other bits. :) However, changing the shift to 63: FT_Long FT_MulFix_C_new2( FT_Long a, FT_Long b ) { FT_Int64 prod = (FT_Int64)a * (FT_Int64)b; FT_Int64 sign = prod 63; return ((prod + sign + 0x8000) 16); } ... does seem to yield correct results: $ ./t 0x7AFA8000 0x 0x7afa8000 x 0x = C: 0x8505 C_new: 0x8505 C_nw2: 0x8505 C_ano: 0x8505 asm: 0x8505 C is the old C, C_new was my previous attempt, C_nw2 is the above FT_MulFix_C_new2 function, C_ano is the another function, and asm was your final asm version. [another yields misleadingly correct results in this case, because of the particular argument values given; in other cases, it gives incorrect results.] -miles -- Discriminate, v.i. To note the particulars in which one person or thing is, if possible, more objectionable than another. ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
Have you done an ARM version? Forgive my inattentiveness if you've already announced one. It just struck me that this sort of optimisation is even more necessary on mobile devices. Graham James Cloos wrote: The final result for amd64 looks like: static __inline__ long FT_MulFix_x86_64( long a, long b ) { register long result; __asm__ __volatile__ ( movq %1, %%rax\n imul %2\n addq %%rdx, %%rax\n addq $0x8000, %%rax\n sarq $16, %%rax\n : =a(result) : g(a), g(b) : rdx ); return result; } The use of long, though requires review. The C version uses FT_Long (not FT_Int32 like the other asm versions), but FT_Long is not a #define or a typedef at the point where the asm version are located. That said, using long there on amd64 prevents unnecessary 32-64 bit conversions in the resulting code. The above code has a latency of 1+5+1+1+1 = 10 clocks on an amdfam10 cpu. The assembly generated by the C code is 45 lines and 158 octets long, contains six conditional jumps, three each of explicit compares and tests, and still benchmarks are just as fast. Out-of-order processing wins out over hand-coded asm. :-/ It *might* make more of a difference on an in-order processor like the Arom. But I do not have one to test. I can still finish a patch, and have collected the info I need to do one for mips64, too, where I expect it will be more important. I also expect that the i386 version could be tidied a bit. Is the amd64 version desired, given how little benefit it has? -JimC ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
James Cloos cl...@jhcloos.com writes: __asm__ __volatile__ ( movq %1, %%rax\n imul %2\n addq %%rdx, %%rax\n addq $0x8000, %%rax\n sarq $16, %%rax\n : =a(result) : g(a), g(b) : rdx ); The above code has a latency of 1+5+1+1+1 = 10 clocks on an amdfam10 cpu. ... Is the amd64 version desired, given how little benefit it has? If this is being used in a context where it might benefit from more scheduling, etc, perhaps it would help to let the compiler generate the non-imul insns (since it's pretty good at those)? E.g. something like: static __inline__ long FT_MulFix_x86_64 (long a, long b) { register long mr1, mr2; __asm__ (imul %3\n : =a (mr1), =d (mr2) : a (a), g (b)); return ((mr1 + mr2) + 0x8000) 16; } -miles -- Omochiroi! ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
Incidentally, you wrote: The assembly generated by the C code is 45 lines and 158 octets long, contains six conditional jumps, three each of explicit compares and tests, and still benchmarks are just as fast. Out-of-order processing wins out over hand-coded asm. :-/ ... but when I follow your original suggestion, and just do the following: typedef long s64; typedef unsigned int u32; static __inline__ u32 FT_MulFix_x86_64 (u32 a, u32 b) { return (((s64)a * (s64)b) + 0x8000) 16; } The compiler generates the following assembly: mov %esi, %eax mov %edi, %edi imulq %rdi, %rax addq$32768, %rax shrq$16, %rax The movs there are obviously a bit silly (compiler bug?), but that output seems reasonably close to the asm() version, and obviously much more schedulable since the compiler knows what the insns do... [I tried it with gcc-4.x for a few different xs, and the results are the same, and optimization flags don't seem to make much difference either.] -Miles -- Scriptures, n. The sacred books of our holy religion, as distinguished from the false and profane writings on which all other faiths are based. ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
Miles Bader mi...@gnu.org writes: The compiler generates the following assembly: mov %esi, %eax mov %edi, %edi imulq %rdi, %rax addq$32768, %rax shrq$16, %rax The movs there are obviously a bit silly (compiler bug?), but that output seems reasonably close to the asm() version, and obviously much more schedulable since the compiler knows what the insns do... [...Actually the redundant movs, while they may be a compiler bug, seem to be caused by the way I structured my test program; in a more realistic circumstance I don't see them.] -miles -- Abstainer, n. A weak person who yields to the temptation of denying himself a pleasure. A total abstainer is one who abstains from everything but abstention, and especially from inactivity in the affairs of others. ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
GA == Graham Asher graham.as...@btinternet.com writes: GA Have you done an ARM version? Forgive my inattentiveness if you've GA already announced one. It just struck me that this sort of GA optimisation is even more necessary on mobile devices. I386, arm and arm-thumb versions were already there. -JimC -- James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6 ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
MB == Miles Bader mi...@gnu.org writes: MB The compiler generates the following assembly: MB mov %esi, %eax MB mov %edi, %edi MB imulq %rdi, %rax MB addq$32768, %rax MB shrq$16, %rax That does not match the C code though; it rounds negative values wrong. The C version does away-from-zero rounding. Using the single arg version of imulq generates a 128 bit result; the more significant part of which will be 0 iff the product is =0 and will be -1 if the product is 0, given that the multiplicands were only 32 bits. Adding that, in addition to the 32768, to rax ensures that the result of the =16 is rounded the way freetype wants. If you use the two arg version of imul, you have to copy the msb of the result (or do a compare and jump, like the C code) to determine whether to add 0x8000 or 0x7FFF. Matching the rounding was the hardest part; noting that the upper 64 bits of the 128-bit product would always be just sign-extension bits and that, because of the prototype of FT_MulFix() itself, the vaules are already promoted to 64 bits before they get to the assembly were what provided the most (in-order) speedups. If it can be done better, though, I'd be happy to know! Thanks for also looking at it. -JimC -- James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6 ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
On Tue, Sep 7, 2010 at 4:28 AM, James Cloos cl...@jhcloos.com wrote: MB == Miles Bader mi...@gnu.org writes: MB The compiler generates the following assembly: MB mov %esi, %eax MB mov %edi, %edi MB imulq %rdi, %rax MB addq $32768, %rax MB shrq $16, %rax That does not match the C code though; it rounds negative values wrong. The C version does away-from-zero rounding. Do you have test cases that show this? I tried using random inputs, but even up to billions of iterations, I can't seem to find a set of inputs where my function yields different results from yours. Thanks, -Miles -- Cat is power. Cat is peace. ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
MB == Miles Bader mi...@gnu.org writes: The C version does away-from-zero rounding. MB Do you have test cases that show this? I tried using random inputs, MB but even up to billions of iterations, I can't seem to find a set of MB inputs where my function yields different results from yours. The C version saves the two signs, takes the absolute values, multiplies, scales and then sets the sign. When I tested, I used dd(1) to generate a quarter-gig file from urandom (I used a fixed file so that it would be reproducable), mmap(2)ed that to an int[], and went through two at a time. The C and my initial asm versions produced different results whenever the second int was -1 (ie 0x) and the first matched: (a 0 (a 0x == 0x8000)). In other words, multiplying something like 7.5 by -1/65536. An example of that test's output was: 7AFA8000, , 8505, 8506, 0 In that example, 8505 is what the C version generates. -JimC -- James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6 ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
The final result for amd64 looks like: static __inline__ long FT_MulFix_x86_64( long a, long b ) { register long result; __asm__ __volatile__ ( movq %1, %%rax\n imul %2\n addq %%rdx, %%rax\n addq $0x8000, %%rax\n sarq $16, %%rax\n : =a(result) : g(a), g(b) : rdx ); return result; } The use of long, though requires review. The C version uses FT_Long (not FT_Int32 like the other asm versions), but FT_Long is not a #define or a typedef at the point where the asm version are located. That said, using long there on amd64 prevents unnecessary 32-64 bit conversions in the resulting code. The above code has a latency of 1+5+1+1+1 = 10 clocks on an amdfam10 cpu. The assembly generated by the C code is 45 lines and 158 octets long, contains six conditional jumps, three each of explicit compares and tests, and still benchmarks are just as fast. Out-of-order processing wins out over hand-coded asm. :-/ It *might* make more of a difference on an in-order processor like the Arom. But I do not have one to test. I can still finish a patch, and have collected the info I need to do one for mips64, too, where I expect it will be more important. I also expect that the i386 version could be tidied a bit. Is the amd64 version desired, given how little benefit it has? -JimC -- James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6 ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
I have to finish the patch, but I thought I'd offer the algorithm for review, if anyone wants to. I haven't enough knowledge to comment, but thanks for working on it! Werner ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
My first cut at FT_MulFix_x86_64() is: static __inline__ FT_Int32 FT_MulFix_x86_64 (FT_Int32 a, FT_Int32 b) { register FT_Int32 r; __asm__ __volatile__ ( movslq %%edx, %%rdx\n cltq\n imul %%rdx\n addq %%rdx, %%rax\n addq $0x8000, %%rax\n sarq $16, %%rax\n : =a(r) : a(a), d(b)); return r; } It passes a monte-carlo test comparing its results to the C code and to the i386 assembly. The logic is simple. The first two instructions sign-extend the two values to 64 bits, the multiply puts the least significant 64 bits of the product in rax and the most significant bits in rdx; because the values started out as 32 bit, rdx is guaranteed to be only sign bits: zero if the product is =0, else -1. Adding the resulting rdx to rax serves the same purpose as the ecx value in the i386 version: it makes the rounding symmetric around zero, just like the C code. An alternative might be to cast the src values to (FT_Int64), but I doubt that the compiler would generate any better code than calling movslq and cltq. I have to finish the patch, but I thought I'd offer the algorithm for review, if anyone wants to. -JimC -- James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6 ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
[ft-devel] FT_MulFix assembly
I see implementations for ia32 and arm; would other platforms benefit from assembply implementations of MulFix? I know I could translate the i386 assembly into amd64. Probably others with sufficient documentation. But I can only test on amd64 and, thanks to cray-cyber.org, out of date sparc and mips. A gcc-cfarm account might permit wider testing, though. I don't know how much of the market mips, sh4, et al have for small devices with gui, but those types of devices would seem to have the most need for improved performance, ja? Is there interest? And is there a good cli to use for testing; I'll need to output renderings to png or netpbm when testing on remote boxen -JimC -- James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6 ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
I see implementations for ia32 and arm; would other platforms benefit from assembply implementations of MulFix? As usual: patches are highly welcomed. Werner ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel