Re: [ft-devel] FT_MulFix assembly
Have you done an ARM version? Forgive my inattentiveness if you've already announced one. It just struck me that this sort of optimisation is even more necessary on mobile devices. Graham James Cloos wrote: The final result for amd64 looks like: static __inline__ long FT_MulFix_x86_64( long a, long b ) { register long result; __asm__ __volatile__ ( movq %1, %%rax\n imul %2\n addq %%rdx, %%rax\n addq $0x8000, %%rax\n sarq $16, %%rax\n : =a(result) : g(a), g(b) : rdx ); return result; } The use of long, though requires review. The C version uses FT_Long (not FT_Int32 like the other asm versions), but FT_Long is not a #define or a typedef at the point where the asm version are located. That said, using long there on amd64 prevents unnecessary 32-64 bit conversions in the resulting code. The above code has a latency of 1+5+1+1+1 = 10 clocks on an amdfam10 cpu. The assembly generated by the C code is 45 lines and 158 octets long, contains six conditional jumps, three each of explicit compares and tests, and still benchmarks are just as fast. Out-of-order processing wins out over hand-coded asm. :-/ It *might* make more of a difference on an in-order processor like the Arom. But I do not have one to test. I can still finish a patch, and have collected the info I need to do one for mips64, too, where I expect it will be more important. I also expect that the i386 version could be tidied a bit. Is the amd64 version desired, given how little benefit it has? -JimC ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] freetype.com
phil song phils...@techtrex.com writes: Excuse me ,I don't understand why we need tell the slashdot.org people? It's just a quick way to generate a lot of publicity about monotype's dirty tricks... -Miles -- We are all lying in the gutter, but some of us are looking at the stars. -Oscar Wilde ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
James Cloos cl...@jhcloos.com writes: __asm__ __volatile__ ( movq %1, %%rax\n imul %2\n addq %%rdx, %%rax\n addq $0x8000, %%rax\n sarq $16, %%rax\n : =a(result) : g(a), g(b) : rdx ); The above code has a latency of 1+5+1+1+1 = 10 clocks on an amdfam10 cpu. ... Is the amd64 version desired, given how little benefit it has? If this is being used in a context where it might benefit from more scheduling, etc, perhaps it would help to let the compiler generate the non-imul insns (since it's pretty good at those)? E.g. something like: static __inline__ long FT_MulFix_x86_64 (long a, long b) { register long mr1, mr2; __asm__ (imul %3\n : =a (mr1), =d (mr2) : a (a), g (b)); return ((mr1 + mr2) + 0x8000) 16; } -miles -- Omochiroi! ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
Incidentally, you wrote: The assembly generated by the C code is 45 lines and 158 octets long, contains six conditional jumps, three each of explicit compares and tests, and still benchmarks are just as fast. Out-of-order processing wins out over hand-coded asm. :-/ ... but when I follow your original suggestion, and just do the following: typedef long s64; typedef unsigned int u32; static __inline__ u32 FT_MulFix_x86_64 (u32 a, u32 b) { return (((s64)a * (s64)b) + 0x8000) 16; } The compiler generates the following assembly: mov %esi, %eax mov %edi, %edi imulq %rdi, %rax addq$32768, %rax shrq$16, %rax The movs there are obviously a bit silly (compiler bug?), but that output seems reasonably close to the asm() version, and obviously much more schedulable since the compiler knows what the insns do... [I tried it with gcc-4.x for a few different xs, and the results are the same, and optimization flags don't seem to make much difference either.] -Miles -- Scriptures, n. The sacred books of our holy religion, as distinguished from the false and profane writings on which all other faiths are based. ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
Miles Bader mi...@gnu.org writes: The compiler generates the following assembly: mov %esi, %eax mov %edi, %edi imulq %rdi, %rax addq$32768, %rax shrq$16, %rax The movs there are obviously a bit silly (compiler bug?), but that output seems reasonably close to the asm() version, and obviously much more schedulable since the compiler knows what the insns do... [...Actually the redundant movs, while they may be a compiler bug, seem to be caused by the way I structured my test program; in a more realistic circumstance I don't see them.] -miles -- Abstainer, n. A weak person who yields to the temptation of denying himself a pleasure. A total abstainer is one who abstains from everything but abstention, and especially from inactivity in the affairs of others. ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
[ft-devel] latest patch file for spline flattening
Here's a new version of my spline flattening patch. (I would like to be able to push this to the git repository but am having authentication problems; Werner has been helping me, but no success so far, probably because of my ineptitude in these matters.). The nub of the latest change is that I found that predicting the number of recursion levels is not reliable when splitting a cubic spline for flattening. A better way is to check the flatness inside the loop - using the fast heuristic of taking the maximum coordinate difference of a control point from the chord midpoint. This also makes the code simpler - and, surprisingly, faster, according to my benchmark; however, my benchmark is based on cartographic, not typographic, use, so will need confirmation. The patch obviously still solves the bug involving s-shaped curves (control points on both sides of the chord). Graham diff --git a/C:\\DOCUME~1\\Graham\\LOCALS~1\\Temp\\ftgrays_c8f5b9.c b/C:\\FreeType\\freetype2\\src\\smooth\\ftgrays.c index 0b94143..3a5e3cf 100644 --- a/C:\\DOCUME~1\\Graham\\LOCALS~1\\Temp\\ftgrays_c8f5b9.c +++ b/C:\\FreeType\\freetype2\\src\\smooth\\ftgrays.c @@ -90,6 +90,9 @@ #undef FT_COMPONENT #define FT_COMPONENT trace_smooth +/* The maximum distance of a curve from the chord, in 64ths of a pixel; */ +/* used when flattening curves. */ +#define FT_MAX_CURVE_DEVIATION 16 #ifdef _STANDALONE_ @@ -354,8 +357,6 @@ typedef ptrdiff_t FT_PtrDist; int band_size; int band_shoot; -int conic_level; -int cubic_level; ft_jmp_buf jump_buffer; @@ -888,31 +889,18 @@ typedef ptrdiff_t FT_PtrDist; if ( dx dy ) dx = dy; -level = 1; -dx = dx / ras.conic_level; -while ( dx 0 ) +if ( dx = FT_MAX_CURVE_DEVIATION ) { - dx = 2; - level++; + gray_render_line( RAS_VAR_ UPSCALE( to-x ), UPSCALE( to-y ) ); + return; } -/* a shortcut to speed things up */ -if ( level = 1 ) +level = 1; +dx /= FT_MAX_CURVE_DEVIATION; +while ( dx 1 ) { - /* we compute the mid-point directly in order to avoid */ - /* calling gray_split_conic() */ - TPos to_x, to_y, mid_x, mid_y; - - - to_x = UPSCALE( to-x ); - to_y = UPSCALE( to-y ); - mid_x = ( ras.x + to_x + 2 * UPSCALE( control-x ) ) / 4; - mid_y = ( ras.y + to_y + 2 * UPSCALE( control-y ) ) / 4; - - gray_render_line( RAS_VAR_ mid_x, mid_y ); - gray_render_line( RAS_VAR_ to_x, to_y ); - - return; + dx = 2; + level++; } arc = ras.bez_stack; @@ -957,21 +945,9 @@ typedef ptrdiff_t FT_PtrDist; } Draw: - { -TPos to_x, to_y, mid_x, mid_y; - - -to_x = arc[0].x; -to_y = arc[0].y; -mid_x = ( ras.x + to_x + 2 * arc[1].x ) / 4; -mid_y = ( ras.y + to_y + 2 * arc[1].y ) / 4; - -gray_render_line( RAS_VAR_ mid_x, mid_y ); -gray_render_line( RAS_VAR_ to_x, to_y ); - -top--; -arc -= 2; - } + gray_render_line( RAS_VAR_ arc[0].x, arc[0].y ); + top--; + arc -= 2; } return; @@ -1011,56 +987,9 @@ typedef ptrdiff_t FT_PtrDist; const FT_Vector* control2, const FT_Vector* to ) { -int top, level; -int*levels; FT_Vector* arc; -int mid_x = ( DOWNSCALE( ras.x ) + to-x + - 3 * (control1-x + control2-x ) ) / 8; -int mid_y = ( DOWNSCALE( ras.y ) + to-y + - 3 * (control1-y + control2-y ) ) / 8; -TPosdx = DOWNSCALE( ras.x ) + to-x - ( mid_x 1 ); -TPosdy = DOWNSCALE( ras.y ) + to-y - ( mid_y 1 ); -if ( dx 0 ) - dx = -dx; -if ( dy 0 ) - dy = -dy; -if ( dx dy ) - dx = dy; - -level = 1; -dx /= ras.cubic_level; -while ( dx 0 ) -{ - dx = 2; - level++; -} - -if ( level = 1 ) -{ - TPos to_x, to_y; - - - to_x = UPSCALE( to-x ); - to_y = UPSCALE( to-y ); - - /* Recalculation of midpoint is needed only if */ - /* UPSCALE and DOWNSCALE have any effect. */ - -#if ( PIXEL_BITS != 6 ) - mid_x = ( ras.x + to_x + -3 * UPSCALE( control1-x + control2-x ) ) / 8; - mid_y = ( ras.y + to_y + -3 * UPSCALE( control1-y + control2-y ) ) / 8; -#endif - - gray_render_line( RAS_VAR_ mid_x, mid_y ); - gray_render_line( RAS_VAR_ to_x, to_y ); - - return; -} - arc = ras.bez_stack; arc[0].x = UPSCALE( to-x ); arc[0].y = UPSCALE( to-y ); @@ -1071,56 +1000,66 @@ typedef ptrdiff_t FT_PtrDist; arc[3].x = ras.x; arc[3].y = ras.y; -levels= ras.lev_stack; -top = 0; -levels[0] = level; - -while ( top = 0 ) +for (;;) { - level = levels[top]; - if ( level 1 ) - { -/* check
RE: [ft-devel] latest patch file for spline flattening
Graham, That's looking much closer to what I would have thought we needed; only splitting the curve when required. However, your fast heuristic can be very inaccurate. Consider P0: 0,0 P1: 5,5 P2: 95,5 P3: 100,0 The max deviation is 3.75 (0.75 * 5 since Hain's v == 1), but your heuristic gives a value of 45 - twelve times too great. Btw, I think that dx1, dy1, ... need to be of type TPos, not int. On the issue of avoiding square roots, a little bit of algebra shows that it is possible to estimate r = sqrt(dx^2 + dy^2) with a simple linear combination of dx and dy. For example, if an overestimate is required, and dx = dy, you can use r_upperbound = dx + (sqrt(2) - 1) * dy which overestimates by no more than 8.5%. For integer arithmetic, sqrt(2) - 1 can be (over)estimated by 107/256. For example, you could use this approximation to do something like this: dx1 = abs(control1-x - midx); dy1 = abs(control1-y - midy); if (dx1 = dy1) dr1 = dx1 + (107 * dy1 + 255) 8; else dr1 = dy1 + (107 * dx1 + 255) 8; dx2 = abs(control2-x - midx); dy2 = abs(control2-y - midy); if (dx2 = dy2) dr2 = dx2 + (107 * dy2 + 255) 8; else dr2 = dy2 + (107 * dx2 + 255) 8; /* deviation never exceeds 75% of control point dist */ if (dr1 = dr2) dev_max = (3 * dr1 + 3) 2; else dev_max = (3 * dr2 + 3) 2; if (dev_max = FT_MAX_CURVE_DEVIATION) ... Of course, this doesn't resolve the problem I mentioned above; for the example curve, this gives dev_max a value of 36 - still nine times too great. I hope to have something based a bit more closely on Hain's paper by the end of tomorrow. I may use something like the square-root avoiding estimate above to approximate his L value. David %^ -Original Message- From: freetype-devel-bounces+david.bevan=pb@nongnu.org [mailto:freetype-devel-bounces+david.bevan=pb@nongnu.org] On Behalf Of Graham Asher Sent: 6 September 2010 11:10 To: freetype-devel Subject: [ft-devel] latest patch file for spline flattening Here's a new version of my spline flattening patch. (I would like to be able to push this to the git repository but am having authentication problems; Werner has been helping me, but no success so far, probably because of my ineptitude in these matters.). The nub of the latest change is that I found that predicting the number of recursion levels is not reliable when splitting a cubic spline for flattening. A better way is to check the flatness inside the loop - using the fast heuristic of taking the maximum coordinate difference of a control point from the chord midpoint. This also makes the code simpler - and, surprisingly, faster, according to my benchmark; however, my benchmark is based on cartographic, not typographic, use, so will need confirmation. The patch obviously still solves the bug involving s-shaped curves (control points on both sides of the chord). Graham ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] latest patch file for spline flattening
David, in fact I coded up and tested a different version using an accurate calculation of the control point deviation, but it was slower than the version I am proposing. I'll try your version; and I would be grateful if you could also do some benchmarking, because obviously we are not trying to minimise the number of straight line segments the curve is dissected into, but the overall speed. My aims, and my benchmarking, are rather biased, because my main use of cubic splines is for approximations to arcs of circles, used as parts of path envelopes when drawing maps. Graham David Bevan wrote: Graham, That's looking much closer to what I would have thought we needed; only splitting the curve when required. However, your fast heuristic can be very inaccurate. Consider P0: 0,0 P1: 5,5 P2: 95,5 P3: 100,0 The max deviation is 3.75 (0.75 * 5 since Hain's v == 1), but your heuristic gives a value of 45 - twelve times too great. Btw, I think that dx1, dy1, ... need to be of type TPos, not int. On the issue of avoiding square roots, a little bit of algebra shows that it is possible to estimate r = sqrt(dx^2 + dy^2) with a simple linear combination of dx and dy. For example, if an overestimate is required, and dx = dy, you can use r_upperbound = dx + (sqrt(2) - 1) * dy which overestimates by no more than 8.5%. For integer arithmetic, sqrt(2) - 1 can be (over)estimated by 107/256. For example, you could use this approximation to do something like this: dx1 = abs(control1-x - midx); dy1 = abs(control1-y - midy); if (dx1 = dy1) dr1 = dx1 + (107 * dy1 + 255) 8; else dr1 = dy1 + (107 * dx1 + 255) 8; dx2 = abs(control2-x - midx); dy2 = abs(control2-y - midy); if (dx2 = dy2) dr2 = dx2 + (107 * dy2 + 255) 8; else dr2 = dy2 + (107 * dx2 + 255) 8; /* deviation never exceeds 75% of control point dist */ if (dr1 = dr2) dev_max = (3 * dr1 + 3) 2; else dev_max = (3 * dr2 + 3) 2; if (dev_max = FT_MAX_CURVE_DEVIATION) ... Of course, this doesn't resolve the problem I mentioned above; for the example curve, this gives dev_max a value of 36 - still nine times too great. I hope to have something based a bit more closely on Hain's paper by the end of tomorrow. I may use something like the square-root avoiding estimate above to approximate his L value. David %^ -Original Message- From: freetype-devel-bounces+david.bevan=pb@nongnu.org [mailto:freetype-devel-bounces+david.bevan=pb@nongnu.org] On Behalf Of Graham Asher Sent: 6 September 2010 11:10 To: freetype-devel Subject: [ft-devel] latest patch file for spline flattening Here's a new version of my spline flattening patch. (I would like to be able to push this to the git repository but am having authentication problems; Werner has been helping me, but no success so far, probably because of my ineptitude in these matters.). The nub of the latest change is that I found that predicting the number of recursion levels is not reliable when splitting a cubic spline for flattening. A better way is to check the flatness inside the loop - using the fast heuristic of taking the maximum coordinate difference of a control point from the chord midpoint. This also makes the code simpler - and, surprisingly, faster, according to my benchmark; however, my benchmark is based on cartographic, not typographic, use, so will need confirmation. The patch obviously still solves the bug involving s-shaped curves (control points on both sides of the chord). Graham ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
RE: [ft-devel] latest patch file for spline flattening
I'll do that. I wonder how much of the cost of FT_Load_Glyph is actually spent in gray_render_cubic and how much impact reducing the number of line segments has on later phases of the rendering process. I'm also trying to see if I can come up with a heuristic that is both fast (i.e. simple linear combinations of positions) and accurate (i.e. always correct and any overestimate is reasonably bounded). Then we may be able to avoid the complexities of Hain's approach. We so seem to be converging towards something that meets all the requirements. I probably won't achieve much today, but hope to spend most of tomorrow on this. David %^ -Original Message- From: Graham Asher [mailto:graham.as...@btinternet.com] Sent: 6 September 2010 12:36 To: David Bevan Cc: freetype-devel Subject: Re: [ft-devel] latest patch file for spline flattening David, in fact I coded up and tested a different version using an accurate calculation of the control point deviation, but it was slower than the version I am proposing. I'll try your version; and I would be grateful if you could also do some benchmarking, because obviously we are not trying to minimise the number of straight line segments the curve is dissected into, but the overall speed. My aims, and my benchmarking, are rather biased, because my main use of cubic splines is for approximations to arcs of circles, used as parts of path envelopes when drawing maps. Graham David Bevan wrote: Graham, That's looking much closer to what I would have thought we needed; only splitting the curve when required. However, your fast heuristic can be very inaccurate. Consider P0: 0,0 P1: 5,5 P2: 95,5 P3: 100,0 The max deviation is 3.75 (0.75 * 5 since Hain's v == 1), but your heuristic gives a value of 45 - twelve times too great. Btw, I think that dx1, dy1, ... need to be of type TPos, not int. On the issue of avoiding square roots, a little bit of algebra shows that it is possible to estimate r = sqrt(dx^2 + dy^2) with a simple linear combination of dx and dy. For example, if an overestimate is required, and dx = dy, you can use r_upperbound = dx + (sqrt(2) - 1) * dy which overestimates by no more than 8.5%. For integer arithmetic, sqrt(2) - 1 can be (over)estimated by 107/256. For example, you could use this approximation to do something like this: dx1 = abs(control1-x - midx); dy1 = abs(control1-y - midy); if (dx1 = dy1) dr1 = dx1 + (107 * dy1 + 255) 8; else dr1 = dy1 + (107 * dx1 + 255) 8; dx2 = abs(control2-x - midx); dy2 = abs(control2-y - midy); if (dx2 = dy2) dr2 = dx2 + (107 * dy2 + 255) 8; else dr2 = dy2 + (107 * dx2 + 255) 8; /* deviation never exceeds 75% of control point dist */ if (dr1 = dr2) dev_max = (3 * dr1 + 3) 2; else dev_max = (3 * dr2 + 3) 2; if (dev_max = FT_MAX_CURVE_DEVIATION) ... Of course, this doesn't resolve the problem I mentioned above; for the example curve, this gives dev_max a value of 36 - still nine times too great. I hope to have something based a bit more closely on Hain's paper by the end of tomorrow. I may use something like the square-root avoiding estimate above to approximate his L value. David %^ -Original Message- From: freetype-devel-bounces+david.bevan=pb@nongnu.org [mailto:freetype-devel-bounces+david.bevan=pb@nongnu.org] On Behalf Of Graham Asher Sent: 6 September 2010 11:10 To: freetype-devel Subject: [ft-devel] latest patch file for spline flattening Here's a new version of my spline flattening patch. (I would like to be able to push this to the git repository but am having authentication problems; Werner has been helping me, but no success so far, probably because of my ineptitude in these matters.). The nub of the latest change is that I found that predicting the number of recursion levels is not reliable when splitting a cubic spline for flattening. A better way is to check the flatness inside the loop - using the fast heuristic of taking the maximum coordinate difference of a control point from the chord midpoint. This also makes the code simpler - and, surprisingly, faster, according to my benchmark; however, my benchmark is based on cartographic, not typographic, use, so will need confirmation. The patch obviously still solves the bug involving s-shaped curves (control points on both sides of the chord). Graham ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
GA == Graham Asher graham.as...@btinternet.com writes: GA Have you done an ARM version? Forgive my inattentiveness if you've GA already announced one. It just struck me that this sort of GA optimisation is even more necessary on mobile devices. I386, arm and arm-thumb versions were already there. -JimC -- James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6 ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
MB == Miles Bader mi...@gnu.org writes: MB The compiler generates the following assembly: MB mov %esi, %eax MB mov %edi, %edi MB imulq %rdi, %rax MB addq$32768, %rax MB shrq$16, %rax That does not match the C code though; it rounds negative values wrong. The C version does away-from-zero rounding. Using the single arg version of imulq generates a 128 bit result; the more significant part of which will be 0 iff the product is =0 and will be -1 if the product is 0, given that the multiplicands were only 32 bits. Adding that, in addition to the 32768, to rax ensures that the result of the =16 is rounded the way freetype wants. If you use the two arg version of imul, you have to copy the msb of the result (or do a compare and jump, like the C code) to determine whether to add 0x8000 or 0x7FFF. Matching the rounding was the hardest part; noting that the upper 64 bits of the 128-bit product would always be just sign-extension bits and that, because of the prototype of FT_MulFix() itself, the vaules are already promoted to 64 bits before they get to the assembly were what provided the most (in-order) speedups. If it can be done better, though, I'd be happy to know! Thanks for also looking at it. -JimC -- James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6 ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
On Tue, Sep 7, 2010 at 4:28 AM, James Cloos cl...@jhcloos.com wrote: MB == Miles Bader mi...@gnu.org writes: MB The compiler generates the following assembly: MB mov %esi, %eax MB mov %edi, %edi MB imulq %rdi, %rax MB addq $32768, %rax MB shrq $16, %rax That does not match the C code though; it rounds negative values wrong. The C version does away-from-zero rounding. Do you have test cases that show this? I tried using random inputs, but even up to billions of iterations, I can't seem to find a set of inputs where my function yields different results from yours. Thanks, -Miles -- Cat is power. Cat is peace. ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel
Re: [ft-devel] FT_MulFix assembly
MB == Miles Bader mi...@gnu.org writes: The C version does away-from-zero rounding. MB Do you have test cases that show this? I tried using random inputs, MB but even up to billions of iterations, I can't seem to find a set of MB inputs where my function yields different results from yours. The C version saves the two signs, takes the absolute values, multiplies, scales and then sets the sign. When I tested, I used dd(1) to generate a quarter-gig file from urandom (I used a fixed file so that it would be reproducable), mmap(2)ed that to an int[], and went through two at a time. The C and my initial asm versions produced different results whenever the second int was -1 (ie 0x) and the first matched: (a 0 (a 0x == 0x8000)). In other words, multiplying something like 7.5 by -1/65536. An example of that test's output was: 7AFA8000, , 8505, 8506, 0 In that example, 8505 is what the C version generates. -JimC -- James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6 ___ Freetype-devel mailing list Freetype-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/freetype-devel