Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Graham Asher
Have you done an ARM version? Forgive my inattentiveness if you've 
already announced one. It just struck me that this sort of optimisation 
is even more necessary on mobile devices.


Graham


James Cloos wrote:

The final result for amd64 looks like:

  static __inline__ long
  FT_MulFix_x86_64( long  a,
   long  b )
  {
register long  result;

__asm__ __volatile__ (
  movq  %1, %%rax\n
  imul  %2\n
  addq  %%rdx, %%rax\n
  addq  $0x8000, %%rax\n
  sarq  $16, %%rax\n
  : =a(result)
  : g(a), g(b)
  : rdx );
return result;
  }


The use of long, though requires review.  The C version uses FT_Long
(not FT_Int32 like the other asm versions), but FT_Long is not a #define
or a typedef at the point where the asm version are located.

That said, using long there on amd64 prevents unnecessary 32-64 bit
conversions in the resulting code.

The above code has a latency of 1+5+1+1+1 = 10 clocks on an amdfam10 cpu.

The assembly generated by the C code is 45 lines and 158 octets long,
contains six conditional jumps, three each of explicit compares and
tests, and still benchmarks are just as fast.  Out-of-order processing
wins out over hand-coded asm. :-/

It *might* make more of a difference on an in-order processor like the
Arom.  But I do not have one to test.

I can still finish a patch, and have collected the info I need to do one
for mips64, too, where I expect it will be more important.  I also expect
that the i386 version could be tidied a bit.

Is the amd64 version desired, given how little benefit it has?

-JimC
  



___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] freetype.com

2010-09-06 Thread Miles Bader
phil song phils...@techtrex.com writes:
   Excuse me ,I don't understand why we need tell the slashdot.org people? 
  

It's just a quick way to generate a lot of publicity about monotype's
dirty tricks...

-Miles

-- 
We are all lying in the gutter, but some of us are looking at the stars.
-Oscar Wilde


___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Miles Bader
James Cloos cl...@jhcloos.com writes:
 __asm__ __volatile__ (
   movq  %1, %%rax\n
   imul  %2\n
   addq  %%rdx, %%rax\n
   addq  $0x8000, %%rax\n
   sarq  $16, %%rax\n
   : =a(result)
   : g(a), g(b)
   : rdx );

 The above code has a latency of 1+5+1+1+1 = 10 clocks on an amdfam10 cpu.
...
 Is the amd64 version desired, given how little benefit it has?

If this is being used in a context where it might benefit from more
scheduling, etc, perhaps it would help to let the compiler generate the
non-imul insns (since it's pretty good at those)?

E.g. something like:

  static __inline__ long
  FT_MulFix_x86_64 (long a, long b)
  {
register long  mr1, mr2;
__asm__ (imul  %3\n : =a (mr1), =d (mr2) : a (a), g (b));
return ((mr1 + mr2) + 0x8000)  16;
  }


-miles

-- 
Omochiroi!


___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Miles Bader
Incidentally, you wrote:

 The assembly generated by the C code is 45 lines and 158 octets long,
 contains six conditional jumps, three each of explicit compares and
 tests, and still benchmarks are just as fast.  Out-of-order processing
 wins out over hand-coded asm. :-/

... but when I follow your original suggestion, and just do the following:

   typedef long s64;
   typedef unsigned int u32;

   static __inline__ u32
   FT_MulFix_x86_64 (u32 a, u32 b)
   {
 return (((s64)a * (s64)b) + 0x8000)  16;
   }

The compiler generates the following assembly:

mov %esi, %eax
mov %edi, %edi
imulq   %rdi, %rax
addq$32768, %rax
shrq$16, %rax

The movs there are obviously a bit silly (compiler bug?), but that
output seems reasonably close to the asm() version, and obviously much
more schedulable since the compiler knows what the insns do...

[I tried it with gcc-4.x for a few different xs, and the results are the
same, and optimization flags don't seem to make much difference either.]

-Miles

-- 
Scriptures, n. The sacred books of our holy religion, as distinguished from
the false and profane writings on which all other faiths are based.


___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Miles Bader
Miles Bader mi...@gnu.org writes:
 The compiler generates the following assembly:

   mov %esi, %eax
   mov %edi, %edi
   imulq   %rdi, %rax
   addq$32768, %rax
   shrq$16, %rax

 The movs there are obviously a bit silly (compiler bug?), but that
 output seems reasonably close to the asm() version, and obviously much
 more schedulable since the compiler knows what the insns do...

[...Actually the redundant movs, while they may be a compiler bug, seem
to be caused by the way I structured my test program; in a more
realistic circumstance I don't see them.]

-miles

-- 
Abstainer, n. A weak person who yields to the temptation of denying himself a
pleasure. A total abstainer is one who abstains from everything but
abstention, and especially from inactivity in the affairs of others.


___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


[ft-devel] latest patch file for spline flattening

2010-09-06 Thread Graham Asher
Here's a new version of my spline flattening patch. (I would like to be 
able to push this to the git repository but am having authentication 
problems; Werner has been helping me, but no success so far, probably 
because of my ineptitude in these matters.).


The nub of the latest change is that I found that predicting the number 
of recursion levels is not reliable when splitting a cubic spline for 
flattening. A better way is to check the flatness inside the loop - 
using the fast heuristic of taking the maximum coordinate difference of 
a control point from the chord midpoint. This also makes the code 
simpler - and, surprisingly, faster, according to my benchmark; however, 
my benchmark is based on cartographic, not typographic, use, so will 
need confirmation.


The patch obviously still solves the bug involving s-shaped curves 
(control points on both sides of the chord).


Graham

diff --git a/C:\\DOCUME~1\\Graham\\LOCALS~1\\Temp\\ftgrays_c8f5b9.c 
b/C:\\FreeType\\freetype2\\src\\smooth\\ftgrays.c
index 0b94143..3a5e3cf 100644
--- a/C:\\DOCUME~1\\Graham\\LOCALS~1\\Temp\\ftgrays_c8f5b9.c
+++ b/C:\\FreeType\\freetype2\\src\\smooth\\ftgrays.c
@@ -90,6 +90,9 @@
 #undef  FT_COMPONENT
 #define FT_COMPONENT  trace_smooth
 
+/* The maximum distance of a curve from the chord, in 64ths of a pixel; */
+/* used when flattening curves. */
+#define FT_MAX_CURVE_DEVIATION 16
 
 #ifdef _STANDALONE_
 
@@ -354,8 +357,6 @@ typedef ptrdiff_t  FT_PtrDist;
 
 int  band_size;
 int  band_shoot;
-int  conic_level;
-int  cubic_level;
 
 ft_jmp_buf  jump_buffer;
 
@@ -888,31 +889,18 @@ typedef ptrdiff_t  FT_PtrDist;
 if ( dx  dy )
   dx = dy;
 
-level = 1;
-dx = dx / ras.conic_level;
-while ( dx  0 )
+if ( dx = FT_MAX_CURVE_DEVIATION )
 {
-  dx = 2;
-  level++;
+  gray_render_line( RAS_VAR_ UPSCALE( to-x ), UPSCALE( to-y ) );
+  return;
 }
 
-/* a shortcut to speed things up */
-if ( level = 1 )
+level = 1;
+dx /= FT_MAX_CURVE_DEVIATION;
+while ( dx  1 )
 {
-  /* we compute the mid-point directly in order to avoid */
-  /* calling gray_split_conic()  */
-  TPos  to_x, to_y, mid_x, mid_y;
-
-
-  to_x  = UPSCALE( to-x );
-  to_y  = UPSCALE( to-y );
-  mid_x = ( ras.x + to_x + 2 * UPSCALE( control-x ) ) / 4;
-  mid_y = ( ras.y + to_y + 2 * UPSCALE( control-y ) ) / 4;
-
-  gray_render_line( RAS_VAR_ mid_x, mid_y );
-  gray_render_line( RAS_VAR_ to_x, to_y );
-
-  return;
+  dx = 2;
+  level++;
 }
 
 arc   = ras.bez_stack;
@@ -957,21 +945,9 @@ typedef ptrdiff_t  FT_PtrDist;
   }
 
 Draw:
-  {
-TPos  to_x, to_y, mid_x, mid_y;
-
-
-to_x  = arc[0].x;
-to_y  = arc[0].y;
-mid_x = ( ras.x + to_x + 2 * arc[1].x ) / 4;
-mid_y = ( ras.y + to_y + 2 * arc[1].y ) / 4;
-
-gray_render_line( RAS_VAR_ mid_x, mid_y );
-gray_render_line( RAS_VAR_ to_x, to_y );
-
-top--;
-arc -= 2;
-  }
+  gray_render_line( RAS_VAR_ arc[0].x, arc[0].y );
+  top--;
+  arc -= 2;
 }
 
 return;
@@ -1011,56 +987,9 @@ typedef ptrdiff_t  FT_PtrDist;
   const FT_Vector*  control2,
   const FT_Vector*  to )
   {
-int top, level;
-int*levels;
 FT_Vector*  arc;
-int mid_x = ( DOWNSCALE( ras.x ) + to-x +
-  3 * (control1-x + control2-x ) ) / 8;
-int mid_y = ( DOWNSCALE( ras.y ) + to-y +
-  3 * (control1-y + control2-y ) ) / 8;
-TPosdx = DOWNSCALE( ras.x ) + to-x - ( mid_x  1 );
-TPosdy = DOWNSCALE( ras.y ) + to-y - ( mid_y  1 );
 
 
-if ( dx  0 )
-  dx = -dx;
-if ( dy  0 )
-  dy = -dy;
-if ( dx  dy )
-  dx = dy;
-
-level = 1;
-dx /= ras.cubic_level;
-while ( dx  0 )
-{
-  dx = 2;
-  level++;
-}
-
-if ( level = 1 )
-{
-  TPos  to_x, to_y;
-
-
-  to_x  = UPSCALE( to-x );
-  to_y  = UPSCALE( to-y );
-
-  /* Recalculation of midpoint is needed only if */
-  /* UPSCALE and DOWNSCALE have any effect.  */
-
-#if ( PIXEL_BITS != 6 )
-  mid_x = ( ras.x + to_x +
-3 * UPSCALE( control1-x + control2-x ) ) / 8;
-  mid_y = ( ras.y + to_y +
-3 * UPSCALE( control1-y + control2-y ) ) / 8;
-#endif
-
-  gray_render_line( RAS_VAR_ mid_x, mid_y );
-  gray_render_line( RAS_VAR_ to_x, to_y );
-
-  return;
-}
-
 arc  = ras.bez_stack;
 arc[0].x = UPSCALE( to-x );
 arc[0].y = UPSCALE( to-y );
@@ -1071,56 +1000,66 @@ typedef ptrdiff_t  FT_PtrDist;
 arc[3].x = ras.x;
 arc[3].y = ras.y;
 
-levels= ras.lev_stack;
-top   = 0;
-levels[0] = level;
-
-while ( top = 0 )
+for (;;)
 {
-  level = levels[top];
-  if ( level  1 )
-  {
-/* check 

RE: [ft-devel] latest patch file for spline flattening

2010-09-06 Thread David Bevan
Graham,

That's looking much closer to what I would have thought we needed; only 
splitting the curve when required. However, your fast heuristic can be very 
inaccurate.

Consider

  P0: 0,0
  P1: 5,5
  P2: 95,5
  P3: 100,0

The max deviation is 3.75 (0.75 * 5 since Hain's v == 1), but your heuristic 
gives a value of 45 - twelve times too great.


Btw, I think that dx1, dy1, ... need to be of type TPos, not int.


On the issue of avoiding square roots, a little bit of algebra shows that it is 
possible to estimate r = sqrt(dx^2 + dy^2) with a simple linear combination of 
dx and dy.

For example, if an overestimate is required, and dx = dy, you can use

  r_upperbound = dx + (sqrt(2) - 1) * dy

which overestimates by no more than 8.5%.

For integer arithmetic, sqrt(2) - 1 can be (over)estimated by 107/256.

For example, you could use this approximation to do something like this:

  dx1 = abs(control1-x - midx);
  dy1 = abs(control1-y - midy);
  if (dx1 = dy1)
dr1 = dx1 + (107 * dy1 + 255)  8;
  else
dr1 = dy1 + (107 * dx1 + 255)  8;

  dx2 = abs(control2-x - midx);
  dy2 = abs(control2-y - midy);
  if (dx2 = dy2)
dr2 = dx2 + (107 * dy2 + 255)  8;
  else
dr2 = dy2 + (107 * dx2 + 255)  8;  

  /* deviation never exceeds 75% of control point dist */
  if (dr1 = dr2)
dev_max = (3 * dr1 + 3)  2;
  else
dev_max = (3 * dr2 + 3)  2;

  if (dev_max = FT_MAX_CURVE_DEVIATION)
...


Of course, this doesn't resolve the problem I mentioned above; for the example 
curve, this gives dev_max a value of 36 - still nine times too great.


I hope to have something based a bit more closely on Hain's paper by the end of 
tomorrow. I may use something like the square-root avoiding estimate above to 
approximate his L value.


David %^


 -Original Message-
 From: freetype-devel-bounces+david.bevan=pb@nongnu.org
 [mailto:freetype-devel-bounces+david.bevan=pb@nongnu.org] On Behalf Of
 Graham Asher
 Sent: 6 September 2010 11:10
 To: freetype-devel
 Subject: [ft-devel] latest patch file for spline flattening
 
 Here's a new version of my spline flattening patch. (I would like to be
 able to push this to the git repository but am having authentication
 problems; Werner has been helping me, but no success so far, probably
 because of my ineptitude in these matters.).
 
 The nub of the latest change is that I found that predicting the number
 of recursion levels is not reliable when splitting a cubic spline for
 flattening. A better way is to check the flatness inside the loop -
 using the fast heuristic of taking the maximum coordinate difference of
 a control point from the chord midpoint. This also makes the code
 simpler - and, surprisingly, faster, according to my benchmark; however,
 my benchmark is based on cartographic, not typographic, use, so will
 need confirmation.
 
 The patch obviously still solves the bug involving s-shaped curves
 (control points on both sides of the chord).
 
 Graham


___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] latest patch file for spline flattening

2010-09-06 Thread Graham Asher

David,

in fact I coded up and tested a different version using an accurate 
calculation of the control point deviation, but it was slower than the 
version I am proposing. I'll try your version; and I would be grateful 
if you could also do some benchmarking, because obviously we are not 
trying to minimise the number of straight line segments the curve is 
dissected into, but the overall speed. My aims, and my benchmarking, are 
rather biased, because my main use of cubic splines is for 
approximations to arcs of circles, used as parts of path envelopes when 
drawing maps.


Graham


David Bevan wrote:

Graham,

That's looking much closer to what I would have thought we needed; only splitting the 
curve when required. However, your fast heuristic can be very inaccurate.

Consider

  P0: 0,0
  P1: 5,5
  P2: 95,5
  P3: 100,0

The max deviation is 3.75 (0.75 * 5 since Hain's v == 1), but your heuristic 
gives a value of 45 - twelve times too great.


Btw, I think that dx1, dy1, ... need to be of type TPos, not int.


On the issue of avoiding square roots, a little bit of algebra shows that it is 
possible to estimate r = sqrt(dx^2 + dy^2) with a simple linear combination of 
dx and dy.

For example, if an overestimate is required, and dx = dy, you can use

  r_upperbound = dx + (sqrt(2) - 1) * dy

which overestimates by no more than 8.5%.

For integer arithmetic, sqrt(2) - 1 can be (over)estimated by 107/256.

For example, you could use this approximation to do something like this:

  dx1 = abs(control1-x - midx);
  dy1 = abs(control1-y - midy);
  if (dx1 = dy1)
dr1 = dx1 + (107 * dy1 + 255)  8;
  else
dr1 = dy1 + (107 * dx1 + 255)  8;

  dx2 = abs(control2-x - midx);
  dy2 = abs(control2-y - midy);
  if (dx2 = dy2)
dr2 = dx2 + (107 * dy2 + 255)  8;
  else
dr2 = dy2 + (107 * dx2 + 255)  8;  


  /* deviation never exceeds 75% of control point dist */
  if (dr1 = dr2)
dev_max = (3 * dr1 + 3)  2;
  else
dev_max = (3 * dr2 + 3)  2;

  if (dev_max = FT_MAX_CURVE_DEVIATION)
...


Of course, this doesn't resolve the problem I mentioned above; for the example 
curve, this gives dev_max a value of 36 - still nine times too great.


I hope to have something based a bit more closely on Hain's paper by the end of 
tomorrow. I may use something like the square-root avoiding estimate above to 
approximate his L value.


David %^


  

-Original Message-
From: freetype-devel-bounces+david.bevan=pb@nongnu.org
[mailto:freetype-devel-bounces+david.bevan=pb@nongnu.org] On Behalf Of
Graham Asher
Sent: 6 September 2010 11:10
To: freetype-devel
Subject: [ft-devel] latest patch file for spline flattening

Here's a new version of my spline flattening patch. (I would like to be
able to push this to the git repository but am having authentication
problems; Werner has been helping me, but no success so far, probably
because of my ineptitude in these matters.).

The nub of the latest change is that I found that predicting the number
of recursion levels is not reliable when splitting a cubic spline for
flattening. A better way is to check the flatness inside the loop -
using the fast heuristic of taking the maximum coordinate difference of
a control point from the chord midpoint. This also makes the code
simpler - and, surprisingly, faster, according to my benchmark; however,
my benchmark is based on cartographic, not typographic, use, so will
need confirmation.

The patch obviously still solves the bug involving s-shaped curves
(control points on both sides of the chord).

Graham




  



___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


RE: [ft-devel] latest patch file for spline flattening

2010-09-06 Thread David Bevan

I'll do that. I wonder how much of the cost of FT_Load_Glyph is actually spent 
in gray_render_cubic and how much impact reducing the number of line segments 
has on later phases of the rendering process.

I'm also trying to see if I can come up with a heuristic that is both fast 
(i.e. simple linear combinations of positions) and accurate (i.e. always 
correct and any overestimate is reasonably bounded). Then we may be able to 
avoid the complexities of Hain's approach.

We so seem to be converging towards something that meets all the requirements.

I probably won't achieve much today, but hope to spend most of tomorrow on this.

David %^


 -Original Message-
 From: Graham Asher [mailto:graham.as...@btinternet.com]
 Sent: 6 September 2010 12:36
 To: David Bevan
 Cc: freetype-devel
 Subject: Re: [ft-devel] latest patch file for spline flattening
 
 David,
 
 in fact I coded up and tested a different version using an accurate
 calculation of the control point deviation, but it was slower than the
 version I am proposing. I'll try your version; and I would be grateful
 if you could also do some benchmarking, because obviously we are not
 trying to minimise the number of straight line segments the curve is
 dissected into, but the overall speed. My aims, and my benchmarking, are
 rather biased, because my main use of cubic splines is for
 approximations to arcs of circles, used as parts of path envelopes when
 drawing maps.
 
 Graham
 
 
 David Bevan wrote:
  Graham,
 
  That's looking much closer to what I would have thought we needed; only
 splitting the curve when required. However, your fast heuristic can be
 very inaccurate.
 
  Consider
 
P0: 0,0
P1: 5,5
P2: 95,5
P3: 100,0
 
  The max deviation is 3.75 (0.75 * 5 since Hain's v == 1), but your
 heuristic gives a value of 45 - twelve times too great.
 
 
  Btw, I think that dx1, dy1, ... need to be of type TPos, not int.
 
 
  On the issue of avoiding square roots, a little bit of algebra shows
 that it is possible to estimate r = sqrt(dx^2 + dy^2) with a simple linear
 combination of dx and dy.
 
  For example, if an overestimate is required, and dx = dy, you can use
 
r_upperbound = dx + (sqrt(2) - 1) * dy
 
  which overestimates by no more than 8.5%.
 
  For integer arithmetic, sqrt(2) - 1 can be (over)estimated by 107/256.
 
  For example, you could use this approximation to do something like this:
 
dx1 = abs(control1-x - midx);
dy1 = abs(control1-y - midy);
if (dx1 = dy1)
  dr1 = dx1 + (107 * dy1 + 255)  8;
else
  dr1 = dy1 + (107 * dx1 + 255)  8;
 
dx2 = abs(control2-x - midx);
dy2 = abs(control2-y - midy);
if (dx2 = dy2)
  dr2 = dx2 + (107 * dy2 + 255)  8;
else
  dr2 = dy2 + (107 * dx2 + 255)  8;
 
/* deviation never exceeds 75% of control point dist */
if (dr1 = dr2)
  dev_max = (3 * dr1 + 3)  2;
else
  dev_max = (3 * dr2 + 3)  2;
 
if (dev_max = FT_MAX_CURVE_DEVIATION)
  ...
 
 
  Of course, this doesn't resolve the problem I mentioned above; for the
 example curve, this gives dev_max a value of 36 - still nine times too
 great.
 
 
  I hope to have something based a bit more closely on Hain's paper by the
 end of tomorrow. I may use something like the square-root avoiding
 estimate above to approximate his L value.
 
 
  David %^
 
 
 
  -Original Message-
  From: freetype-devel-bounces+david.bevan=pb@nongnu.org
  [mailto:freetype-devel-bounces+david.bevan=pb@nongnu.org] On Behalf
 Of
  Graham Asher
  Sent: 6 September 2010 11:10
  To: freetype-devel
  Subject: [ft-devel] latest patch file for spline flattening
 
  Here's a new version of my spline flattening patch. (I would like to be
  able to push this to the git repository but am having authentication
  problems; Werner has been helping me, but no success so far, probably
  because of my ineptitude in these matters.).
 
  The nub of the latest change is that I found that predicting the number
  of recursion levels is not reliable when splitting a cubic spline for
  flattening. A better way is to check the flatness inside the loop -
  using the fast heuristic of taking the maximum coordinate difference of
  a control point from the chord midpoint. This also makes the code
  simpler - and, surprisingly, faster, according to my benchmark;
 however,
  my benchmark is based on cartographic, not typographic, use, so will
  need confirmation.
 
  The patch obviously still solves the bug involving s-shaped curves
  (control points on both sides of the chord).
 
  Graham
 
 
 
 
 


___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread James Cloos
 GA == Graham Asher graham.as...@btinternet.com writes:

GA Have you done an ARM version? Forgive my inattentiveness if you've
GA already announced one. It just struck me that this sort of
GA optimisation is even more necessary on mobile devices.

I386, arm and arm-thumb versions were already there.

-JimC
-- 
James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6

___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread James Cloos
 MB == Miles Bader mi...@gnu.org writes:

MB The compiler generates the following assembly:

MB mov %esi, %eax
MB mov %edi, %edi
MB imulq   %rdi, %rax
MB addq$32768, %rax
MB shrq$16, %rax

That does not match the C code though; it rounds negative values wrong.

The C version does away-from-zero rounding.

Using the single arg version of imulq generates a 128 bit result; the
more significant part of which will be 0 iff the product is =0 and
will be -1 if the product is 0, given that the multiplicands were
only 32 bits.  Adding that, in addition to the 32768, to rax ensures
that the result of the =16 is rounded the way freetype wants.

If you use the two arg version of imul, you have to copy the msb of the
result (or do a compare and jump, like the C code) to determine whether
to add 0x8000 or 0x7FFF.

Matching the rounding was the hardest part; noting that the upper 64
bits of the 128-bit product would always be just sign-extension bits
and that, because of the prototype of FT_MulFix() itself, the vaules
are already promoted to 64 bits before they get to the assembly were
what provided the most (in-order) speedups.

If it can be done better, though, I'd be happy to know!

Thanks for also looking at it.

-JimC
-- 
James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6

___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Miles Bader
On Tue, Sep 7, 2010 at 4:28 AM, James Cloos cl...@jhcloos.com wrote:
 MB == Miles Bader mi...@gnu.org writes:

 MB The compiler generates the following assembly:

 MB     mov     %esi, %eax
 MB     mov     %edi, %edi
 MB     imulq   %rdi, %rax
 MB     addq    $32768, %rax
 MB     shrq    $16, %rax

 That does not match the C code though; it rounds negative values wrong.

 The C version does away-from-zero rounding.

Do you have test cases that show this?  I tried using random inputs,
but even up to billions of iterations, I can't seem to find a set of
inputs where my function yields different results from yours.

Thanks,

-Miles


-- 
Cat is power.  Cat is peace.

___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread James Cloos
 MB == Miles Bader mi...@gnu.org writes:

 The C version does away-from-zero rounding.

MB Do you have test cases that show this?  I tried using random inputs,
MB but even up to billions of iterations, I can't seem to find a set of
MB inputs where my function yields different results from yours.

The C version saves the two signs, takes the absolute values,
multiplies, scales and then sets the sign.

When I tested, I used dd(1) to generate a quarter-gig file from urandom
(I used a fixed file so that it would be reproducable), mmap(2)ed that
to an int[], and went through two at a time.  The C and my initial asm
versions produced different results whenever the second int was -1 (ie
0x) and the first matched: (a  0  (a  0x == 0x8000)).

In other words, multiplying something like 7.5 by -1/65536.

An example of that test's output was:

  7AFA8000, , 8505, 8506, 0

In that example, 8505 is what the C version generates.

-JimC
-- 
James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6

___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel