Re: [ft-devel] FT_MulFix assembly

2010-09-19 Thread Werner LEMBERG
 Werner:  Miles' version is shorter, is only wrong by one ulp and only
  when the product overflows and is negative.  My variation,
  called another() above, fixes that slight difference.
  Which would you prefer, if anything?

I tend to prefer the faster one $(Q#|(B IIRC, we never rely on product
overflow of 32bit entities.

  I can write a patch which uses it only when it works (64
  bit sizeof(long)).

Please do so.


Werner

___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-07 Thread Miles Bader
James Cloos cl...@jhcloos.com writes:
 The C version does away-from-zero rounding.

 MB Do you have test cases that show this?  I tried using random inputs,
 MB but even up to billions of iterations, I can't seem to find a set of
 MB inputs where my function yields different results from yours.

 The C version saves the two signs, takes the absolute values,
 multiplies, scales and then sets the sign.

 When I tested, I used dd(1) to generate a quarter-gig file from urandom
 (I used a fixed file so that it would be reproducable), mmap(2)ed that
 to an int[], and went through two at a time.  The C and my initial asm
 versions produced different results whenever the second int was -1 (ie
 0x) and the first matched: (a  0  (a  0x == 0x8000)).

 In other words, multiplying something like 7.5 by -1/65536.

 An example of that test's output was:

   7AFA8000, , 8505, 8506, 0

 In that example, 8505 is what the C version generates.

Hm, are you sure that's not backwards?  When I tried the git C version[*],
as well as your most recent FT_MulFix_x86_64, it returned 0x8506...


The following C version:

   typedef signed int FT_Int;
   typedef signed long FT_Long;
   typedef signed long FT_Int64; /* on x86-64 */

   FT_Long
   FT_MulFix_C_new( FT_Long  a,
FT_Long  b )
   {
 return (((FT_Int64)a * (FT_Int64)b) + 0x8000)  16;
   }

... generates this code:

imulq   %rsi, %rdi
leaq32768(%rdi), %rax
sarq$16, %rax


It seems to yield exactly the same results as the offical C version[*],
both for your test case:

   $ ./t 0x7AFA8000 0x
   0x7afa8000 x 0x =
   C: 0x7fff8506
   C_new: 0x7fff8506
 asm: 0x7fff8506

... and also for billions of random inputs.


Is there something I'm missing...?

[*] Fetched from:
http://git.savannah.gnu.org/cgit/freetype/freetype2.git/tree/src/base/ftcalc.c


Thanks,

-Miles

-- 
Liberty, n. One of imagination's most precious possessions.


___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-07 Thread James Cloos
 MB == Miles Bader mi...@gnu.org writes:

MB Hm, are you sure that's not backwards?  When I tried the git C version[*],
MB as well as your most recent FT_MulFix_x86_64, it returned 0x8506...


Odd.  Adding your algo to my test app, I get:

  7AFA8000, , 8505, 8505, 8506
 #a   , b   ,FT   ,JC   ,MB

I see that I have one small error in the C code in my app.

FT has:

c = (FT_Long)( ( (FT_Int64)a * b + 0x8000L )  16 );

whereas I used:

c = (int32_t)(((int64_t)a*b + 0x8000L)  16);

But changing the int32_t to long does not change the results.

Yours still is always +1 compared to the C, whenever the first arg
represents a positive value with fractional part == 1/2.

Oddly, though, gcc now refuses to compile my asm, even though it did do
so before, complaining that I cannot guess what arg size to use for the
imul  Wierd.  (The existing executables prove that it used to.)
A simple way around that is to specify D and S as the contraints
for a and b.  (The rdi and rsi regesters are where the x86_64 abi puts
the first two args which are passed to a function.)

The disassembly of the final version is:

004006c0 mf:
  4006c0:   48 89 f8mov%rdi,%rax
  4006c3:   48 f7 eeimul   %rsi
  4006c6:   48 01 d0add%rdx,%rax
  4006c9:   48 05 00 80 00 00   add$0x8000,%rax
  4006cf:   48 c1 f8 10 sar$0x10,%rax
  4006d3:   c3  retq   

And I get this disassembly of yours:

00400840 miles:
  400840:   48 63 c6movslq %esi,%rax
  400843:   48 63 ffmovslq %edi,%rdi
  400846:   48 0f af c7 imul   %rdi,%rax
  40084a:   48 05 00 80 00 00   add$0x8000,%rax
  400850:   48 c1 f8 10 sar$0x10,%rax
  400854:   c3  retq   

I also just added this version to my test app:

int another (int32_t a, int32_t b) {
long r = (long)a * (long)b;
long s = r  31;
return (r + s + 0x8000)  16;
}

That results in:

00400760 another:
  400760:   48 63 ffmovslq %edi,%rdi
  400763:   48 63 f6movslq %esi,%rsi
  400766:   48 0f af f7 imul   %rdi,%rsi
  40076a:   48 89 f0mov%rsi,%rax
  40076d:   48 c1 f8 1f sar$0x1f,%rax
  400771:   48 8d 84 06 00 80 00lea0x8000(%rsi,%rax,1),%rax
  400778:   00 
  400779:   48 c1 f8 10 sar$0x10,%rax
  40077d:   c3  retq   

Since FT's C version uses longs, though, this:

int another (long a, long b) {
long r = (long)a * (long)b;
long s = r  31;
return (r + s + 0x8000)  16;
}

gives:

00400760 another:
  400760:   48 0f af f7 imul   %rdi,%rsi
  400764:   48 89 f0mov%rsi,%rax
  400767:   48 c1 f8 1f sar$0x1f,%rax
  40076b:   48 8d 84 06 00 80 00lea0x8000(%rsi,%rax,1),%rax
  400772:   00 
  400773:   48 c1 f8 10 sar$0x10,%rax
  400777:   c3  retq   

So it would seem that when compiling for any processor where FT_Long is
the same as int64_t and where that fits into a single register, then
that last bit of C might be optimal, yes?

-JimC
-- 
James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6

___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-07 Thread Miles Bader
James Cloos cl...@jhcloos.com writes:
 Since FT's C version uses longs, though, this:

 int another (long a, long b) {
 long r = (long)a * (long)b;
 long s = r  31;
 return (r + s + 0x8000)  16;
 }

That's not correct though, is it?  The variable s should be the all
sign portion of the multiplication, but since the two inputs have 32
significant bits (never mind the types), the product will have 64
significant bits.  So r  31 won't be all-sign, it'll be a bunch of
... other bits. :)

However, changing the shift to 63:

   FT_Long
   FT_MulFix_C_new2( FT_Long  a,
 FT_Long  b )
   {
 FT_Int64 prod = (FT_Int64)a * (FT_Int64)b;
 FT_Int64 sign = prod  63;
 return ((prod + sign + 0x8000)  16);
   }

... does seem to yield correct results:

   $ ./t 0x7AFA8000 0x
   0x7afa8000 x 0x =
   C: 0x8505
   C_new: 0x8505
   C_nw2: 0x8505
   C_ano: 0x8505
 asm: 0x8505

C is the old C, C_new was my previous attempt, C_nw2 is the above
FT_MulFix_C_new2 function, C_ano is the another function, and
asm was your final asm version.

[another yields misleadingly correct results in this case, because of
the particular argument values given; in other cases, it gives
incorrect results.]

-miles

-- 
Discriminate, v.i. To note the particulars in which one person or thing is,
if possible, more objectionable than another.


___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Graham Asher
Have you done an ARM version? Forgive my inattentiveness if you've 
already announced one. It just struck me that this sort of optimisation 
is even more necessary on mobile devices.


Graham


James Cloos wrote:

The final result for amd64 looks like:

  static __inline__ long
  FT_MulFix_x86_64( long  a,
   long  b )
  {
register long  result;

__asm__ __volatile__ (
  movq  %1, %%rax\n
  imul  %2\n
  addq  %%rdx, %%rax\n
  addq  $0x8000, %%rax\n
  sarq  $16, %%rax\n
  : =a(result)
  : g(a), g(b)
  : rdx );
return result;
  }


The use of long, though requires review.  The C version uses FT_Long
(not FT_Int32 like the other asm versions), but FT_Long is not a #define
or a typedef at the point where the asm version are located.

That said, using long there on amd64 prevents unnecessary 32-64 bit
conversions in the resulting code.

The above code has a latency of 1+5+1+1+1 = 10 clocks on an amdfam10 cpu.

The assembly generated by the C code is 45 lines and 158 octets long,
contains six conditional jumps, three each of explicit compares and
tests, and still benchmarks are just as fast.  Out-of-order processing
wins out over hand-coded asm. :-/

It *might* make more of a difference on an in-order processor like the
Arom.  But I do not have one to test.

I can still finish a patch, and have collected the info I need to do one
for mips64, too, where I expect it will be more important.  I also expect
that the i386 version could be tidied a bit.

Is the amd64 version desired, given how little benefit it has?

-JimC
  



___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Miles Bader
James Cloos cl...@jhcloos.com writes:
 __asm__ __volatile__ (
   movq  %1, %%rax\n
   imul  %2\n
   addq  %%rdx, %%rax\n
   addq  $0x8000, %%rax\n
   sarq  $16, %%rax\n
   : =a(result)
   : g(a), g(b)
   : rdx );

 The above code has a latency of 1+5+1+1+1 = 10 clocks on an amdfam10 cpu.
...
 Is the amd64 version desired, given how little benefit it has?

If this is being used in a context where it might benefit from more
scheduling, etc, perhaps it would help to let the compiler generate the
non-imul insns (since it's pretty good at those)?

E.g. something like:

  static __inline__ long
  FT_MulFix_x86_64 (long a, long b)
  {
register long  mr1, mr2;
__asm__ (imul  %3\n : =a (mr1), =d (mr2) : a (a), g (b));
return ((mr1 + mr2) + 0x8000)  16;
  }


-miles

-- 
Omochiroi!


___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Miles Bader
Incidentally, you wrote:

 The assembly generated by the C code is 45 lines and 158 octets long,
 contains six conditional jumps, three each of explicit compares and
 tests, and still benchmarks are just as fast.  Out-of-order processing
 wins out over hand-coded asm. :-/

... but when I follow your original suggestion, and just do the following:

   typedef long s64;
   typedef unsigned int u32;

   static __inline__ u32
   FT_MulFix_x86_64 (u32 a, u32 b)
   {
 return (((s64)a * (s64)b) + 0x8000)  16;
   }

The compiler generates the following assembly:

mov %esi, %eax
mov %edi, %edi
imulq   %rdi, %rax
addq$32768, %rax
shrq$16, %rax

The movs there are obviously a bit silly (compiler bug?), but that
output seems reasonably close to the asm() version, and obviously much
more schedulable since the compiler knows what the insns do...

[I tried it with gcc-4.x for a few different xs, and the results are the
same, and optimization flags don't seem to make much difference either.]

-Miles

-- 
Scriptures, n. The sacred books of our holy religion, as distinguished from
the false and profane writings on which all other faiths are based.


___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Miles Bader
Miles Bader mi...@gnu.org writes:
 The compiler generates the following assembly:

   mov %esi, %eax
   mov %edi, %edi
   imulq   %rdi, %rax
   addq$32768, %rax
   shrq$16, %rax

 The movs there are obviously a bit silly (compiler bug?), but that
 output seems reasonably close to the asm() version, and obviously much
 more schedulable since the compiler knows what the insns do...

[...Actually the redundant movs, while they may be a compiler bug, seem
to be caused by the way I structured my test program; in a more
realistic circumstance I don't see them.]

-miles

-- 
Abstainer, n. A weak person who yields to the temptation of denying himself a
pleasure. A total abstainer is one who abstains from everything but
abstention, and especially from inactivity in the affairs of others.


___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread James Cloos
 GA == Graham Asher graham.as...@btinternet.com writes:

GA Have you done an ARM version? Forgive my inattentiveness if you've
GA already announced one. It just struck me that this sort of
GA optimisation is even more necessary on mobile devices.

I386, arm and arm-thumb versions were already there.

-JimC
-- 
James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6

___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread James Cloos
 MB == Miles Bader mi...@gnu.org writes:

MB The compiler generates the following assembly:

MB mov %esi, %eax
MB mov %edi, %edi
MB imulq   %rdi, %rax
MB addq$32768, %rax
MB shrq$16, %rax

That does not match the C code though; it rounds negative values wrong.

The C version does away-from-zero rounding.

Using the single arg version of imulq generates a 128 bit result; the
more significant part of which will be 0 iff the product is =0 and
will be -1 if the product is 0, given that the multiplicands were
only 32 bits.  Adding that, in addition to the 32768, to rax ensures
that the result of the =16 is rounded the way freetype wants.

If you use the two arg version of imul, you have to copy the msb of the
result (or do a compare and jump, like the C code) to determine whether
to add 0x8000 or 0x7FFF.

Matching the rounding was the hardest part; noting that the upper 64
bits of the 128-bit product would always be just sign-extension bits
and that, because of the prototype of FT_MulFix() itself, the vaules
are already promoted to 64 bits before they get to the assembly were
what provided the most (in-order) speedups.

If it can be done better, though, I'd be happy to know!

Thanks for also looking at it.

-JimC
-- 
James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6

___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread Miles Bader
On Tue, Sep 7, 2010 at 4:28 AM, James Cloos cl...@jhcloos.com wrote:
 MB == Miles Bader mi...@gnu.org writes:

 MB The compiler generates the following assembly:

 MB     mov     %esi, %eax
 MB     mov     %edi, %edi
 MB     imulq   %rdi, %rax
 MB     addq    $32768, %rax
 MB     shrq    $16, %rax

 That does not match the C code though; it rounds negative values wrong.

 The C version does away-from-zero rounding.

Do you have test cases that show this?  I tried using random inputs,
but even up to billions of iterations, I can't seem to find a set of
inputs where my function yields different results from yours.

Thanks,

-Miles


-- 
Cat is power.  Cat is peace.

___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-06 Thread James Cloos
 MB == Miles Bader mi...@gnu.org writes:

 The C version does away-from-zero rounding.

MB Do you have test cases that show this?  I tried using random inputs,
MB but even up to billions of iterations, I can't seem to find a set of
MB inputs where my function yields different results from yours.

The C version saves the two signs, takes the absolute values,
multiplies, scales and then sets the sign.

When I tested, I used dd(1) to generate a quarter-gig file from urandom
(I used a fixed file so that it would be reproducable), mmap(2)ed that
to an int[], and went through two at a time.  The C and my initial asm
versions produced different results whenever the second int was -1 (ie
0x) and the first matched: (a  0  (a  0x == 0x8000)).

In other words, multiplying something like 7.5 by -1/65536.

An example of that test's output was:

  7AFA8000, , 8505, 8506, 0

In that example, 8505 is what the C version generates.

-JimC
-- 
James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6

___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-09-05 Thread James Cloos
The final result for amd64 looks like:

  static __inline__ long
  FT_MulFix_x86_64( long  a,
   long  b )
  {
register long  result;

__asm__ __volatile__ (
  movq  %1, %%rax\n
  imul  %2\n
  addq  %%rdx, %%rax\n
  addq  $0x8000, %%rax\n
  sarq  $16, %%rax\n
  : =a(result)
  : g(a), g(b)
  : rdx );
return result;
  }


The use of long, though requires review.  The C version uses FT_Long
(not FT_Int32 like the other asm versions), but FT_Long is not a #define
or a typedef at the point where the asm version are located.

That said, using long there on amd64 prevents unnecessary 32-64 bit
conversions in the resulting code.

The above code has a latency of 1+5+1+1+1 = 10 clocks on an amdfam10 cpu.

The assembly generated by the C code is 45 lines and 158 octets long,
contains six conditional jumps, three each of explicit compares and
tests, and still benchmarks are just as fast.  Out-of-order processing
wins out over hand-coded asm. :-/

It *might* make more of a difference on an in-order processor like the
Arom.  But I do not have one to test.

I can still finish a patch, and have collected the info I need to do one
for mips64, too, where I expect it will be more important.  I also expect
that the i386 version could be tidied a bit.

Is the amd64 version desired, given how little benefit it has?

-JimC
-- 
James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6

___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-08-12 Thread Werner LEMBERG

 I have to finish the patch, but I thought I'd offer the algorithm
 for review, if anyone wants to.

I haven't enough knowledge to comment, but thanks for working on it!


Werner

___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-08-08 Thread James Cloos
My first cut at FT_MulFix_x86_64() is:

static __inline__ FT_Int32
FT_MulFix_x86_64 (FT_Int32 a, FT_Int32 b) {
register FT_Int32 r;
__asm__ __volatile__ (
movslq %%edx, %%rdx\n
cltq\n
imul  %%rdx\n
addq  %%rdx, %%rax\n
addq  $0x8000, %%rax\n
sarq  $16, %%rax\n
: =a(r)
: a(a), d(b));
return r;
}

It passes a monte-carlo test comparing its results to the C code and to
the i386 assembly.

The logic is simple.  The first two instructions sign-extend the two
values to 64 bits, the multiply puts the least significant 64 bits of
the product in rax and the most significant bits in rdx; because the
values started out as 32 bit, rdx is guaranteed to be only sign bits:
zero if the product is =0, else -1.  Adding the resulting rdx to rax
serves the same purpose as the ecx value in the i386 version: it makes
the rounding symmetric around zero, just like the C code.

An alternative might be to cast the src values to (FT_Int64), but I
doubt that the compiler would generate any better code than calling
movslq and cltq.  

I have to finish the patch, but I thought I'd offer the algorithm for
review, if anyone wants to.

-JimC
-- 
James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6

___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


[ft-devel] FT_MulFix assembly

2010-08-06 Thread James Cloos
I see implementations for ia32 and arm; would other platforms benefit
from assembply implementations of MulFix?

I know I could translate the i386 assembly into amd64.  Probably others
with sufficient documentation.

But I can only test on amd64 and, thanks to cray-cyber.org, out of date
sparc and mips.  A gcc-cfarm account might permit wider testing, though.

I don't know how much of the market mips, sh4, et al have for small
devices with gui, but those types of devices would seem to have the
most need for improved performance, ja?

Is there interest?

And is there a good cli to use for testing; I'll need to output
renderings to png or netpbm when testing on remote boxen

-JimC
-- 
James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6

___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel


Re: [ft-devel] FT_MulFix assembly

2010-08-06 Thread Werner LEMBERG
 I see implementations for ia32 and arm; would other platforms
 benefit from assembply implementations of MulFix?

As usual: patches are highly welcomed.


Werner

___
Freetype-devel mailing list
Freetype-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/freetype-devel