Apologies, Jonas; I hope I didn't come off as condescending or rude in that last e-mail - I was joking because while my algorithm is faster, a few nanoseconds probably isn't going to cut it in the grand scheme of things!

I got my code to work, treating "magic add" in the same way as the existing algorithm, and further testing under -O4 shows that it's still faster, although the gain is less pronounced for small divisors; it remains much faster for large divisors.  Of course, being x86_64 assembly language, I'm not expecting it to be approved even if it's isolated to a utility function, but I have attached a patch and a test program to this e-mail for those who are curious and as a proof of concept (tests on x86_64-win64 and x86_64-linux show no regressions).

The speed can be fractionally increased by removing the two internal error checks and the code that checks for powers of two (the latter is there to prevent incorrect results, but the routine shouldn't be called at all for powers of two, since they can be translated directly into bit shifts).  As for the internal errors, one checks whether the divisor is zero (an obviously bad expression), and the other is triggered if the total shift is greater than 127; the total shift is the base bit count (e.g. 64) plus the base-2 logarithm of the divisor, so it should never exceed 127 unless a new 128-bit integer type is introduced.
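
To make the three checks concrete, here is a minimal Pascal sketch of the same logic (the function name is mine and not part of the patch; the patch performs these checks in assembly and calls InternalError with the codes shown):

    { Sketch only: returns False for divisors the magic-number routine should never see. }
    function DivConstPreconditionsOK(N: Byte; d: QWord): Boolean;
      var
        Log2d: Byte;
      begin
        Result := False;
        if d = 0 then
          Exit;                   { InternalError(2020101610) in the patch: divisor is zero }
        if (d and (d - 1)) = 0 then
          Exit;                   { power of two: should become a plain bit shift instead }
        Log2d := BsrQWord(d);     { floor(log2(d)), i.e. what the BSR instruction computes }
        if N + Log2d > 127 then
          Exit;                   { InternalError(2020101611): would need a 128-bit integer type }
        Result := True;
      end;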

One final thing to note is that my algorithm produces different reciprocals (magic numbers) and shifts for some divisors, but the test program I've supplied shows that they're still correct, and careful analysis of the reciprocals will reveal that one is a bit-shift of the other (the existing algorithm seems to trim off trailing zero bits and reduces the shift to compensate).  Also, for divisors greater than N-1 bits in size, my algorithm produces a 'magic add' solution while the existing one doesn't, but for divisors this size, a simple comparison is faster than doing a multiplication and shift, since the quotient will always be 0 or 1 in these instances.
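
As an illustration of that last point (my wording and names, not the patch's): for an unsigned divisor whose most significant bit is set, the quotient can only ever be 0 or 1, so the division reduces to a single comparison:

    { Sketch only: x div d for an unsigned d with its top bit set. }
    function DivByHugeConst(x, d: QWord): QWord; inline;
      begin
        Result := Ord(x >= d);   { 1 if x >= d, otherwise 0 - no multiply or shift needed }
      end;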

Gareth aka. Kit

On 18/10/2020 01:13, J. Gareth Moreton via fpc-devel wrote:
Well, I think you might be right on this one, Jonas!

I've tested my algorithm against the one used in the compiler. It's 5 times faster when used with small divisors (so loop iterations are minimal)... but that amounts to about 15 nanoseconds compared to 75 nanoseconds!  Additionally, it treats the "magic add" differently (the reason why it randomly failed).

Gareth aka. Kit

On 16/10/2020 23:14, J. Gareth Moreton via fpc-devel wrote:
On 16/10/2020 10:47, Jonas Maebe via fpc-devel wrote:
On 16/10/2020 10:14, J. Gareth Moreton via fpc-devel wrote:

Before I go optimising the wrong thing, I have a question to ask.
What's the policy on platform-specific assembly language in the
compiler, or any code designed to run on a specific (source) platform
(and using a more generic implementation otherwise via $ifdef)?  I ask
because I have a faster algorithm for "calc_divconst_magic_unsigned" in
'compiler/cgutils.pas', but it's only able to work because it can take
advantage of the x86 DIV instruction using RDX:RAX (or EDX:EAX) as a
double-wide dividend. It is somewhat faster than what currently exists
because of the lack of a loop whose iteration count is proportional to
log2(d), where d is the desired divisor (in other words, the existing code gets slower the bigger the divisor is, whereas my algorithm runs at constant speed).
In general, there should be no assembly language in the compiler. I also don't think that's worth it in this case. Unless (or maybe "even if") your code contains nothing but divisions by constants, I doubt this code has a significant effect on the total compile time.

Division by constants occurs fairly frequently in code. For example, dividing by 10000 happens whenever Currency is used, and 1000 often appears in timing measurements (e.g. if t is in milliseconds, then t div 1000 is the number of seconds and t mod 1000 is the leftover milliseconds).
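
As an illustration of what the strength reduction buys (my own example, not necessarily what the compiler emits), t div 1000 on a 32-bit unsigned value can be rewritten with the textbook reciprocal for this divisor:

    { Sketch only: t div 1000 as a multiply and shift.  274877907 = ceil(2^38 / 1000),
      and the result is exact for every 32-bit unsigned t. }
    function DivBy1000(t: Cardinal): Cardinal; inline;
      begin
        Result := Cardinal((QWord(t) * 274877907) shr 38);
      end;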

The existing code contains two divisions by a variable (so they can't be optimised) and a loop that has, at most, N iterations, where N is the bit size (often 32 or 64).  The loop contains only addition, subtraction and multiplication, plus 3 branches to contend with (not including the repeat...until jump).  My code contains a single DIV, but also a BSR, which effectively computes the base-2 logarithm of the divisor (it also throws an internal error if the divisor is zero, since that should have been caught already).
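
For anyone who wants the idea without reading the assembly, here is a rough Pascal sketch of the constant-time approach for the 32-bit case, where a double-width (64-bit) dividend is already available without assembly; the patch does the equivalent for 64-bit divisors by letting the x86_64 DIV instruction use RDX:RAX as a 128-bit dividend.  The names and structure are mine, and it assumes the zero and power-of-two cases have already been filtered out:

    procedure CalcMagicUnsigned32(d: Cardinal; out magic_m: Cardinal;
      out magic_add: Boolean; out magic_shift: Byte);
      var
        L: Byte;
        num, q, r: QWord;
      begin
        L := BsrDWord(d);                  { floor(log2(d)) - one BSR instead of a loop }
        num := QWord(1) shl (32 + L);      { 2^(32+L): the double-width dividend }
        q := num div d;                    { initial reciprocal }
        r := num mod d;                    { remainder decides how to round }
        magic_shift := L;
        magic_add := (2 * r) < d;          { most significant fractional bit is 0 }
        if magic_add then
          begin
            q := q shl 1;                  { keep one extra fraction bit; its msb is
                                             recovered by the post-multiply addition }
            Inc(magic_shift);
          end;
        magic_m := Cardinal((q + 1) and $FFFFFFFF);   { round up and mask to 32 bits }
      end;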

Granted, you may be right and the saving won't be worth it, not to mention the additional complexity (and my function currently fails on certain divisors unexpectedly, so I'll have to do some deeper testing, if only for my own peace of mind!) - only a lot of timing tests will determine that.  Nevertheless, thanks for providing the article on calculating the reciprocal - that's definitely helpful in understanding what's going on.

Gareth aka. Kit

Index: compiler/cgutils.pas
===================================================================
--- compiler/cgutils.pas        (revision 47116)
+++ compiler/cgutils.pas        (working copy)
@@ -409,7 +409,9 @@
         ad,anc,delta,q1,r1,q2,r2,t: aWord;
         two_N_minus_1: aWord;
       begin
-        assert((d<-1) or (d>1));
+        if not ((d<-1) or (d>1)) then
+          InternalError(2020101601);
+
         two_N_minus_1:=aWord(1) shl (N-1);
 
         ad:=abs(d);
@@ -445,6 +447,207 @@
       end;
 
 
+{$if defined(CPUX86_64) and defined(cpu64bitalu)}
+    { We can take advantage of the RDX:RAX dividend of the x86_64 DIV operand for a faster algorithm }
+    procedure calc_divconst_magic_unsigned(N: byte; d: aWord; out magic_m: aWord; out magic_add: boolean; out magic_shift: byte); assembler; nostackframe;
+      asm
+      {$ifdef MSWINDOWS} { Microsoft x64 ABI }
+        { %cl        = N
+          %rdx       = d
+          %r8        = magic_m address
+          %r9        = magic_add address
+          0x28(%rsp) = magic_shift address
+        }
+
+        cmp  $64,  %cl
+        movzbl %cl,%eax { Preserve %cl }
+
+        { Compute bitmask }
+        setb %r11b { Account for situation where %cl is 64 by setting it to zero initially }
+
+        { Compute floor(log2(d)) }
+        bsr  %rdx, %r10
+
+        movzbl %r11b,%r11d
+        shl  %cl,  %r11
+        sub  $1,   %r11
+        mov  %r11, 8(%rsp) { Store bit mask for later in shadow space }
+
+        mov  $2020101610,%ecx { Prepare possible internal error call }
+        jz   InternalError { InternalError(2020101610) - d was zero }
+
+        { Do a quick power of two check (is a power of two iff d and (d - 1) = 0) }
+        lea  -1(%rdx),%r11
+        movzbl %al,%ecx { Restore %ecx }
+        test %rdx, %r11
+
+        { Get address of magic_shift }
+        mov  0x28(%rsp),%r11
+        mov  $1,   %eax
+
+        jnz  .LNotPower2
+
+        sub  %r10b,%cl
+        movb $0,   (%r9)
+        shl  %cl,  %rax
+        mov  %rax, (%r8)
+        movb $0,   (%r11)
+        ret
+
+      .LNotPower2:
+        mov  %r10b,(%r11) { We already have the shift, so write it out now (allows us to reuse %r11) }
+
+        { Combine the bit shifts }
+        add  %r10b,%cl
+
+        { Preserve d since we need to use %rdx specifically }
+        mov  %rdx, %r10
+        { Preserve N since we need to use %rcx specifically }
+        movzbl %cl,%r11d
+
+        { Note, if we ever need to support a 128-bit ALU, this function will need
+          to be rewritten to accommodate it, otherwise InternalError(2020101611)
+          will be raised. i.e. 64 bits maximum for now }
+        cmp  $127, %r11b
+        mov  $2020101611,%ecx { Prepare possible internal error call }
+        ja   InternalError { InternalError(2020101611) - bit size was too large }
+
+        { Restore %ecx }
+        mov  %r11d,%ecx
+        xor  %edx, %edx
+        shl  %cl,  %rax
+        sub  $64,  %cl
+        jb   .LSkipHighShift
+
+        mov  $1,   %edx
+        { Shifts greater than 64 can cause %rax to take on an unpredictable
+          value (specifically 1 shl (%cl and $3F)), so zero it here }
+        xor  %eax, %eax
+        shl  %cl,  %rdx
+      .LSkipHighShift:
+        { Calculate the initial reciprocal (magic number) }
+        div  %r10
+
+        { Get address of magic_shift }
+        mov  0x28(%rsp),%r11
+
+        { Determine if most-significant fractional bit is 1 or 0 by determining
+          if the remainder >= d/2 }
+        shl  $1,   %rdx
+        jc   .LMagicAddOverflow { If the above shift overflowed, then it definitely is greater than d now }
+        add  $1,   %rdx { This ensures that the carry gets set if the remainder is exactly d/2 }
+        jc   .LMagicAddOverflow { If the above add overflowed, then it definitely is greater than d now }
+        cmp  %rdx, %r10 { Note the order of the operands - this forces the carry bit to be set if %rdx >= %r10 }
+      .LMagicAddOverflow:
+        setnc (%r9)
+        jc   .LMagicAddEnd
+
+        { Shift the reciprocal up by 1 bit (the msb gets lost, but this is retrieved via the addition after multiplication) }
+        shl  $1,   %rax
+        { Adjust magic_shift }
+        addb $1,   (%r11)
+      .LMagicAddEnd:
+        { Set magic add if the carry is clear }
+        add  $1,   %rax { Increment the reciprocal }
+        and  8(%rsp),%rax { Apply the bitmask to the reciprocal }
+        mov  %rax, (%r8)
+
+      {$else MSWINDOWS} { System V ABI }
+
+        { %dil       = N
+          %rsi       = d
+          %rdx       = magic_m address
+          %rcx       = magic_add address
+          %r8        = magic_shift address
+        }
+
+        cmp  $64,  %dil
+
+        { Preserve %rcx since we need to use it specifically }
+        mov  %rcx, %r9
+
+        { Compute bitmask }
+        setb %r11b { Account for situation where %cl is 64 by setting it to zero initially }
+
+        { Compute floor(log2(d)) }
+        bsr  %rsi, %r10
+
+        movzbl %dil,%ecx { We have to use %cl for shl }
+        movzbl %r11b,%r11d
+        shl  %cl,  %r11
+        sub  $1,   %r11
+
+        mov  $2020101610,%edi { Prepare possible internal error call }
+        jz   InternalError { InternalError(2020101610) - d was zero }
+
+        { Do a quick power of two check (is a power of two iff d and (d - 1) = 0) }
+        lea  -1(%rsi),%rdi
+        mov  $1,   %eax
+        test %rsi, %rdi
+
+        jnz  .LNotPower2
+
+        sub  %r10b,%cl
+        movb $0,   (%r8)
+        shl  %cl,  %rax
+        movb $0,   (%r9)
+        mov  %rax, (%rdx)
+        ret
+
+      .LNotPower2:
+        { Combine the bit shifts }
+        add  %r10b,%cl
+
+        { Note, if we ever need to support a 128-bit ALU, this function will need
+          to be rewritten to accommodate it, otherwise InternalError(2020101611)
+          will be raised. i.e. 64 bits maximum for now }
+        cmp  $127, %cl
+        mov  $2020101611,%edi { Prepare possible internal error call }
+        ja   InternalError { InternalError(2020101611) - bit size was too large }
+
+        { Preserve %rdx since we need to use that register specifically }
+        mov  %rdx, %rdi
+
+        xor  %edx, %edx
+        shl  %cl,  %rax
+        sub  $64,  %cl
+        jb   .LSkipHighShift
+
+        mov  $1,   %edx
+        { Shifts greater than 64 can cause %rax to take on an unpredictable
+          value (specifically 1 shl (%cl and $3F)), so zero it here }
+        xor  %eax, %eax
+        shl  %cl,  %rdx
+      .LSkipHighShift:
+        { Calculate the initial reciprocal (magic number) }
+        div  %rsi
+
+        { Determine if most-significant fractional bit is 1 or 0 by determining
+          if the remainder >= d/2 }
+        shl  $1,   %rdx
+        jc   .LMagicAddOverflow { If the above shift overflowed, then it definitely is greater than d now }
+        add  $1,   %rdx { This ensures that the carry gets set if the remainder is exactly d/2 }
+        jc   .LMagicAddOverflow { If the above add overflowed, then it definitely is greater than d now }
+        cmp  %rdx, %rsi { Note the order of the operands - this forces the carry bit to be set if %rdx >= %rsi }
+      .LMagicAddOverflow:
+        setnc (%r9)
+        jc   .LMagicAddEnd
+
+        { Shift the reciprocal up by 1 bit (the msb gets lost, but this is retrieved via the addition after multiplication) }
+        shl  $1,   %rax
+        { Adjust magic_shift }
+        add  $1,   %r10b
+      .LMagicAddEnd:
+
+        add  $1,   %rax { Increment the reciprocal }
+        mov  %r10b, (%r8) { Write magic shift }
+        and  %r11, %rax { Apply the bitmask to the reciprocal }
+        mov  %rax, (%rdi)
+      {$endif MSWINDOWS}
+      end;
+
+{$else CPUX86_64 and cpu64bitalu}
+
     procedure calc_divconst_magic_unsigned(N: byte; d: aWord; out magic_m: aWord; out magic_add: boolean; out magic_shift: byte);
       var
         p: aInt;
@@ -494,6 +697,7 @@
         magic_m:=(q2+1) and mask;        { resulting magic number }
         magic_shift:=p-N;     { resulting shift }
       end;
+{$endif CPUX86_64 and cpu64bitalu}
 {$pop}
 
 end.

<<attachment: testprog.zip>>

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
