Apologies, Jonas; I hope I didn't come off as condescending or rude in that last e-mail - I was joking because while my algorithm is faster, a few nanoseconds probably isn't going to cut it in the grand scheme of things!

I got my code to work, treating "magic add" in the same way as the existing algorithm, and further testing under -O4 shows that it's still faster, although the gain is less pronounced for small divisors; it remains much faster for large divisors.  Of course, being x86_64 assembly language, I'm not expecting it to be approved even if it's isolated to a utility function, but I have attached a patch and a test program to this e-mail for those who are curious and as a proof of concept (tests on x86_64-win64 and x86_64-linux show no regressions).

The speed can be fractionally increased by removing the two internal error checks and the code that checks for powers of two (the latter is there to prevent incorrect results, but the routine shouldn't be called at all for powers of two, since they can be translated directly into bit shifts).  As for the internal errors, one checks whether the divisor is zero (an obviously bad expression), and the other is triggered if the total shift is greater than 127; the total shift is the base bit count (e.g. 64) plus the base-2 logarithm of the divisor, so it should never exceed 127 unless a new 128-bit integer type is introduced.
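
To make the three checks concrete, here is a minimal Pascal sketch of the same logic (the function name is mine and not part of the patch; the patch performs these checks in assembly and calls InternalError with the codes shown):

    { Sketch only: returns False for divisors the magic-number routine should never see. }
    function DivConstPreconditionsOK(N: Byte; d: QWord): Boolean;
      var
        Log2d: Byte;
      begin
        Result := False;
        if d = 0 then
          Exit;                   { InternalError(2020101610) in the patch: divisor is zero }
        if (d and (d - 1)) = 0 then
          Exit;                   { power of two: should become a plain bit shift instead }
        Log2d := BsrQWord(d);     { floor(log2(d)), i.e. what the BSR instruction computes }
        if N + Log2d > 127 then
          Exit;                   { InternalError(2020101611): would need a 128-bit integer type }
        Result := True;
      end;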

One final thing to note is that my algorithm produces different reciprocals (magic numbers) and shifts for some divisors, but the test program I've supplied shows that they're still correct, and careful analysis of the reciprocals will reveal that one is a bit-shift of the other (the existing algorithm seems to trim off trailing zero bits and reduces the shift to compensate).  Also, for divisors greater than N-1 bits in size, my algorithm produces a 'magic add' solution while the existing one doesn't, but for divisors this size, a simple comparison is faster than doing a multiplication and shift, since the quotient will always be 0 or 1 in these instances.
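
As an illustration of that last point (my wording and names, not the patch's): for an unsigned divisor whose most significant bit is set, the quotient can only ever be 0 or 1, so the division reduces to a single comparison:

    { Sketch only: x div d for an unsigned d with its top bit set. }
    function DivByHugeConst(x, d: QWord): QWord; inline;
      begin
        Result := Ord(x >= d);   { 1 if x >= d, otherwise 0 - no multiply or shift needed }
      end;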

Gareth aka. Kit

On 18/10/2020 01:13, J. Gareth Moreton via fpc-devel wrote:
Well, I think you might be right on this one, Jonas!

I've tested my algorithm against the one used in the compiler. It's 5 times faster when used with small divisors (so loop iterations are minimal)... but that amounts to about 15 nanoseconds compared to 75 nanoseconds!  Additionally, it treats the "magic add" differently (the reason why it randomly failed).

Gareth aka. Kit

On 16/10/2020 23:14, J. Gareth Moreton via fpc-devel wrote:
On 16/10/2020 10:47, Jonas Maebe via fpc-devel wrote:
On 16/10/2020 10:14, J. Gareth Moreton via fpc-devel wrote:

Before I go optimising the wrong thing, I have a question to ask.
What's the policy on platform-specific assembly language in the
compiler, or any code designed to run on a specific (source) platform
(and using a more generic implementation otherwise via $ifdef)?  I ask
because I have a faster algorithm for "calc_divconst_magic_unsigned" in
'compiler/cgutils.pas', but it's only able to work because it can take
advantage of the x86 DIV instruction using RDX:RAX (or EDX:EAX) as a
double-wide dividend. It is somewhat faster than what currently exists
because of the lack of a loop whose iteration count is proportional to
log2(d), where d is the desired divisor (in other words, the existing code gets slower the bigger the divisor is, whereas my algorithm runs at constant speed).
In general, there should be no assembly language in the compiler. I also don't think that's worth it in this case. Unless (or maybe "even if") your code contains nothing but divisions by constants, I doubt this code has a significant effect on the total compile time.

Division by constants occurs fairly frequently in code. For example, dividing by 10000 happens whenever Currency is used, and 1000 often appears in timing measurements (e.g. if t is in milliseconds, then t div 1000 is the number of seconds and t mod 1000 is the leftover milliseconds).
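
As an illustration of what the strength reduction buys (my own example, not necessarily what the compiler emits), t div 1000 on a 32-bit unsigned value can be rewritten with the textbook reciprocal for this divisor:

    { Sketch only: t div 1000 as a multiply and shift.  274877907 = ceil(2^38 / 1000),
      and the result is exact for every 32-bit unsigned t. }
    function DivBy1000(t: Cardinal): Cardinal; inline;
      begin
        Result := Cardinal((QWord(t) * 274877907) shr 38);
      end;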

The existing code contains two divisions by a variable (so they can't be optimised) and a loop that has, at most, N iterations, where N is the bit size (often 32 or 64).  The loop contains only addition, subtraction and multiplication, plus 3 branches to contend with (not including the repeat...until jump).  My code contains a single DIV, but also a BSR, which effectively computes the base-2 logarithm of the divisor (it also throws an internal error if the divisor is zero, since that should have been caught already).
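
For anyone who wants the idea without reading the assembly, here is a rough Pascal sketch of the constant-time approach for the 32-bit case, where a double-width (64-bit) dividend is already available without assembly; the patch does the equivalent for 64-bit divisors by letting the x86_64 DIV instruction use RDX:RAX as a 128-bit dividend.  The names and structure are mine, and it assumes the zero and power-of-two cases have already been filtered out:

    procedure CalcMagicUnsigned32(d: Cardinal; out magic_m: Cardinal;
      out magic_add: Boolean; out magic_shift: Byte);
      var
        L: Byte;
        num, q, r: QWord;
      begin
        L := BsrDWord(d);                  { floor(log2(d)) - one BSR instead of a loop }
        num := QWord(1) shl (32 + L);      { 2^(32+L): the double-width dividend }
        q := num div d;                    { initial reciprocal }
        r := num mod d;                    { remainder decides how to round }
        magic_shift := L;
        magic_add := (2 * r) < d;          { most significant fractional bit is 0 }
        if magic_add then
          begin
            q := q shl 1;                  { keep one extra fraction bit; its msb is
                                             recovered by the post-multiply addition }
            Inc(magic_shift);
          end;
        magic_m := Cardinal((q + 1) and $FFFFFFFF);   { round up and mask to 32 bits }
      end;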

Granted, you may be right and the saving won't be worth it, not to mention the additional complexity (and my function currently fails on certain divisors unexpectedly, so I'll have to do some deeper testing, if only for my own peace of mind!) - only a lot of timing tests will determine that.  Nevertheless, thanks for providing the article on calculating the reciprocal - that's definitely helpful in understanding what's going on.

Gareth aka. Kit

Index: compiler/cgutils.pas
===================================================================
--- compiler/cgutils.pas        (revision 47116)
+++ compiler/cgutils.pas        (working copy)
@@ -409,7 +409,9 @@
         ad,anc,delta,q1,r1,q2,r2,t: aWord;
         two_N_minus_1: aWord;
       begin
-        assert((d<-1) or (d>1));
+        if not ((d<-1) or (d>1)) then
+          InternalError(2020101601);
+
         two_N_minus_1:=aWord(1) shl (N-1);
 
         ad:=abs(d);
@@ -445,6 +447,207 @@
       end;
 
 
+{$if defined(CPUX86_64) and defined(cpu64bitalu)}
+    { We can take advantage of the RDX:RAX dividend of the x86_64 DIV operand for a faster algorithm }
+    procedure calc_divconst_magic_unsigned(N: byte; d: aWord; out magic_m: aWord; out magic_add: boolean; out magic_shift: byte); assembler; nostackframe;
+      asm
+      {$ifdef MSWINDOWS} { Microsoft x64 ABI }
+        { %cl        = N
+          %rdx       = d
+          %r8        = magic_m address
+          %r9        = magic_add address
+          0x28(%rsp) = magic_shift address
+        }
+
+        cmp  $64,  %cl
+        movzbl %cl,%eax { Preserve %cl }
+
+        { Compute bitmask }
+        setb %r11b { Account for situation where %cl is 64 by setting it to zero initially }
+
+        { Compute floor(log2(d)) }
+        bsr  %rdx, %r10
+
+        movzbl %r11b,%r11d
+        shl  %cl,  %r11
+        sub  $1,   %r11
+        mov  %r11, 8(%rsp) { Store bit mask for later in shadow space }
+
+        mov  $2020101610,%ecx { Prepare possible internal error call }
+        jz   InternalError { InternalError(2020101610) - d was zero }
+
+        { Do a quick power of two check (is a power of two iff d and (d - 1) = 0) }
+        lea  -1(%rdx),%r11
+        movzbl %al,%ecx { Restore %ecx }
+        test %rdx, %r11
+
+        { Get address of magic_shift }
+        mov  0x28(%rsp),%r11
+        mov  $1,   %eax
+
+        jnz  .LNotPower2
+
+        sub  %r10b,%cl
+        movb $0,   (%r9)
+        shl  %cl,  %rax
+        mov  %rax, (%r8)
+        movb $0,   (%r11)
+        ret
+
+      .LNotPower2:
+        mov  %r10b,(%r11) { We already have the shift, so write it out now (allows us to reuse %r11) }
+
+        { Combine the bit shifts }
+        add  %r10b,%cl
+
+        { Preserve d since we need to use %rdx specifically }
+        mov  %rdx, %r10
+        { Preserve N since we need to use %rcx specifically }
+        movzbl %cl,%r11d
+
+        { Note, if we ever need to support a 128-bit ALU, this function will need
+          to be rewritten to accommodate it, otherwise InternalError(2020101611)
+          will be raised. i.e. 64 bits maximum for now }
+        cmp  $127, %r11b
+        mov  $2020101611,%ecx { Prepare possible internal error call }
+        ja   InternalError { InternalError(2020101611) - bit size was too large }
+
+        { Restore %ecx }
+        mov  %r11d,%ecx
+        xor  %edx, %edx
+        shl  %cl,  %rax
+        sub  $64,  %cl
+        jb   .LSkipHighShift
+
+        mov  $1,   %edx
+        { Shifts greater than 64 can cause %rax to take on an unpredictable
+          value (specifically 1 shl (%cl and $3F)), so zero it here }
+        xor  %eax, %eax
+        shl  %cl,  %rdx
+      .LSkipHighShift:
+        { Calculate the initial reciprocal (magic number) }
+        div  %r10
+
+        { Get address of magic_shift }
+        mov  0x28(%rsp),%r11
+
+        { Determine if most-significant fractional bit is 1 or 0 by determining
+          if the remainder >= d/2 }
+        shl  $1,   %rdx
+        jc   .LMagicAddOverflow { If the above shift overflowed, then it definitely is greater than d now }
+        add  $1,   %rdx { This ensures that the carry gets set if the remainder is exactly d/2 }
+        jc   .LMagicAddOverflow { If the above add overflowed, then it definitely is greater than d now }
+        cmp  %rdx, %r10 { Note the order of the operands - this forces the carry bit to be set if %rdx >= %r10 }
+      .LMagicAddOverflow:
+        setnc (%r9)
+        jc   .LMagicAddEnd
+
+        { Shift the reciprocal up by 1 bit (the msb gets lost, but this is retrieved via the addition after multiplication) }
+        shl  $1,   %rax
+        { Adjust magic_shift }
+        addb $1,   (%r11)
+      .LMagicAddEnd:
+        { Set magic add if the carry is clear }
+        add  $1,   %rax { Increment the reciprocal }
+        and  8(%rsp),%rax { Apply the bitmask to the reciprocal }
+        mov  %rax, (%r8)
+
+      {$else MSWINDOWS} { System V ABI }
+
+        { %dil       = N
+          %rsi       = d
+          %rdx       = magic_m address
+          %rcx       = magic_add address
+          %r8        = magic_shift address
+        }
+
+        cmp  $64,  %dil
+
+        { Preserve %rcx since we need to use it specifically }
+        mov  %rcx, %r9
+
+        { Compute bitmask }
+        setb %r11b { Account for situation where %cl is 64 by setting it to zero initially }
+
+        { Compute floor(log2(d)) }
+        bsr  %rsi, %r10
+
+        movzbl %dil,%ecx { We have to use %cl for shl }
+        movzbl %r11b,%r11d
+        shl  %cl,  %r11
+        sub  $1,   %r11
+
+        mov  $2020101610,%edi { Prepare possible internal error call }
+        jz   InternalError { InternalError(2020101610) - d was zero }
+
+        { Do a quick power of two check (is a power of two iff d and (d - 1) = 0) }
+        lea  -1(%rsi),%rdi
+        mov  $1,   %eax
+        test %rsi, %rdi
+
+        jnz  .LNotPower2
+
+        sub  %r10b,%cl
+        movb $0,   (%r8)
+        shl  %cl,  %rax
+        movb $0,   (%r9)
+        mov  %rax, (%rdx)
+        ret
+
+      .LNotPower2:
+        { Combine the bit shifts }
+        add  %r10b,%cl
+
+        { Note, if we ever need to support a 128-bit ALU, this function will need
+          to be rewritten to accommodate it, otherwise InternalError(2020101611)
+          will be raised. i.e. 64 bits maximum for now }
+        cmp  $127, %cl
+        mov  $2020101611,%edi { Prepare possible internal error call }
+        ja   InternalError { InternalError(2020101611) - bit size was too large }
+
+        { Preserve %rdx since we need to use that register specifically }
+        mov  %rdx, %rdi
+
+        xor  %edx, %edx
+        shl  %cl,  %rax
+        sub  $64,  %cl
+        jb   .LSkipHighShift
+
+        mov  $1,   %edx
+        { Shifts greater than 64 can cause %rax to take on an unpredictable
+          value (specifically 1 shl (%cl and $3F)), so zero it here }
+        xor  %eax, %eax
+        shl  %cl,  %rdx
+      .LSkipHighShift:
+        { Calculate the initial reciprocal (magic number) }
+        div  %rsi
+
+        { Determine if most-significant fractional bit is 1 or 0 by determining
+          if the remainder >= d/2 }
+        shl  $1,   %rdx
+        jc   .LMagicAddOverflow { If the above shift overflowed, then it definitely is greater than d now }
+        add  $1,   %rdx { This ensures that the carry gets set if the remainder is exactly d/2 }
+        jc   .LMagicAddOverflow { If the above add overflowed, then it definitely is greater than d now }
+        cmp  %rdx, %rsi { Note the order of the operands - this forces the carry bit to be set if %rdx >= %rsi }
+      .LMagicAddOverflow:
+        setnc (%r9)
+        jc   .LMagicAddEnd
+
+        { Shift the reciprocal up by 1 bit (the msb gets lost, but this is retrieved via the addition after multiplication) }
+        shl  $1,   %rax
+        { Adjust magic_shift }
+        add  $1,   %r10b
+      .LMagicAddEnd:
+
+        add  $1,   %rax { Increment the reciprocal }
+        mov  %r10b, (%r8) { Write magic shift }
+        and  %r11, %rax { Apply the bitmask to the reciprocal }
+        mov  %rax, (%rdi)
+      {$endif MSWINDOWS}
+      end;
+
+{$else CPUX86_64 and cpu64bitalu}
+
     procedure calc_divconst_magic_unsigned(N: byte; d: aWord; out magic_m: aWord; out magic_add: boolean; out magic_shift: byte);
       var
         p: aInt;
@@ -494,6 +697,7 @@
         magic_m:=(q2+1) and mask;        { resulting magic number }
         magic_shift:=p-N;     { resulting shift }
       end;
+{$endif CPUX86_64 and cpu64bitalu}
 {$pop}
 
 end.

<<attachment: testprog.zip>>

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
