> I'd like to challenge the readers to produce the best algorithm
>and/or code for finding factors of Mersenne numbers greater than 64-bits.
Beat this! (I'm sure it's possible). In particular my i86 assembler is
a bit rusty, so maybe a few clocks could be shaved off by use of more
appropriate registers.
Given a divisor with more than 64 but not more than 96 significant
bits, stored in low-byte-first order in an array of 32-bit integers
pointed to by q, this procedure calculates the remainder after
dividing into 2^p-1. The result is stored in an array of (at least 3)
32-bit integers pointed to by r. Only the first three elements of the
array pointed to by r will be written to; the array pointed to by q
is not written to at all. If it is not required to preserve the value
in q, then the procedure can be called with q & r pointing to the same
array.
If the value pointed to by q has less than 65 significant bits, the
first three elements of the array pointed to by r will contain all
"1" bits on exit. Note that this can never be a valid remainder.
Dependencies:
(a) A 32-bit version of Microsoft Visual C++, or something compatible
(MS VC++ 5.0 used for development)
(b) object processor, Intel 80386 compatible
(c) Microsoft __cdecl calling convention used by caller
(d) (I think) DS = SS is assumed...
Features:
(a) Timing, (normal exit) proportional to p (i.e. to logarithm of
number to be factored), but effectively independent of (legal) values
of *q. Approx. 12.5 * p / CPU speed(Hz), measured on Pentium 100
Classic, i.e. 0.625 sec for p ~ 5,000,000. This is about one quarter
the speed of the algorithm currently used by Prime95 for divisors up
to 62 bits in length.
(b) Manages to totally avoid memory access of data in the core loop
of the code (from label L3 to the conditional jump back to L3).
Branching is reduced to an absolute minimum.
(c) Does not use FPU at all. (If we're being *really* clever, perhaps
we could run some FPU-intensive code in parallel!)
(d) Algorithm easily implemented for any other CPU type with "normal"
32-bit arithmetic & logical operations.
(e) Gives an (almost) free test for divisibility of 2^p+k by *q - e.g.
if 2^p+1 is divisible by *q, then, on exit, *r will contain a value
which will be 2 less than (the entry value of) *q.
(f) Quotient is not stored by this procedure. Doing this would cause
a significant reduction in speed, not justified for trial factoring.
Copyright (c) Brian Beesley 23-Nov-1998
GNU Public Licence conditions apply
....................................................................
void remainder96(unsigned int p,unsigned int * q,unsigned int * r)
{
__asm
{
push esi
push edi
mov ebx,q
mov esi,-1
mov edi,esi
mov eax,32
mov edx,[ebx+8]
L1: and edx,edx
js L2
dec eax
jz errex
shl edx,1
shr edi,1
jmp L1
L2: add eax,64
neg eax
add eax,p
jnc toosml
mov ecx,eax
push ebp
mov ebp,[ebx+8]
mov edx,[ebx+4]
mov ebx,[ebx]
mov eax,esi
sub eax,ebx
sbb esi,edx
sbb edi,ebp
L3: stc
rcl eax,1
rcl esi,1
rcl edi,1
sub eax,ebx
sbb esi,edx
sbb edi,ebp
jnc L4
add eax,ebx
adc esi,edx
adc edi,ebp
L4: dec ecx
jnz L3
pop ebp
mov ebx,r
mov [ebx],eax
mov [ebx+4],esi
mov [ebx+8],edi
jmp exitOK
errex: mov ebx,r
mov [ebx],esi
mov [ebx+2],esi
mov [ebx+4],esi
jmp exitOK
toosml: mov ebx,q
mov eax,[ebx]
mov ecx,[ebx+4]
mov edx,[ebx+8]
mov ebx,r
mov [ebx],eax
mov [ebx+4],ecx
mov [ebx+8],edx
exitOK: pop edi
pop esi
}
}
....................................................................