On 15 Feb 2002, at 18:35, [EMAIL PROTECTED] wrote:

> The algorithm under discussion does share some superficial
> features with p-1, in that we do a bunch of big-integer
> modular multiplies followed by a gcd to extract any factors.
> But in fact we have complete control over the factors that
> will be found, by way of the candidates we throw into the
> accumulated product. So it really is equivalent to sieve-based
> trial factoring, the difference being in the way we test the
> candidates.

Ah, now I see clearly what's going on.
> 
> >The other point here is that the GCD is _expensive_. On a 1GHz 
> >P3 or Athlon the GCD at the end of P-1 on exponent ~10^7 takes ~ 
> >10 minutes.
> 
> Ah, but that's (as Rich Schroeppel pointed out) the beauty of
> modular arithmetic. If we only need to do a single gcd, who cares
> if it takes ten minutes or even a few hours?

All right, let's assume that the cost of the GCD is zero, or as close 
to zero as makes no difference.

The difference between the new algorithm and "traditional" trial 
factoring is as follows. For each possible factor F that survives 
being sieved out, the new algorithm calculates 
P(n+1) = (P(n)*F) (mod 2^p-1) and hopes that, once sufficient 
possible factors have been incorporated, we find 
GCD(P(N),2^p-1) > 1. The "traditional" method simply calculates 
2^p (mod F) for each F in turn, hoping to find the result 1, which 
indicates that F is a factor.

So the viability of the new algorithm depends on whether we can 
compute P*F (mod 2^p-1) faster than we can compute 2^p (mod F).
The efficiency of the sieving is irrelevant, because if we can 
efficiently sieve out factors for one algorithm, we can use exactly 
the same sieving technique with the other algorithm.
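To make the contrast concrete, here is a toy Python sketch of both 
approaches for a small Mersenne number. The choice p = 29 is purely 
illustrative; its factors 233, 1103 and 2089 happen to be small enough 
to find quickly, and all have the required form 2kp+1.

```python
from math import gcd

p = 29
M = 2**p - 1                 # 536870911 = 233 * 1103 * 2089

# Any factor of 2^p - 1 (p an odd prime) has the form 2kp + 1.
candidates = [2*k*p + 1 for k in range(1, 100)]

# "Traditional" trial factoring: one modular exponentiation per
# candidate, looking for 2^p == 1 (mod F).
trad = [F for F in candidates if pow(2, p, F) == 1]

# New algorithm: accumulate P(n+1) = P(n)*F (mod 2^p - 1) over all
# surviving candidates, then take a single GCD at the very end.
P = 1
for F in candidates:
    P = (P * F) % M
g = gcd(P, M)

print(trad)   # the factors found one at a time
print(g)      # g > 1 exactly when some candidate divides M
```

Both routes find the same factors; the question in the text is purely 
which of the two inner operations is cheaper per candidate.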

When the average value of P is 2^(p-1), F ~ 2^64 and p ~ 10^7 this 
looks (to say the least) rather improbable. (Though see later in this 
message ...) The point here is that running P-1 stage 1 with 
B1=10^6 takes a while, but (a) there are in general going to be a lot 
more non-trivially-sieved-out factor candidates between e.g. 2^64 
and 2^65 than there are primes < 10^6, and (b) each multiplication 
certainly won't be faster - in fact we will probably be multiplying by 
a double- or triple-length scalar quantity, whereas 10^6 fits very 
easily into a single word.
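A back-of-envelope count (assuming, as above, p ~ 10^7 and the 
candidate range 2^64 to 2^65) makes point (a) vivid:

```python
p = 10**7

# Candidates F = 2kp + 1 with 2^64 <= F < 2^65:
k_lo = 2**63 // p            # F >= 2^64  =>  k >= 2^63 / p
k_hi = 2**64 // p            # F <  2^65  =>  k <  2^64 / p
n_candidates = k_hi - k_lo   # roughly 9 * 10^11 values of k

# Sieving (F must be +/-1 mod 8, and free of small prime divisors)
# removes only a modest constant fraction of these, whereas P-1
# stage 1 with B1 = 10^6 involves just pi(10^6) = 78498 primes.
n_primes_below_1e6 = 78498

print(n_candidates // n_primes_below_1e6)   # a ratio in the millions
```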

Now, if we could find a "weak pseudoprimality" test - needs to be 
_much_ faster than Fermat's Test, since computing 3^(f-1) (mod f) 
is going to take longer than computing 2^p (mod f) - we could use 
this to enhance the screening of factors & so speed up trial 
factoring (_any_ implementation of it). Might there be mileage in 
_this_ idea? The test need not be very accurate if it's fast enough; 
certainly we can afford to let through a lot more "false positives" 
than Fermat's Test does.
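Any such test would slot in after the screening already standard for 
Mersenne factor candidates. For reference, here is a minimal Python 
sketch of that existing screen (a candidate must have the form 2kp+1, 
be congruent to +/-1 mod 8, and survive small-prime sieving); the 
hypothesised weak pseudoprimality test would be one extra filter at 
the end, which I deliberately leave unspecified.

```python
def survives_screening(f, p, small_primes=(3, 5, 7, 11, 13)):
    """Standard screening for candidate factors f of 2^p - 1,
    p an odd prime.  A hypothetical cheap "weak pseudoprimality"
    test, if one existed, would be applied after these checks."""
    if (f - 1) % (2 * p) != 0:      # must have the form 2kp + 1
        return False
    if f % 8 not in (1, 7):         # must be +/-1 mod 8
        return False
    return all(f % q != 0 for q in small_primes)

# The known factor 233 of 2^29 - 1 passes; 235 does not:
print(survives_screening(233, 29), survives_screening(235, 29))
```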
> 
> If it became clear that we could implement this algorithm and
> have it run within factor of no more than 3-4 times slower than
> one-factor-at-a-time sieving for factors <= 64 bits, it would be
> very promising indeed, since it suffers no large performance
> hit for factor candidates above 64 bits.

I still don't see this clearly. Surely there is going to be a "large 
performance hit" when multiplying a very large integer by a scalar 
when the scalar jumps in size from N to N+1 computer words, and 
N is a small number (2 for 32-bit architectures, 1 for 64-bit 
architectures).

> And it would allow
> machines with poor integer multiply (e.g. the Sparc family)
> to join in the factoring work, since most of the work could
> use floating-point arithmetic, irrespective of the size of
> the factors.

Is this _really_ a problem, or is it likely to become one in the 
foreseeable future? Trial factoring is well ahead of LL testing at 
present, and seems to be still pulling further ahead, despite the 
improvement in the relative efficiency of LL testing vs. trial factoring 
brought by the fairly widespread deployment of P4 systems.

I would have thought that, in the event that LL testing did start to 
catch up to trial factoring, the first thing to do would be to reduce 
the initial trial factoring depth by 1, then run P-1 and only then 
proceed to do the last bit of trial factoring. In fact on P4 systems it 
might be more economical to stop trial factoring sooner than that.

...........................................

Enough scepticism; here's a pointer to how a thought triggered by 
this discussion could help improve the efficiency of other factoring 
methods:

It occurred to me that the operation (P * F) (mod 2^p-1) implicit in 
this proposal - and P-1 stage 1 - is messily unbalanced and could 
probably be improved. Here's a shot at it.

Note that, if F has an upper bound of 2^k, then F(a) * F(b) has an 
upper bound of 2^(2k).

Suppose we build a layered structure as follows. Each layer 
consists of two temporary storage areas R1(L), R2(L) and one flag 
f(L). The temporary storage areas in the bottom layer need to be 
big enough to hold the biggest value we might be considering (i.e. 
B1 in P-1 stage 1). In the next layer the temporary storage areas 
are big enough to hold the square of that value, etc, etc. The top 
layer is determined when the maximum possible value exceeds 
2^p-1. The flags are simply used to indicate whether the temporary 
storage areas in each level are in use.

To start: set all the flags in all the levels to 0.

For each factor F:

If f(0) is 0, set f(0) = 1 & store F in R1(0).
Else, set f(0) = 0, store F in R2(0) and compute R1(0) * R2(0); then, 
if f(1) = 0, set f(1) = 1 and store the result in R1(1); else set 
f(1) = 0, store the result in R2(1), compute R1(1) * R2(1), and carry 
on up the layers in the same way.

In the highest level we compute (R1(L) * R2(L)) (mod 2^p-1), store 
the result in R1(L) and reset f(L) = 1. Only at this level do we need 
the modulus operation; all other products are definitely smaller than 
2^p-1, so the modulus operation is not necessary.

Note that in general we will not have exactly 2^N possible factors, 
so when we have processed the last factor the computation is 
incomplete. We can complete it by jumping to the lowest level with 
f(L) = 1 and moving R1(L) into R1(L+1) if f(L+1) = 0, or into 
R2(L+1) otherwise, and so on up the layers.

The point of doing the calculation this way is that:
(a) half of the time we will only be multiplying together smallish 
numbers;
(b) each multiplication can be optimized for its size, using direct 
multiplication, FFT or some intermediate method as appropriate;
(c) this method places a lower demand on memory bandwidth, 
resulting in a probable increase in speed even if there were no direct 
reduction in the work necessary (which in fact there is).


Regards
Brian Beesley
_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers