On 15 Feb 2002, at 18:35, [EMAIL PROTECTED] wrote:

> The algorithm under discussion does share some superficial
> features with p-1, in that we do a bunch of big-integer
> modular multiplies followed by a gcd to extract any factors.
> But in fact we have complete control over the factors that
> will be found, by way of the candidates we throw into the
> accumulated product. So it really is equivalent to sieve-based
> trial factoring, the difference being in the way we test the
> candidates.
Ah, now I see clearly what's going on.

> > The other point here is that the GCD is _expensive_. On a 1GHz
> > P3 or Athlon the GCD at the end of P-1 on exponent ~10^7 takes
> > ~10 minutes.
>
> Ah, but that's (as Rich Schroeppel pointed out) the beauty of
> modular arithmetic. If we only need to do a single gcd, who cares
> if it takes ten minutes or even a few hours?

All right, let's assume that the cost of the GCD is zero, or as close to
zero as makes no difference.

The difference between the new algorithm and "traditional" trial
factoring is that, for each possible factor F that survives being sieved
out, we calculate P(n+1) = P(n) * F (mod 2^p-1) and hope that, when we
have incorporated sufficient possible factors, we find
GCD(P(N), 2^p-1) > 1; whereas the "traditional" method simply calculates
2^p (mod F) for each F in turn, hoping to find the result 1, indicating
that we have found a factor.

So the viability of the new algorithm depends on whether we can compute
P*F (mod 2^p-1) faster than we can compute 2^p (mod F). The efficiency
of the sieving is irrelevant, because if we can efficiently sieve out
factor candidates for one algorithm, we can use exactly the same sieving
technique with the other algorithm.

When the average value of P is 2^(p-1), F ~ 2^64 and p ~ 10^7, this
looks (to say the least) rather improbable. (Though see later in this
message ...) The point here is that running P-1 stage 1 with B1=10^6
takes a while, but (a) there are in general going to be a lot more
non-trivially-sieved-out factor candidates between e.g. 2^64 and 2^65
than there are primes < 10^6, and (b) each multiplication certainly
won't be faster - in fact we will probably be multiplying by a double-
or triple-length scalar quantity, whereas 10^6 fits very easily into a
single word.
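To make the comparison concrete, here is a toy sketch (my own, not
anyone's production code) of both approaches on a small composite
Mersenne number, 2^29 - 1 = 233 * 1103 * 2089; the candidate range and
variable names are illustrative only:

```python
# Toy comparison of the two approaches discussed above, on a Mersenne
# number small enough that the arithmetic is visible.
from math import gcd

p = 29
M = (1 << p) - 1            # 2^29 - 1 = 233 * 1103 * 2089

# Candidate factors of 2^p - 1 have the form 2*k*p + 1; this range
# happens to include the factors 233 (k=4) and 1103 (k=19).
candidates = [2 * k * p + 1 for k in range(1, 21)]

# "Traditional" trial factoring: compute 2^p (mod F) for each F in turn.
trad = [F for F in candidates if pow(2, p, F) == 1]

# New algorithm: accumulate P <- P * F (mod 2^p - 1), one GCD at the end.
P = 1
for F in candidates:
    P = (P * F) % M
g = gcd(P, M)

print(trad)   # factors found one at a time: [233, 1103]
print(g)      # 233 * 1103 = 256999, both factors found in one batch
```

Both methods find the same factors; the question in the text is purely
which inner loop is cheaper per candidate.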
Now, if we could find a "weak pseudoprimality" test - it needs to be
_much_ faster than Fermat's test, since computing 3^(f-1) (mod f) is
going to take longer than computing 2^p (mod f) - we could use this to
enhance the screening of factor candidates and so speed up trial
factoring (_any_ implementation of it). Might there be mileage in
_this_ idea? The test need not be very accurate, if it's fast enough;
certainly we can afford to accept a lot more "false positives" than
Fermat's test lets through.

> If it became clear that we could implement this algorithm and
> have it run within a factor of no more than 3-4 times slower than
> one-factor-at-a-time sieving for factors <= 64 bits, it would be
> very promising indeed, since it suffers no large performance
> hit for factor candidates above 64 bits.

I still don't see this clearly. Surely there is going to be a "large
performance hit" when multiplying a very large integer by a scalar, when
the scalar jumps in size from N to N+1 computer words and N is a small
number (2 for 32-bit architectures, 1 for 64-bit architectures).

> And it would allow
> machines with poor integer multiply (e.g. the Sparc family)
> to join in the factoring work, since most of the work could
> use floating-point arithmetic, irrespective of the size of
> the factors.

Is this _really_ a problem, or is it likely to become one in the
foreseeable future? Trial factoring is well ahead of LL testing at
present, and seems to be still pulling further ahead, despite the
improvement in the relative efficiency of LL testing vs. trial factoring
with the fairly widespread deployment of P4 systems.

I would have thought that, in the event that LL testing did start to
catch up to trial factoring, the first thing to do would be to reduce
the initial trial factoring depth by 1, then run P-1, and only then
proceed to do the last bit of trial factoring. In fact, on P4 systems it
might be more economical to stop trial factoring sooner than that.
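For context on the screening being discussed: the standard sieve already
used in Mersenne trial factoring exploits the fact that any prime factor
q of 2^p - 1 (p an odd prime) satisfies q = 2kp + 1 and q = +/-1 (mod 8),
and then sieves survivors by small primes. A hypothetical "weak
pseudoprimality" test would slot in after this stage. A sketch (mine,
not GIMPS code):

```python
# Standard candidate screening for factors of 2^p - 1: keep only
# q = 2*k*p + 1 with q = +/-1 (mod 8), then sieve by small primes.
def surviving_candidates(p, kmax, small_primes=(3, 5, 7, 11, 13)):
    out = []
    for k in range(1, kmax + 1):
        q = 2 * k * p + 1
        if q % 8 not in (1, 7):               # q must be +/-1 (mod 8)
            continue
        if any(q % s == 0 for s in small_primes):
            continue                          # sieved out by a small prime
        out.append(q)
    return out

# For p = 29 the survivors include the true factors 233, 1103 and 2089;
# a few composites (e.g. 697 = 17*41) survive too - exactly the "false
# positives" a fast extra test would be allowed to pass.
print(surviving_candidates(29, 40))
```

Whatever extra test is used, it only has to be cheaper than the
2^p (mod q) powering it saves, which is the bar set in the text.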
...........................................

Enough scepticism; here's a pointer to how thought triggered by this
discussion could help improve the efficiency of other factoring methods.

It occurred to me that the operation (P * F) (mod 2^p-1) implicit in
this proposal - and in P-1 stage 1 - is messily unbalanced and could
probably be improved. Here's a shot at it.

Note that, if F has an upper bound of 2^k, then F(a) * F(b) has an upper
bound of 2^(2k). Suppose we build a layered structure as follows. Each
layer L consists of two temporary storage areas R1(L), R2(L) and one
flag f(L). The temporary storage areas in the bottom layer need to be
big enough to hold the biggest value we might be considering (i.e. B1 in
P-1 stage 1). In the next layer the temporary storage areas are big
enough to hold the square of that value, and so on. The top layer is
reached when the maximum possible value exceeds 2^p-1. The flags simply
indicate whether the temporary storage areas in each level are in use.

To start, set all the flags in all the levels to 0. Then, for each
factor F: if f(0) is 0, set f(0) = 1 and store F in R1(0); otherwise,
set f(0) = 0, store F in R2(0), compute R1(0) * R2(0), and if f(1) = 0,
set f(1) = 1 and store the result in R1(1), else set f(1) = 0, store the
result in R2(1), compute R1(1) * R2(1), and so on up the levels.

In the highest level we compute (R1(L) * R2(L)) (mod 2^p-1), store the
result in R1(L) and set f(L) = 1. Only at this level do we need the
modulus operation; all other products are definitely smaller than
2^p-1, so the reduction is not necessary.

Note that in general we will not have exactly 2^N possible factors, so
when we have processed the last factor the computation is incomplete. We
can complete it by jumping to the lowest level with f(L) = 1, moving
R1(L) to R1(L+1) if f(L+1) = 0, else to R2(L+1), and so on.
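The layered scheme above can be sketched as follows; this is a minimal
model of the idea, assuming Python integers stand in for the fixed-size
buffers (so a `None` entry plays the role of flag f(L) = 0, and the
carry propagation is exactly the binary-counter behaviour described):

```python
# Layered accumulation: level L holds a pending product of 2^L candidates;
# products are only reduced mod 2^p - 1 once they reach the top level.
from math import gcd

def layered_product(candidates, p):
    M = (1 << p) - 1
    levels = []            # levels[L] is None (flag 0) or a pending product
    result = 1             # top-level accumulator, kept reduced mod M
    for F in candidates:
        carry, L = F, 0
        while True:
            if L == len(levels):
                levels.append(None)
            if levels[L] is None:
                levels[L] = carry          # f(L) was 0: store and stop
                break
            carry *= levels[L]             # f(L) was 1: multiply, carry up
            levels[L] = None
            L += 1
            if carry >= M:                 # reached the top: reduce mod M
                result = (result * carry) % M
                break
    # Candidate count isn't generally 2^N: flush the leftover levels.
    for pending in levels:
        if pending is not None:
            result = (result * pending) % M
    return result

p = 29
cands = [2 * k * p + 1 for k in range(1, 21)]
P = layered_product(cands, p)
print(gcd(P, (1 << p) - 1))    # same factors as the straightforward loop
```

Half the multiplications happen at level 0 on small operands, a quarter
at level 1, and so on, which is the source of the savings claimed below.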
The point of doing the calculation this way is that:

(a) half of the time we will only be multiplying together smallish
numbers;

(b) each multiplication can be optimized for its size, using direct
multiplication, FFT, or some intermediate method as appropriate;

(c) this method places a lower demand on memory bandwidth, resulting in
a probable increase in speed even if there were no direct reduction in
the work necessary (which in fact there is).

Regards
Brian Beesley
_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
