As a learning exercise I took a standard Julia function:
factor() from primes.jl, and tried various ways to speed it up, without
changing its structure much. Below are a few versions, and running time
measurements. Please suggest improvements, so we can learn Julia.
These are only Int versions, tested under Win7-x64, so Int = Int64.
One should extend them to other integer sizes, including unsigned types
_after_ the code is shortened, simplified and optimized.
For Int128 input we should probably combine trial divisors and Fermat's
method. For BigInts we cannot beat the algorithms implemented in gmp,
at least not with simple code and without using external (disk) storage.
For small integers (up to 64 bits) trial division based algorithms seem
to be good enough. All the presented algorithms work this way.
The differences are in
- the use of prime tables,
- using coprimes mod 2 or mod 30 for selecting trial divisors and
- the output data structure.
The original factor() Julia function returns the result in a dictionary,
of the form [prime_factor => exponent].
Other output alternative structures include
- a vector of the factors, repeated as needed, or a
- two-column matrix, the first column containing the prime factors in
increasing order, and the second column has the corresponding
exponents (the repetitions).
A significant portion of the running time is devoted to allocating the
data structure for the results. Dictionaries seem to be highly optimized,
even faster than allocating matrices.
If we provide a large enough buffer to the function for the result, we
can reuse it for later calls, saving the data allocation time. However,
the resulting functions will not be re-entrant, they cannot run
concurrently, unless different buffers are specified.
Up to 10,000 there are 1229 primes. A precomputed table for them allows
easy trials with these divisors, and nothing between them, as done in
Julia's factor() function. One could try every odd numbers as divisors,
which makes 5,000 trials below 10,000.
Trying divisors relative prime to 30 = 2*3*5, reduces the number of
trials to 2666, only about twice as many as at true prime divisors (to
the same limit of 10,000). The advantage of trying coprimes to 30 is
that the computation is done with indexing in two small tables of 29
entries each, without extra division or multiplication operations.
These tables could be of single byte entries, which leads to space
savings from the prime table (2.5KB -> 58B).
One can use another way to traverse the coprimes mod 30:
use an 8 byte table g = [6 4 2 4 2 4 6 2]; and increment the divisors
as d += g[(i=+1)&7+1]. It turned out slower than the implemented method.
The next logical step was to use as trial divisors the coprimes to
2*3*5*7 = 210. There are 48 of them in an interval of length 210.
(There are 105 odd numbers here, so we have more than a two fold
speed-up. However, the improvement over coprimes mod 30 is modest:
48/56 ~ 17%. The tables can still be of single byte entries, so the
storage is 420 bytes.)
On the other hand, there are 70 divisors coprime mod 6 in length 210
intervals, still significantly fewer than the odd numbers. The
corresponding scheme can be implemented without tables: just use
alternating step sizes of 2 and 4 (step = 6-step; d += step).
One can experiment with larger prime tables, too. Atkins' sieve built
in Julia allows very fast computation of this table (0.08 seconds in my
PC for primes(10^5)), but the storage space can become an issue: there
are 9592 primes below 100,000. In the code below we kept using
primes(10^4) for a fair comparison to Julia's factor() function.
Maybe the best trade off is using a list of primes < 2^16, which can be
stored in a Uint16 vector of length 6542 (13KB). (Julia uses 1229*8
~ 10KB in a 64 bit system.) That only needs an int(PRIMES3) type
conversion in the for loops scanning primes, with no significant time
penalty.
One can easily make the size of the used prime table configurable. It
can even be changed in the middle of the computation, freeing space, if
needed, vs. improving the speed of factorization. However, 13KB storage
can seldom be a bottleneck in current computers.
Running times are listed below, in my Dell Precision M4700 laptop,
Win7-x64, Julia Version 0.3.0-prerelease+2157.
Before one jumps to a conclusion about the best algorithm, we have to
look for code improvements. Also, different applications use different
ranges of inputs (where the runing times should be averaged). That's why
there are so many input intervals below, where running times are listed.
Nevertheless, we already see that Julia's factor() function can be sped
up with little effort. Currently factorW() and factorP() look the best,
while the term collecting algorithm in factorU() might be useful in
other applications, too.
The worst case running time is at large (64-bit) prime numbers. The
trial division algorithms have to try divisors till 2^32, most of which
are outside of any practical size prime table. With using coprime
divisors mod 30, only 4/15 of these numbers need to be actually tried,
with about 2^30 division (mod) operations. It is quite slow.
If many large primes are expected to be given as input, a good strategy
is to first test for primality, which is very fast. However, this does
not help if the input is the product of two large primes ~2^32. They
still need a lot of effort to factor.
~~~ CODE ~~~
module IntFactor #######################################################
export factorU, factorV, factorX, factorW, factorP
function factorV(n::Integer, V::Array{Int}) # factors of n into vector V
given at call
global a, b
local c, d, m
n = abs(n) # n -> Int
n < 2 && (return 0)
c = trailing_zeros(n) # c = number of prime divisors
n >>= c # remove factors of 2
V[1:c] = 2 # save factors of 2
d = m = 3
while d*d <= n # d <= sqrt(n) (updated constant) is
of similar speed for small n
if n % d == 0 # divrem() is too slow
n = div(n,d)
V[c+=1] = d
else
d+= a[m] # works starting from small primes:
d=m = 2,3,5,7,11,13,17,19,23 or 29
m = b[m] # speed-up over d+=2
end
end
n > 1 && (V[c+=1] = n) # last term
return c
end
function factorUV(n::Integer) # Calls factorV, returns
[factors,exponents] VECTORIZED (slow!)
global V # multi-threaded: allocate these
arrays locally
c = factorV(n,V) # repeated factors in V, in increasing
order
c == 0 && (return Array(Int,0,0))
return [unique(V[1:c]) diff(find([1,diff(V[1:c]),1]))]
end
function factorU(n::Integer) # Calls factorV, returns [factors,exponents]
global V, W1, W2 # multi-threaded: allocate these arrays
locally
local c, d, i, j, l
c = factorV(n,V) # repeated factors in V, in increasing
order
c == 0 && (return Array(Int,0,0))
d = V[1]; j = 1; l = 0
for i = 1:c # i must have been defined
if V[i] != d
W1[l+=1] = d; W2[l] = i-j
d = V[i]; j = i
end
end
W1[l+=1]=d; W2[l]=i-j+1 # last term
return [W1[1:l] W2[1:l]]
end
function factorX(n::Integer) # -> Array [factors,exponents] of n
global a, b, W1, W2
local c, d, k, m, s
n = abs(n) # n -> Int
n < 2 && (return Array(Int,0,0))
if iseven(n)
c = trailing_zeros(n)
n >>= c # remove trailing 0's
W1[1] = 2; W2[1] = c # save factors of 2
c = 1 # c = number of prime divisors
else c = 0
end
s = isqrt(n)
d = m = 3
while d <= s # d*d <= n (for small n it is of
similar speed)
if n % d == 0
n = div(n,d)
W1[c+=1] = d
for k = 1:81 # count divisions, k must be already
defined, 3^81>2^128
n % d > 0 && break
n = div(n,d) # divide out all divisors = d
end
W2[c] = k # store exponent
s = isqrt(n)
else
d+= a[m] # start from small primes:
mod(d,30) = m = 2,3...,29
m = b[m] # small speed-up over d+=2
end
end
if n > 1 # store the last remaining term
W1[c+=1] = n; W2[c] = 1
end
return [W1[1:c] W2[1:c]] # create the results data structure
end
function factorW(n::Integer) # -> Array [factors,exponents] of n
global a, b, W1, W2, PRIMES3, m0, d0
local c, d, k, m, s
n = abs(n) # n -> Int
n < 2 && (return Array(Int,0,0))
if iseven(n)
c = trailing_zeros(n)
n >>= c # remove trailing 0's
W1[1] = 2; W2[1] = c # save factors of 2
c = 1 # c = number of prime divisors
else c = 0
end
s = isqrt(n)
for d in PRIMES3
if d > s # finished if d*d > n
n > 1 && (W1[c+=1] = n; W2[c] = 1)
return [W1[1:c] W2[1:c]]
end
if n % d == 0
n = div(n,d) # divide out divisor d
W1[c+=1] = d
for k = 1:81 # count divisions, k must be already
defined, 3^81>2^128
n % d > 0 && break
n = div(n,d) # divide out all divisors = d
end
W2[c] = k # store exponent
s = isqrt(n)
end
end
m = m0; d = d0 # continue from last of PRIMES3 with
divisors coprime to 30
while d <= s # d*d <= n
if n % d == 0
n = div(n,d) # divide out divisor d
W1[c+=1] = d
for k = 1:81 # count divisions, k must be already
defined, 3^81>2^128
n % d > 0 && break
n = div(n,d) # divide out all divisors = d
end
W2[c] = k # store exponent
s = isqrt(n)
else
d+= a[m] # start from small primes: m =
mod(d,30) = 2,3...,29
m = b[m] # small speed-up over d+=2
end
end
n > 1 && (W1[c+=1] = n; W2[c] = 1) # store the final term
return [W1[1:c] W2[1:c]] # construct the results array
end
function factorP(n::Integer) # -> Dictionary [factor=>exponent] of n
global a, b, PRIMES3, m0, d0
local h, d, k, m, s
n = abs(n) # convert to Int >= 0
h = (Int=>Int)[] # Dictionary
n < 2 && return h # n = 0,1: empty dictionary
if iseven(n)
d = trailing_zeros(n)
n >>= d # remove trailing 0's
h[2] = d # store "2=>exponent"
end
s = isqrt(n)
for d in PRIMES3
if d > s # if d*d > n: done
n > 1 && (h[n] = 1) # last factor
return h
end
if n % d == 0
n = div(n,d) # divide out divisor d
for k = 1:81 # count divisions, k must be already
defined, 3^81>2^128
n % d > 0 && break
n = div(n,d) # divide out all divisors = d
end
h[d] = k # store factor=>exponent
s = isqrt(n) # n changed: update s = sqrt(n)
end
end
m = m0; d = d0 # continue from last of PRIMES3 with
divisors coprime to 30
while d <= s # d*d <= n
if n % d == 0
n = div(n,d) # divide out divisor d
for k = 1:81 # count divisions
n % d > 0 && break
n = div(n,d) # divide out all divisors = d
end
h[d] = k # store exponent
s = isqrt(n) # n changed: update s = sqrt(n)
else
d+= a[m] # from a small prime: m = mod(d,30)
= 2,3,5...,29
m = b[m] # fewer trials in [x+1:x+30] than
w. d+=2: 15->8
end
end
n > 1 && (h[n] = 1) # store final factor
return h
end
# using mod 2*3*5 = 30 coprimes for trial divisions.
const b = [7, 3, 5,0,
7,0,11,0,0,0,13,0,17,0,0,0,19,0,23,0,0,0,29,0,0,0,0,0,1] # traverse
coprimes to 30
const a = [6, 1, 2,0, 2,0, 4,0,0,0, 2,0, 4,0,0,0, 2,0, 4,0,0,0,
6,0,0,0,0,0,2] # steps: b[i]-i mod 30
const PRIMES3 = primes(10^4)[2:end] # 1228 primes > 2,
in Int vector (0.06s) to compute
const d0 = a[PRIMES3[end] % 30] + PRIMES3[end] # starting mod 30 coprime
trials
const m0 = b[PRIMES3[end] % 30] # ...from here
V = Array(Int, 63) # repeated factors, at most 63 terms (2^63)
W1 = Array(Int, 15) # prod(16 primes) > 2^64
W2 = Array(Int, 15)
end # module IntFactor #################################################
~~~ TIMING TESTS ~~~
using IntFactor
n1 = 2^48; n2 = n1+10^3
V = Array(Int,63)
factorV(9,V)
tic()
for i = n1:n2 # NO prime table, not collecting factors. -> last index in
external array
l = factorV(i,V)
end
println("factorV: $(round(toq(),2)) s")
factorU(9)
tic()
for i = n1:n2 # NO prime table, separate factor collection -> matrix
M = factorU(i)
end
println("factorU: $(round(toq(),2)) s")
factorX(9)
tic()
for i = n1:n2 # NO prime table, collected factors and exponents -> matrix
M = factorX(i)
end
println("factorX: $(round(toq(),2)) s")
factorW(9)
tic()
for i = n1:n2 # WITH prime table, collected factors and exponents ->
matrix
M = factorW(i)
end
println("factorW: $(round(toq(),2)) s")
factorP(9)
tic()
for i = n1:n2 # WITH prime table, collected factors and exponents ->
Dictionary
D = factorP(i)
end
println("factorP: $(round(toq(),2)) s")
factor(9)
tic()
for i = n1:n2 # Julia's factor (with prime table, all odd divisors) ->
Dictionary
D = factor(i)
end
println("-factor: $(round(toq(),2)) s")
print("Press Enter to exit"); readline(); quit() #######################
~~~ TIMING RESULTS ~~~
-- n1 = 1; n2 = 10^6
factorV: 1.11 s
factorU: 2.17 s
factorX: 1.81 s
factorW: 1.70 s
factorP: 1.59 s
-factor: 1.71 s
factorV: 1.07 s
factorU: 2.14 s
factorX: 1.74 s
factorW: 1.70 s
factorP: 1.56 s
-factor: 1.67 s
factorV: 1.04 s
factorU: 2.08 s
factorX: 1.82 s
factorW: 1.57 s
factorP: 1.50 s
-factor: 1.63 s
factorV: 1.04 s
factorU: 2.12 s
factorX: 1.80 s
factorW: 1.62 s
factorP: 1.44 s
-factor: 1.55 s
-- n1 = 1; n2 = 10^7
factorV: 22.42 s
factorU: 34.41 s
factorX: 32.69 s
factorW: 23.42 s
factorP: 22.08 s
-factor: 22.79 s
factorV: 22.92 s
factorU: 33.62 s
factorX: 30.52 s
factorW: 23.37 s
factorP: 22.18 s
-factor: 22.59 s
-- n1 = 1; n2 = 3*10^7
factorV: 100.83 s
factorU: 133.53 s
factorX: 127.57 s
factorW: 91.96 s
factorP: 87.80 s
-factor: 88.84 s
factorV: 102.5 s
factorU: 135.51 s
factorX: 127.69 s
factorW: 91.34 s
factorP: 85.88 s
-factor: 87.85 s
-- n1 = 1; n2 = 10^8
factorV: 555.87 s
factorU: 673.30 s
factorX: 662.81 s
factorW: 398.46 s
factorP: 401.33 s
-factor: 408.80 s
factorV: 561.25 s
factorU: 675.73 s
factorX: 664.98 s
factorW: 397.40 s
factorP: 403.60 s
-factor: 416.04 s
-- n1 = 2^48; n2 = n1+10^3
factorV: 5.46 s
factorU: 5.73 s
factorX: 6.59 s
factorW: 5.59 s
factorP: 5.68 s
-factor: 9.28 s
factorV: 5.73 s
factorU: 5.85 s
factorX: 5.80 s
factorW: 5.47 s
factorP: 5.74 s
-factor: 9.30 s
-- n1 = 2^48; n2 = n1+10^4
factorV: 59.04 s
factorU: 59.08 s
factorX: 60.67 s
factorW: 55.82 s
factorP: 64.95 s
-factor: 96.29 s
factorV: 55.27 s
factorU: 55.13 s
factorX: 59.79 s
factorW: 57.5 s
factorP: 58.38 s
-factor: 97.72 s
-- n1 = 2^56; n2 = n1+10^3
factorV: 75.59 s
factorU: 75.03 s
factorX: 92.25 s
factorW: 72.47 s
factorP: 71.09 s
-factor:124.81 s
factorV: 71.39 s
factorU: 73.33 s
factorX: 92.56 s
factorW: 73.45 s
factorP: 72.49 s
-factor:127.44 s