Edit: Corrected; I had forgotten to change the new count routine to scan 48 residuals rather than 8...

We have proven, with a straight translation of Daniel Bernstein's C code to
golang, that the Sieve of Eratosthenes (SoE) eratspeed runs at about the same
speed as the Sieve of Atkin (SoA) primespeed for the same range with the same
optimizations and with John Barham's multi-threading for primespeed disabled
(set numprocs = 1 in sieve.go); however, golang primespeed is slower with array
bounds checks enabled due to the complexity of the array operations for the
prime square-free culling.  For both I eliminated the use of the "two" shift
look-up table, as it caused an extra bounds check, and tuned the inner loops in
exactly the same way to minimize tight-loop time.
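
As an illustration of eliminating the "two" table (a minimal sketch, not the
actual eratspeed code), a direct shift sets the same bit with one slice access,
and hence one bounds check, instead of two:

    // With a look-up table: both buf and two are indexed, and depending
    // on the compiler version each access may carry a bounds check.
    var two = [8]uint8{1, 2, 4, 8, 16, 32, 64, 128}

    func setBitTable(buf []uint8, i int) {
        buf[i>>3] |= two[i&7]
    }

    // With a shift: only buf is indexed.
    func setBitShift(buf []uint8, i int) {
        buf[i>>3] |= 1 << uint(i&7)
    }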

It has now been shown that the SoA is inferior for page-segmented sieving to
1,000,000,000 (one billion): it uses only about 258,000,000 operations where
SoE eratspeed uses about 405,000,000, yet runs no faster, meaning that the SoA
operations are individually more complex, which is obvious.  However, Bernstein
crippled eratspeed to only 2/3/5 wheel factorization, where it is easily
possible to use a 2/3/5/7 wheel plus pre-culling of the 11/13/17 primes to
reduce the count to about 263,000,000 operations, or about 35% fewer.
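
The ratio of operations can be sanity-checked from first principles; here is a
back-of-the-envelope estimate (my own, not taken from the eratspeed sources),
using the fact that culling prime p over a range of n with a wheel that keeps
the fraction "density" of the numbers costs about n*density/p operations:

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        n := 1e9
        lim := int(math.Sqrt(n))
        // Simple sieve for the base primes up to sqrt(n).
        comp := make([]bool, lim+1)
        var primes []int
        for i := 2; i <= lim; i++ {
            if !comp[i] {
                primes = append(primes, i)
                for j := i * i; j <= lim; j += i {
                    comp[j] = true
                }
            }
        }
        // Estimated cull operations: n*density/p for each base prime
        // at or above the first prime the wheel doesn't already handle.
        est := func(density float64, firstPrime int) float64 {
            total := 0.0
            for _, p := range primes {
                if p >= firstPrime {
                    total += n * density / float64(p)
                }
            }
            return total
        }
        soe235 := est(8.0/30.0, 7)    // 2/3/5 wheel culls from 7 up
        soeMax := est(48.0/210.0, 19) // 2/3/5/7 wheel, 11/13/17 pre-culled
        fmt.Printf("2/3/5 wheel:    ~%.0f million ops\n", soe235/1e6)
        fmt.Printf("max factorized: ~%.0f million ops\n", soeMax/1e6)
        fmt.Printf("ratio: %.2f\n", soeMax/soe235)
    }

The ratio comes out near 0.65, in line with the counted operations above.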

The golang FasterEratspeed that implements this maximal wheel factorization is
posted on the Go playground at https://play.golang.org/p/iKA_s7ocuN and runs
about 33% faster than the previous standard wheel-factorized version, just as
the ratio of operations suggests it should; also, because it minimizes array
accesses, it matters little whether array bounds checking is enabled or not.
This is almost as fast as the highly optimized C++ primesieve
(http://primesieve.org/) run single-threaded on my cache-access-speed
bottle-necked machine, but on faster machines it will likely be slower than
primesieve, by as much as a factor of two on high-end Intel desktop CPUs.

Actually, this implementation isn't maximally factorized, as in some languages
one can also pre-cull by 19, but in golang the overhead of creating and
handling the correspondingly larger pattern array roughly cancels the few
percent gain.
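
The size trade-off is easy to see (my own rough sizing, assuming one bit per
210-wide block per residual plane as described below): the pre-cull pattern
repeats over the product of the pre-culled primes, so including 19 makes it 19
times larger:

    package main

    import "fmt"

    func main() {
        // The 11/13/17 pattern repeats every 11*13*17 blocks of 210,
        // one bit per block in each of the 48 residual planes.
        blocks := 11 * 13 * 17 // 2431 blocks
        fmt.Printf("11/13/17 pattern: %d bits/plane, ~%d KiB total\n",
            blocks, blocks/8*48/1024)
        // Adding 19 multiplies the repeat length by 19.
        blocks19 := blocks * 19 // 46189 blocks
        fmt.Printf("plus 19 pattern:  %d bits/plane, ~%d KiB total\n",
            blocks19, blocks19/8*48/1024)
    }

A pattern of roughly 14 KiB copies out of L1 cache; at roughly 270 KiB the
copies start missing cache, which plausibly eats the few percent saved in
culling.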

So, with this implementation we are coming very close to C/C++ speeds and
faster than C#/Java; however, I learned much from the exercise and found that
very slight changes in the formulation of the tight inner loop(s) can cut the
speed by at least a factor of two and sometimes almost three.  The best
optimizations aren't always obvious, and it is time-consuming to try the
different formulations to find the fastest code.  Sometimes adding instructions
actually reduces the execution time, as the compiler may change its use of
registers in a way that produces fewer stalls or bottlenecks.  For this
particular implementation and golang version 1.7beta1, this particular loop
seems to be optimized quite effectively in spite of the array bounds check, and
even better without it.
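
As an example of the kind of reformulation that can matter (a hypothetical
sketch, not the exact loops from FasterEratspeed): comparing a byte index
directly against len(buf) lets the compiler prove the access in range, where a
shifted bit index often does not:

    // A straightforward bit-culling loop: the index expression j>>3
    // generally keeps a bounds check inside the loop, since the
    // compiler cannot easily prove it in range in this form.
    func cullSimple(buf []uint8, j, p int) {
        lim := len(buf) << 3
        for ; j < lim; j += p {
            buf[j>>3] |= 1 << uint(j&7)
        }
    }

    // One strided pass with a fixed mask: testing the byte index i
    // against len(buf) makes the bounds proof trivial.  For odd p,
    // eight such passes (one per bit position, each with its own mask
    // and start offset, stepping p bytes) cull the same bits as above.
    func cullStrided(buf []uint8, i, p int, mask uint8) {
        for ; i < len(buf); i += p {
            buf[i] |= mask
        }
    }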

That should be enough to prove that the SoA is just an interesting intellectual
exercise: for page-segmented sieves such as these it has more complexity and is
thus slower than the maximally factorized SoE, and it does not sustain its
theoretical O(n) asymptotic complexity, as with increasing range there is an
ever increasing cost in code overhead, whereas the SoE comes very close to
maintaining its theoretical O(n log log n) performance.

To @Michael Jones:  The error was that it was calculating the primes to over a
billion but only counting one sixth of them; the culling time (which is almost
all of the time) is unchanged, and it wouldn't take any longer to show the last
prime.  So this program is almost six times faster than yours, as it likely
should be, in that it uses the extreme factorization to reduce the operations
by about a factor of four compared to a simple odds-only SoE.
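
On the count fix noted in the edit at the top: the count has to scan all 48
residual planes, not 8.  A minimal sketch of such a count (my own illustration,
using the math/bits intrinsic rather than whatever table the posted code uses):

    import "math/bits"

    // countPrimes tallies the zero (uncancelled) bits across all 48
    // residual planes; each bit column spans a range of 210.  The
    // wheel primes 2/3/5/7 and any tail bits past the limit still
    // need a separate adjustment.
    func countPrimes(planes *[48][]uint8) int {
        count := 0
        for r := 0; r < 48; r++ {
            for _, b := range planes[r] {
                count += 8 - bits.OnesCount8(b)
            }
        }
        return count
    }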

Yes, it is a little complicated doing the maximum wheel factorization, but not
that much more complicated than the original C eratspeed; much of the extra
code comes from implementing the pre-culling pattern and copying it into the
page-segment buffers, but about half the speed gain comes from that, so it was
worth it.  This is a different algorithm than many SoE implementations: rather
than putting all wheel residuals in the same word or adjacent words, it uses a
separate bit-packed array for each residual, with one bit across all 48
residual arrays representing a range of 210.  Thus, instead of the factor-of-two
bit packing of an odds-only sieve, there is a bit packing of 48 bits per 210
numbers, a factor of over four.  It is to processing each residual separately
that it owes its speed.  It may look complicated, but it is less complicated
than the extreme loop unrolling that Kim Walisch had to resort to in order to
get the performance of primesieve.
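
Concretely (a hypothetical sketch of the layout with illustrative names; the
posted code's identifiers differ): bit b of residual plane r stands for the
number 210*b + residues[r]:

    // The 48 residues modulo 210 that are coprime to 2, 3, 5 and 7.
    var residues = [48]int{
        1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
        53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103,
        107, 109, 113, 121, 127, 131, 137, 139, 143, 149, 151, 157,
        163, 167, 169, 173, 179, 181, 187, 191, 193, 197, 199, 209,
    }

    // numberAt maps plane r and bit index b back to the number they
    // represent: all 48 planes together pack the 48 candidates of
    // each 210-wide span into 48 bits, over four numbers per bit.
    func numberAt(r, b int) int {
        return 210*b + residues[r]
    }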

That said, this code was originally written to demonstrate a point and is
nowhere near a state fit for general use; the following improvements are needed
to make it suitable:

1) Instead of providing a "qtab" of pre-calculated base primes less than the
square root of the desired range, a secondary stream of primes from the same
program should be generated and fed back into the calculations; in that way
this could be an "unbounded sieve" from which one could draw primes almost
without limit (other than time, of course).

2) Similarly, the pre-initialized "next" array should be generated on the fly;
otherwise the program cannot be multi-threaded, as every thread depends on the
state of the "next" array, and it changes for every page segment.

3) In order to calculate the "next" array on the fly, the calculation of the
start addresses per modulo residual needs to be made even more efficient.

4) Even with more efficient start-address calculations via further look-up
tables, as the range increases an ever larger percentage of the time will be
spent calculating the "next" array as the number of base primes grows; it would
be better to let the page size grow automatically, roughly as the square root
of the range, in order to keep this manageable.

5) If we let the page-segment size increase, then we will lose cache
associativity unless we implement a "double buffered" culling scheme, where we
cull for all the primes for one residual over an L1-cache-sized sub-section of
the page segment, then move up through the entire page segment before culling
in the same manner for all the other residual segment-plane arrays (see the
loop-structure sketch after this list).

6) Once that is done, we can apply multi-threading, which will reduce the time
on most 4-true-core CPUs by about a factor of 4, so down to about 0.07 seconds
on your machine (if it is 4-core; you didn't mention which CPU your MacBook Pro
has), maybe slightly less if you hadn't yet compiled it with "-B".
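
For point 5, a rough sketch of the double-buffered loop structure (hypothetical
shape and names, assuming the plane-per-residual layout described earlier;
cullPlaneRange stands in for the real per-prime cull, which would resume from
each prime's saved "next" position):

    const l1Bytes = 32 * 1024 // assumed L1 data-cache size

    // cullSegment processes one page segment an L1-sized sub-section
    // at a time, applying every base prime to the hot sub-section
    // before moving on, then repeats for each residual plane.
    func cullSegment(planes *[48][]uint8, basePrimes []int) {
        for r := 0; r < 48; r++ {
            plane := planes[r]
            for lo := 0; lo < len(plane); lo += l1Bytes {
                hi := lo + l1Bytes
                if hi > len(plane) {
                    hi = len(plane)
                }
                sub := plane[lo:hi]
                for _, p := range basePrimes {
                    cullPlaneRange(sub, p)
                }
            }
        }
    }

    // cullPlaneRange is a placeholder strided cull; the real version
    // would use the prime's start offset and bit mask for this plane.
    func cullPlaneRange(sub []uint8, p int) {
        for i := 0; i < len(sub); i += p {
            sub[i] |= 1
        }
    }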

60 to 70 ms for primes to one billion on a notebook isn't bad.  That's about as 
fast as primesieve scaled by probable CPU clock speed.  And once I do the above 
optimizations, it will scale to sieving to 10^14 in a few hours.

I had thought of dropping it at this point, as I've learned most of what I 
intended, but I'm having fun and may continue.  I won't build on the current 
program format as it isn't so conducive to an "unbounded sieve" and will 
probably move the good bits to a new structure.  And of course the algorithm is 
applicable to any language, be it Go, Haskell, C/C++, Scheme, F#, C#, Java, 
Scala, Clojure, Kotlin, whatever...

I have proven that the SoA is no match for a fully optimized SoE, though.
