Edit: Corrected; I forgot to change the new count routine to 48 from 8...

We have proven, with a straight translation of Daniel Bernstein's C code to golang, that the Sieve of Eratosthenes (SoE) eratspeed runs at about the same speed as the Sieve of Atkin (SoA) primespeed for the same range, with the same optimizations and with John Barham's multi-threading for primespeed disabled (set numprocs = 1 in sieve.go); however, golang primespeed is slower with array bounds checks enabled, due to the complexity of the array operations for the prime square-free culling. For both, I eliminated the use of the "two" shift look-up table, as indexing it caused an extra bounds check, and I tuned the inner loops in exactly the same way to minimize tight-loop time.
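For reference, here is a minimal sketch (my own names, not Bernstein's or the posted code's) of the kind of tight culling loop in question, marking off multiples of a prime in a bit-packed page-segment buffer; forming the bit mask with a direct shift instead of indexing a pre-computed "two" table leaves only one bounds-checked access (the buffer itself) inside the hot loop:

package main

import "fmt"

// cullPage sets every step'th bit of buf, starting from bit index start.
func cullPage(buf []uint32, start, step int) {
	lim := len(buf) << 5 // total number of bits in the segment
	for i := start; i < lim; i += step {
		buf[i>>5] |= 1 << uint(i&31) // mask by shift: no table look-up
	}
}

func main() {
	seg := make([]uint32, 8) // a 256-bit toy segment
	cullPage(seg, 9, 3)      // cull bits 9, 12, 15, ...
	fmt.Printf("%032b\n", seg[0])
}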
This already shows that the SoA is inferior for page-segmented sieving to 1,000,000,000 (one billion): it takes only about 258,000,000 operations where SoE eratspeed takes about 405,000,000, yet runs no faster, meaning that the SoA operations are individually more complex, which is obvious. However, Bernstein crippled eratspeed by limiting it to 2/3/5 wheel factorization, where it is easily possible to use a 2/3/5/7 wheel plus pre-culling of the 11/13/17 primes to reduce the count to about 263,000,000 operations, roughly 35% fewer. The golang FasterEratspeed that implements this maximum wheel factorization is posted on the playground at https://play.golang.org/p/iKA_s7ocuN and runs about 33% faster than the previous standard wheel-factorized version, just as the ratio of operations suggests it should; also, because it minimizes array accesses, it matters little whether array bounds checking is enabled or not.

This is almost as fast as the highly optimized C++ primesieve (http://primesieve.org/) run single-threaded on my cache-access-speed-bottlenecked machine, but on faster machines it will likely be slower than primesieve, perhaps as much as twice as slow on high-end Intel desktop CPUs. Strictly speaking, this implementation isn't maximally factorized, as in some languages one can also pre-cull by 19; with golang, though, the overhead of creating and handling the larger pattern array roughly cancels the few percent gain.

So, with this implementation we are coming very close to C/C++ speeds and running faster than C#/Java. I learned much from the exercise and found that very slight changes in the formulation of the tight inner loop(s) can slow the code down by at least a factor of two and sometimes almost three. The best optimizations aren't always obvious, and it is time consuming to try the different formulations to find the fastest code. Sometimes adding instructions actually increases execution speed, as the compiler may then use registers in a way that causes fewer stalls or bottlenecks. For this particular loop and golang version 1.7beta1, the code is optimized quite effectively in spite of the array bounds checks, and even better without them.

That should be enough to prove that the SoA is just an interesting intellectual exercise: for page-segmented sieves such as these it has more complexity and is thus slower than the maximally factorized SoE, and it does not live up to its theoretical O(n) asymptotic complexity, as with increasing range there is an ever increasing cost in code overhead, whereas the SoE comes very close to maintaining its theoretical O(n log log n) performance.

To @Michael Jones: The error was that the program was calculating the primes to over a billion but only counting one sixth of them; the culling time (which is almost all of the time) is the same, and it wouldn't take any longer to show the last prime. So this program is almost six times faster than yours, as it likely should be, given that it uses the extreme factorization to reduce the operations by about a factor of four from a simple odds-only SoE. Yes, the maximum wheel factorization is a little complicated, but not that much more complicated than the original C eratspeed; much of the extra code implements the pre-cull pattern and copies it into the page-segment buffers, but about half of the speed gain comes from that, so it was worth it.
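For anyone following along, a quick sketch of where the numbers come from: the residue classes modulo 210 = 2*3*5*7 that are coprime to 210 are exactly the classes a 2/3/5/7 wheel still has to sieve, and there are phi(210) = 48 of them; the 11/13/17 pre-cull pattern then repeats every 11*13*17 = 2431 wheel spans (510,510 numbers), which is why it can be built once and copied into the page-segment buffers:

package main

import "fmt"

func main() {
	var residues []int
	for r := 1; r < 210; r += 2 { // even residues are never coprime to 210
		if r%3 != 0 && r%5 != 0 && r%7 != 0 {
			residues = append(residues, r)
		}
	}
	fmt.Println(len(residues)) // 48
	fmt.Println(residues[:6])  // [1 11 13 17 19 23]
}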
This is a different algorithm than many SoE implementations: rather than packing all of the wheel residuals into the same word or adjacent words, it uses a separate bit-packed array for each residual, with one bit across all 48 residual arrays representing a range of 210 (see the indexing sketch at the end of this message). Thus, instead of the factor-of-two bit packing of an odds-only sieve, the packing is 48 bits per range of 210, a factor of over four. And it is to processing each residual separately that the sieve owes its speed. It may look complicated, but it is less complicated than the extreme loop unrolling that Kim Walisch had to resort to in order to get the performance of primesieve.

That said, this code was originally written to demonstrate a point and is nowhere near a state suitable for general use; the following improvements are needed to get it there:

1) Instead of providing a "qtab" of pre-calculated base primes less than the square root of the desired range, a secondary stream of primes from the same program should be generated and fed back into the calculations; that way this could be an "unbounded sieve" from which one could draw primes almost without limit (other than time, of course).

2) Similarly, the pre-initialized "next" array should be generated on the fly; otherwise the program can't be multi-threaded, as every thread depends on the state of the "next" array, and it changes for every page segment.

3) In order to calculate the "next" array on the fly, the calculation of the start addresses per modulus residual needs to be made even more efficient (a basic approach is sketched at the end of this message).

4) Even with more efficient start-address calculations using further look-up tables, as the range increases an ever larger percentage of the time will go into recalculating the "next" array as the number of base primes grows; it would be better to let the page size grow automatically, roughly as the square root of the range, to keep this manageable.

5) If we let the page-segment size increase, we will lose cache associativity unless we implement a "double buffered" culling scheme: cull all the primes for one residual over an L1-cache-sized subsection of the page segment, move up through the entire page segment that way, then cull in the same manner for each of the other residual segment-plane arrays.

6) Once that is done, we can apply multi-threading, which will reduce the time on most 4-true-core CPUs by about a factor of 4, so down to about 0.07 seconds on your machine (if it is 4-core; you didn't mention which CPU your MacBook Pro has), maybe slightly less if you hadn't yet compiled it with "-B".

60 to 70 ms for primes to one billion on a notebook isn't bad; that's about as fast as primesieve, scaled by probable CPU clock speed. And once I do the above optimizations, it will scale to sieving to 10^14 in a few hours.

I had thought of dropping this at this point, as I've learned most of what I intended to, but I'm having fun and may continue. I won't build on the current program format, as it isn't conducive to an "unbounded sieve", and will probably move the good bits to a new structure. And of course the algorithm is applicable to any language, be it Go, Haskell, C/C++, Scheme, F#, C#, Java, Scala, Clojure, Kotlin, whatever... I have proven that the SoA is nothing compared to a fully optimized SoE, though.
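To make the storage scheme concrete, here is a minimal sketch (my own identifiers, not the posted code's) of the bit layout described above: one bit-packed "plane" per residue class, where bit w of a plane covers the single member of that class in the interval [210*w, 210*(w+1)):

package main

import "fmt"

// residues holds the 48 totatives of 210; its index is the plane number.
var residues = func() []int {
	var rs []int
	for r := 1; r < 210; r += 2 {
		if r%3 != 0 && r%5 != 0 && r%7 != 0 {
			rs = append(rs, r)
		}
	}
	return rs
}()

// planeOf maps a residue back to its plane index (valid only for totatives).
var planeOf [210]int

func init() {
	for i, r := range residues {
		planeOf[r] = i
	}
}

// locate maps an n coprime to 210 to its (plane, bit) coordinates, so a
// span of 210 numbers costs only 48 bits across all of the planes combined.
func locate(n int) (plane, bit int) {
	return planeOf[n%210], n / 210
}

func main() {
	fmt.Println(locate(11))  // plane 1, bit 0
	fmt.Println(locate(221)) // 221 = 210 + 11: plane 1, bit 1
	fmt.Println(locate(209)) // plane 47, bit 0
}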
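And on point 3), one possible (untested, simplified) starting point, reusing the residues/planeOf tables from the sketch above: since gcd(p, 210) = 1 for every base prime p, the products p*q over the 48 totatives q land in each residue class exactly once, which gives the per-plane start addresses directly; the real code needs this to be faster still, e.g. via more look-up tables:

// firstMultiples finds, for base prime p, the lowest multiple of p at or
// above low in each of the 48 residue planes (a real sieve would also
// clamp these to at least p*p before converting them to bit indices).
func firstMultiples(p, low int) [48]int {
	var starts [48]int
	for _, q := range residues {
		m := p * q // a multiple of p that is coprime to 210
		if m < low {
			k := (low - m + 210*p - 1) / (210 * p) // steps of 210*p keep
			m += k * 210 * p                       // the residue class fixed
		}
		starts[planeOf[m%210]] = m
	}
	return starts
}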