@shirleyquirk:

> I don't like my hack, hopefully someone else can help make it cleaner...

I don't like it either, or this algorithm in general, which is why I haven't 
worked on it directly, at least so far...

As you have identified, the biggest problem with the implementation is that it 
uses Nim's experimental `spawn` feature, which requires copying `seq`s between 
threads; that feature will likely soon be replaced with something better that 
avoids the copying between the foreground and background threads. As it 
currently exists, it will never be very satisfactory, so I would likely 
implement this manually using raw `Thread`s and pass the required data across 
using pointers, which, although still not "pretty", is more likely to work 
reliably.

However, using 12.2 Gigabytes for this algorithm is extremely high (although I 
don't know the number of threads you are using). Also, it looks like the 
Crystal implementation automatically limits the number of threads used, perhaps 
for this same reason, so that the printout shows Crystal only using three 
threads where eight seem to be available. The base prime value `seq` stores 
eight-byte `int`s, and for sieving over the maximum range there should be about 
200 million of these, so 1.6 Gigabytes per thread. The page-segment memory used 
per thread is only 384 Kilobytes maximum, and a pair of these per thread would 
only be six Megabytes, so that is negligible. Not copying the array is a huge 
step in the right direction, as then the one base primes array serves all 
threads.
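The arithmetic behind that 1.6 Gigabyte figure can be checked with a quick sketch (Python, purely illustrative; it assumes base primes up to about 2^32 are needed for the maximum range, and uses the x/ln(x) prime-counting approximation, which slightly undercounts the true π(2^32) ≈ 203 million):

```python
from math import log

# Back-of-envelope check of the memory figures above, assuming
# base primes up to about 2^32 and eight bytes per stored prime
# (Nim's default `int` on a 64-bit platform).
limit = 2 ** 32
approx_count = limit / log(limit)   # prime-counting approximation x / ln(x)
bytes_per_prime = 8
total_bytes = approx_count * bytes_per_prime

print(f"~{approx_count / 1e6:.0f} million base primes")
print(f"~{total_bytes / 1e9:.1f} GB per copy")
```

At roughly 1.5 to 1.6 Gigabytes per copy, copying this `seq` into each of eight threads accounts for most of the observed 12.2 Gigabytes on its own.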

Another thing that would be easy to do would be to reduce the stored base prime 
`seq` to `uint32`, as that is always enough, and just cast the values up as 
required, which would reduce the memory use by a factor of two. A slightly 
harder change would be to encode the base primes as single-byte deltas: the 
base prime values are all odd, and the largest prime gap up to the 32-bit 
number range is 354, but if encoded knowing all gaps are even, the largest 
delta is only 177, which is within the 255 range of a single byte. This would 
reduce memory use for the base prime value `seq` by another factor of four to 
about 200 Megabytes, which wouldn't be a problem. However, this isn't likely 
to entirely fix the memory use problem, as it looks like the implementation 
uses huge "nextp" `seq`s of the same length as the number of base primes, with 
one entry for each of the two bit planes each thread handles. This is over 
three Gigabytes per thread (and they can't be shared), although there is no 
real need for the entries to be 64-bit values, and 32-bit values would reduce 
this by a factor of two as above. A better solution would be to use a more 
efficient page-segment start address calculation, although that would require 
fairly extensive re-writing, and I would recommend my alternative 
implementation described below.
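The half-gap byte encoding is only a few lines; here is a sketch in Python (function names are mine, purely illustrative of the idea, not the Nim code):

```python
import array

def encode_half_gaps(odd_primes):
    """Encode a sorted list of odd primes as one byte per prime: each
    byte holds half the gap from the previous prime (starting from 3).
    All gaps between odd primes are even, so halving them keeps even
    the largest gap in the 32-bit range well within one byte."""
    deltas = array.array("B")
    prev = 3
    for p in odd_primes:
        half = (p - prev) // 2
        assert half <= 255, "half-gap must fit in one byte"
        deltas.append(half)
        prev = p
    return deltas

def decode_half_gaps(deltas):
    """Walk the byte deltas back up to the original prime values."""
    primes, prev = [], 3
    for d in deltas:
        prev += 2 * d
        primes.append(prev)
    return primes

odd_primes = [3, 5, 7, 11, 13, 17, 19, 23, 29, 31]
enc = encode_half_gaps(odd_primes)
assert decode_half_gaps(enc) == odd_primes
print(f"{len(odd_primes)} primes in {len(enc)} bytes")
```

One byte per prime instead of eight is where the factor-of-eight overall saving (two from `uint32`, four more from deltas) comes from; decoding is a running sum, so it costs one add per prime while iterating.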

Taking 22 seconds to initialize to this range, which should mostly be sieving 
the maximum base prime value range, is also ridiculously slow, as it should 
take only a second or two even single-threaded and can be optimized to take 
only a fraction of a second; of course, part of the reason this is so slow is 
that, although it uses some wheel factorization, it uses the 
"one-huge-memory-array" technique, which suffers horribly from 
"cache-thrashing". In a similar vein, taking almost 20 seconds to do a tiny 
page-segment sieve of a few hundred totative bit planes is also very slow, 
especially using several threads to do so. For instance, Kim Walisch's 
"primesieve" counts the twin primes over the small range near the high limit 
in a couple of seconds, even single-threaded.
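The difference between the "one-huge-memory-array" technique and page segmentation can be illustrated with a minimal Python sketch (plain byte-per-number flags, no wheel, and nothing like the real implementations; the point is only that each page's working set stays near L1/L2 cache size while the base primes are sieved just once):

```python
def simple_sieve(n):
    """Ordinary sieve for the base primes up to n."""
    is_prime = bytearray([1]) * (n + 1)
    is_prime[0:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            is_prime[p * p::p] = bytes(len(is_prime[p * p::p]))
    return [i for i in range(2, n + 1) if is_prime[i]]

def segmented_count(limit, page=32768):
    """Count primes <= limit one cache-sized page at a time, so the
    working set stays small instead of one huge memory array."""
    if limit < 2:
        return 0
    base = simple_sieve(int(limit ** 0.5))
    count = 0
    for low in range(2, limit + 1, page):
        high = min(low + page - 1, limit)
        seg = bytearray([1]) * (high - low + 1)
        for p in base:
            # first multiple of p inside this page (at least p * p)
            start = max(p * p, (low + p - 1) // p * p)
            if start > high:
                continue
            seg[start - low::p] = bytes(len(seg[start - low::p]))
        count += sum(seg)
    return count

print(segmented_count(10 ** 6))  # 78498
```

In a compiled language the same structure, with bit packing and a wheel, is what lets a segmented sieve run at a fraction of a second where a single huge array thrashes the caches.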

As to the problems with the algorithm: near the top of this thread, I replied 
to the OP, BLM2 (which is yet another account name used by the Crystal forum's 
contributor, Zakiya, the author of this algorithm), giving my objections to 
his work, which I will state again as follows:

  1. For years he has persisted in claiming the basis of the algorithm as his 
own, now calling it the Segmented Sieve of Zakiya (SSoZ), when it is nothing 
but a version of the Page-Segmented Sieve of Eratosthenes (SoE) with a slightly 
different form of wheel factorization, and refuses to acknowledge previous work 
that also uses this slightly different form such as [that of Andrew 
Booker](https://primes.utm.edu/nthprime/algorithm.php) or as per the SoE 
reference implementations for comparison with the implementation of the Sieve 
of Atkin [as per Daniel Bernstein's work in the eratspeed program as part of 
that project](http://cr.yp.to/primegen.html).
  2. Wheel factorization is often implemented with the wheel totatives each 
represented by one binary bit in a byte, word, or several successive words; for 
instance, for the wheel formed from the primes 2/3/5 which has eight totatives 
of 1/7/11/13/17/19/23/29 if including one or 7/11/13/17/19/23/29/31 if skipping 
past one, all totatives fit in one byte with each bit representing one binary 
value, either composite or not, and with successive bytes containing these 
values incremented by steps of 30.
  3. In Zakiya's version, in common with other previous work such as the above 
links, instead of combining all wheel factor totatives into one word or set of 
words, a separate bit plane is used for each of the totatives, so there would 
be eight bit-packed arrays used for the above example.
  4. For finding twin primes as per this implementation, two of these 
bit-packed totative arrays are used at a time by each thread so they can be 
checked for twins. For example, if the above wheel using primes up to 5 were 
used and the value one were skipped, the twin prime check pairs would be 
(1,2)/(3,4)/(6,7), which could be run on three threads.
  5. As the working range goes up, the size of the totative bit plane arrays 
would increase to inefficient sizes, so his algorithm divides this into 
smaller page segments, which are a maximum of 384 Kilobytes, or about the CPU 
L2 cache size.
  6. For very large upper limits, the implementation increases the "degree" of 
wheel factorization up to and including the prime of 17, which increases the 
number of totatives to 92160 and provides many more possible threaded tasks, 
with one task per pair of admissible bit planes, also keeping the number of 
page-segments required per bit plane quite small. Page-segmentation is 
required because the implementation allows looking for twin primes within a 
small range between a start value and an end value, as well as to keep the 
memory use low.
  7. If implemented correctly with a thread pool of only the maximum number of 
CPU threads, the maximum memory use per thread should be the maximum page 
segment size of 384 Kilobytes times two for the twin-primes search, plus 
storage for all of the base primes up to whatever upper limit as described 
above, but with the problem that the implementation requires a huge set of 
"nextp" `seq`'s.



My preferred more general way of implementing this would be as follows:

  1. Instead of a variable number of bit planes with range, I would choose a 
reasonable wheel factorization using the primes up to, say, 7, so 48 totative 
bit planes.
  2. I would choose a page-segment size of fixed increments whose size again 
goes up with range, as its span in bits needs to be at least as big as the 
square root of the represented top range per page, which is the maximum value 
of base prime required to cull the composites from the page; this isn't that 
different from the Zakiya implementation other than the above fixed number of 
totatives.
  3. Then, each page-segment would consist of all of the admissible totative 
bit planes using the entire page-segment bit plane buffer for each, with 
successive pages assigned to different threads from the thread pool, whose work 
would include comparing the admissible pairs for bit planes and counting the 
occurrences where a given bit position is prime for pairs of totative bit 
planes. In other words, instead of each thread handling only one pair of 
totative bit planes, each thread would handle all admissible totative bit 
planes and include the finding and counting of twin primes in adjacent pairs.
  4. The feeding and retrieving of results from the thread pool would continue 
until the entire range has been processed.
  5. The method is somewhat superior to that of Zakiya as it allows better 
optimizations in culling, but especially in the calculation of the culling 
start positions per totative bit plane, largely through Look-Up Tables, as I 
do for my implementation of [the Page-Segmented Sieve of Eratosthenes in a 
StackOverflow answer](https://stackoverflow.com/a/57108107/549617), or [the 
same in F# in another SO answer](https://stackoverflow.com/a/61057615/549617), 
which includes multi-threading and other optimizations (although limited by 
running on a JIT platform).
  6. Adding the processing to skip the inadmissible totatives that can't be 
part of twin primes, and to OR/AND pairs of bit planes together so as to count 
twin primes rather than just primes, is quite simple and could easily be 
accomplished by translating that code.
  7. I have written this in Nim, but I need to update it for the latest Nim 
versions and do a little work to improve the multi-threading. It is about as 
fast as Kim Walisch's "primesieve", and if I added the logic to skip the 
inadmissible totative bit planes it would be even faster, although if one 
wants the very highest speeds with upper limits above about 1e14, there is no 
avoiding using prime arrays totaling up to about 1.6 Gigabytes per thread for 
the highest limit, although these aren't used in the same way as the "nextp" 
arrays (they are part of a "bucket sieve" implementation).
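A minimal Python sketch of point 3's pairwise check (using plain Python integers as bit planes and naive trial-division primality, purely to illustrate the idea, not the real sieve): a single AND of the 11-mod-30 and 13-mod-30 planes finds every twin of that residue class at once.

```python
def plane(residue, modulus, limit):
    """Bit plane for one wheel residue class: bit k is set when
    modulus * k + residue is prime (naive test, for clarity only)."""
    def is_prime(n):
        if n < 2:
            return False
        return all(n % d for d in range(2, int(n ** 0.5) + 1))
    bits = 0
    k = 0
    while modulus * k + residue <= limit:
        if is_prime(modulus * k + residue):
            bits |= 1 << k
        k += 1
    return bits

LIMIT = 1000
p11 = plane(11, 30, LIMIT)
p13 = plane(13, 30, LIMIT)
both = p11 & p13    # one AND marks every (30k+11, 30k+13) twin pair
twins = [(30 * k + 11, 30 * k + 13)
         for k in range(LIMIT // 30 + 1) if both >> k & 1]
print(twins[:4])    # [(11, 13), (41, 43), (71, 73), (101, 103)]
```

With the convention flipped (a set bit meaning composite), the same check becomes an OR followed by counting zero bits; either way the twin count per admissible pair is one word-wide operation plus a popcount per machine word.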

