@shirleyquirk: > I don't like my hack, hopefully someone else can help make it cleaner...
I don't like it either, or this algorithm in general, which is why I haven't worked on it directly, at least so far... As you have identified, the biggest problem with the implementation is that it uses Nim's experimental `spawn` feature, which requires copying `seq`s between threads; that feature will likely soon be replaced with something better that avoids the copying between the foreground and background threads. As it currently exists, it will never be very satisfactory, so I would likely implement this manually using raw `Thread`s and pass the required data across using pointers, which, although still not "pretty", is more likely to work reliably.

However, using 12.2 Gigabytes for this algorithm is extremely high (although I don't know the number of threads you are using). It also looks like the Crystal implementation automatically limits the number of threads used, perhaps for this same reason, since the printout shows Crystal using only three threads where eight seem to be available. The base prime `seq` stores eight-byte `int`s, and for sieving over the maximum range there should be about 200 million of these, so 1.6 Gigabytes per thread. The page-segment memory used per thread is only 384 Kilobytes maximum, and a pair of these per thread would only be about six Megabytes across eight threads, so that is negligible.

Not copying the array is a huge step in the right direction, as then the one base primes array serves all threads. Another easy improvement would be to reduce the stored base prime `seq` to `uint32`, as that is always enough, and just cast the values up as required, which would cut the memory use by a factor of two. A slightly harder improvement would be to encode the base primes as single-byte deltas: the base prime values are all odd, and the largest prime gap up to the 32-bit number range is 354, but if encoded knowing all gaps are even, the largest delta is only 177, which is within the 255 range of a single byte.
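The delta-encoding idea is simple enough to sketch; here is a minimal illustration in Python (the function names are mine, not from any of the implementations discussed):

```python
def encode_deltas(odd_primes):
    # Store each gap between consecutive odd base primes as half its value:
    # all gaps between odd primes are even, and half the largest gap in the
    # 32-bit range fits in a single byte, so one byte per prime suffices.
    deltas = bytearray()
    for lo, hi in zip(odd_primes, odd_primes[1:]):
        half_gap = (hi - lo) // 2
        assert half_gap <= 255  # holds for all primes below 2**32
        deltas.append(half_gap)
    return odd_primes[0], deltas

def decode_deltas(first, deltas):
    # Walk the byte deltas to recover the original prime values,
    # casting each half-gap back up as it is consumed.
    primes = [first]
    for d in deltas:
        primes.append(primes[-1] + 2 * d)
    return primes
```

So `[3, 5, 7, 11, 13]` encodes to the byte deltas `[1, 1, 2, 1]` plus the starting value, one byte per prime instead of eight, and decoding is a trivial running sum done as the base primes are consumed per page segment.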
This would reduce memory use for the base prime value `seq` by another factor of four, to about 200 Megabytes, which wouldn't be a problem. However, this isn't likely to entirely fix the memory use problem, as it looks like the implementation uses huge "nextp" `seq`s of the same length as the number of base primes, with one entry for each of the two bit planes each thread handles. This is over three Gigabytes per thread (and they can't be shared), although there is no real need for the entries to be 64-bit values; 32-bit values would reduce this by a factor of two, as above. A better solution would be to use a more efficient page-segment start address calculation, although that would require fairly extensive rewriting, and I would recommend my alternative implementation described below.

Taking 22 seconds to initialize to this range, which should mostly be sieving up to the maximum base prime value, is also ridiculously slow, as it should take only a second or two even single-threaded and can be optimized to take only a fraction of a second; part of the reason it is so slow is that, although it uses some wheel factorization, it uses the "one-huge-memory-array" technique, which suffers horribly from cache thrashing. In a similar vein, taking almost 20 seconds to do a tiny page-segment sieve of a few hundred totative bit planes is also very slow, especially using several threads to do so. For instance, Kim Walisch's "primesieve" counts the twin primes over the small range near the high limit in a couple of seconds, even single-threaded.

As to the problems with the algorithm: near the top of this thread, I replied to the OP, BLM2 (which is yet another account name of the Crystal forum's contributor, Zakiya, the author of this algorithm), giving my objections to his work, which I will state again as follows:

1. For years he has persisted in claiming the basis of the algorithm as his own, now calling it the Segmented Sieve of Zakiya (SSoZ), when it is nothing but a version of the Page-Segmented Sieve of Eratosthenes (SoE) with a slightly different form of wheel factorization, and he refuses to acknowledge previous work that also uses this slightly different form, such as [that of Andrew Booker](https://primes.utm.edu/nthprime/algorithm.php) or the SoE reference implementations written for comparison with the Sieve of Atkin, [as per Daniel Bernstein's "eratspeed" program in the primegen project](http://cr.yp.to/primegen.html).
2. Wheel factorization is often implemented with the wheel totatives each represented by one binary bit in a byte, word, or several successive words; for instance, the wheel formed from the primes 2/3/5 has eight totatives, 1/7/11/13/17/19/23/29 if including one or 7/11/13/17/19/23/29/31 if skipping past one, so all totatives fit in one byte, with each bit marking one value as composite or not and with successive bytes covering values incremented by steps of 30.
3. In Zakiya's version, in common with other previous work such as the above links, instead of combining all wheel totatives into one word or set of words, a separate bit plane is used for each totative, so there would be eight bit-packed arrays used for the above example.
4. For finding twin primes as per this implementation, two of these bit-packed totative arrays are used at a time by each thread so they can be checked for twins. For example, if the above wheel using primes up to 5 were used and the value one were skipped, the twin prime check pairs would be the bit planes with indices (1,2)/(3,4)/(6,7), which could be run on three threads.
5. As the working range goes up, the size of the totative bit plane arrays would increase to inefficient sizes, so his algorithm divides them into smaller page segments of a maximum of 384 Kilobytes, or about the CPU L2 cache size.
6. For very large upper limits, the implementation increases the "degree" of wheel factorization up to and including the prime of 17, which increases the number of totatives to 92160 and provides many more possible thread applications, with one threaded task per pair of admissible bit planes, while also keeping the number of page segments required per bit plane quite small. Page segmentation was required both because the implementation allows looking for twin primes in a small range between a start value and an end value and to keep the memory use low.
7. If implemented correctly with a thread pool of only the maximum number of CPU threads, the maximum memory use per thread should be the maximum page segment size of 384 Kilobytes times two for the twin-primes search, plus storage for all of the base primes up to whatever upper limit as described above, but with the problem that the implementation requires a huge set of "nextp" `seq`s.

My preferred, more general way of implementing this would be as follows:

1. Instead of a number of bit planes that varies with range, I would choose a reasonable wheel factorization using the primes up to, say, 7, so 48 totative bit planes.
2. I would choose a page-segment size in fixed increments whose size again goes up with range, as its span in bits needs to be at least as big as the square root of the represented top value per page, which is the maximum base prime required to cull the composites from the page; this isn't that different from the Zakiya implementation other than the above fixed number of totatives.
3. Each page segment would then consist of all of the admissible totative bit planes, using the entire page-segment bit plane buffer for each, with successive pages assigned to different threads from the thread pool, whose work would include comparing the admissible pairs of bit planes and counting the occurrences where a given bit position is prime for pairs of totative bit planes. In other words, instead of each thread handling only one pair of totative bit planes, each thread would handle all admissible totative bit planes, including the finding and counting of twin primes in adjacent pairs.
4. The feeding of work to, and retrieving of results from, the thread pool would continue until the entire range has been processed.
5. This method is somewhat superior to Zakiya's as it allows better optimizations in culling, but especially in the calculation of the culling start positions per totative bit plane, largely through lookup tables, as I do in my implementation of [the Page-Segmented Sieve of Eratosthenes in a StackOverflow answer](https://stackoverflow.com/a/57108107/549617), or [the same in F# in another SO answer](https://stackoverflow.com/a/61057615/549617), which includes multi-threading and other optimizations (although limited by running on a JIT platform).
6. Extending that processing to skip the totatives not able to be part of twin primes, and to OR/AND bit planes together so as to count twin primes rather than just primes, is quite simple and could easily be accomplished by translating that code.
7. I have written this in Nim, but I need to update it for the latest Nim versions and do a little work to improve the multi-threading.
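The fixed 48-totative wheel and its admissible twin pairs are easy to compute; a minimal sketch in Python (illustrative only, not from any of the implementations above):

```python
from math import gcd

WHEEL = 2 * 3 * 5 * 7  # wheel circumference of 210 using the primes up to 7

# The totatives are the residues coprime to the wheel; each one gets its
# own bit plane in the scheme described above.
totatives = [r for r in range(1, WHEEL) if gcd(r, WHEEL) == 1]

# Only residue pairs (r, r + 2) with both members coprime to the wheel can
# contain twin primes above 7; each such pair names the pair of adjacent
# bit planes to combine when counting twins (the modulo handles the pair
# that wraps into the next wheel rotation).
tot_set = set(totatives)
twin_pairs = [(r, r + 2) for r in totatives if (r + 2) % WHEEL in tot_set]
```

This yields 48 totative bit planes of which only 15 adjacent pairs need to be combined for the twin-prime count; running the same totative computation with the primes up to 17 (a wheel circumference of 510510) gives the 92160 totatives mentioned above.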
That Nim implementation is about as fast as Kim Walisch's "primesieve", and if I added the logic to skip the inadmissible totative bit planes it would be even faster. However, if one wants the very highest speeds with upper limits above about 1e14, there is no avoiding base prime arrays totaling up to about 1.6 Gigabytes per thread at the highest limit, although these aren't used in the same way as the "nextp" arrays (they are part of a "bucket sieve" implementation).
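To show why the huge "nextp" state arrays aren't needed, the culling start position for a base prime within a page segment can simply be recomputed from the segment's start value; here is a minimal sketch over a plain odds-only bit plane in Python (names are mine; a real bit-plane sieve would do this per totative residue, typically via lookup tables as mentioned above):

```python
def first_cull_index(p, low):
    # Return the bit index, relative to the odd value `low`, of the first
    # odd multiple of the odd base prime p that is >= max(p * p, low).
    # `low` is assumed odd; bit i of the plane represents low + 2 * i.
    start = max(p * p, ((low + p - 1) // p) * p)  # round low up to a multiple of p
    if start % 2 == 0:  # multiples of an odd p alternate parity; step to the odd one
        start += p
    return (start - low) // 2
```

Recomputing this per prime per page costs a division each time, which is why a lookup-table formulation is preferable for speed, but either way it removes the need to persist a multi-Gigabyte next-position entry per base prime per bit plane between segments.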
