A quick update. I did some simple, coarse optimizations that give appreciable speed increases. But first, some Nim issues to be aware of and correct.
**Errata and Nim bugs**

1) In proc `twins_sieve` (line 254), the `seg` array is initialized as

```nim
var seg = newSeq[uint8](KB shr 3)
```

This can cause a runtime error: the buffer can be one byte short, or effectively become `var seg = newSeq[uint8](0)` for small `KB`. Compiling with `-d:danger` produces no runtime errors and everything still works, but I don't know why (the D version produces runtime errors where expected; link below).

[https://gist.github.com/jzakiya/ae93bfa03dbc8b25ccc7f97ff8ad0f61](https://gist.github.com/jzakiya/ae93bfa03dbc8b25ccc7f97ff8ad0f61)

I've now coded it correctly (as originally done), so the buffer is always at least 1 byte long and has the correct length for all `KB` values:

```nim
var seg = newSeq[uint8]((KB shr 3) + 1)
```

2) Starting with Nim 0.19.6 through the present (0.20.0), the program sometimes hangs (requiring a hard abort) for some small values (< 1e10). At first I thought it had something to do with the progress indicator on line 341 in the main proc `twinprimes_ssoz`, but it still hangs even with that line commented out, so it must be a pure threading issue. Something must have changed since 0.19.4.

**Performance Increases**

As stated previously, a simple, straightforward way (no big code changes needed) to increase speed is to better fine-tune the segment sizes. Line 309

```nim
let B = Bn * 1024 * 8  # set seg size to optimize for selected PG
```

provides a very coarse way to experiment with segment sizes. Below are updated results of changing the factor `n` in `Bn * 1024 * n`. I found that using the smaller values 4 and 6 for `n` gives appreciably better times for lower values of N than using 8 as a constant factor. Some results below.
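The buffer sizing issue in 1) is a plain truncation error from integer shift. A minimal sketch of the two sizing expressions, pulled out into standalone procs (these names are illustrative, not from the actual program):

```nim
# KB is the number of resgroups per segment; each byte packs 8 resgroup flags,
# so the byte count must round UP, never down.

proc segBytesBuggy(KB: uint): uint =
  # Truncating shift: one byte short whenever KB is not a multiple of 8,
  # and a zero-length buffer when KB < 8.
  KB shr 3

proc segBytesFixed(KB: uint): uint =
  # The corrected sizing: always at least 1 byte, and enough bytes
  # to hold KB flags for every KB value.
  (KB shr 3) + 1

doAssert segBytesBuggy(7) == 0    # zero-length seq: any write is out of bounds
doAssert segBytesBuggy(9) == 1    # 9 flags need 2 bytes, got 1
doAssert segBytesFixed(7) == 1
doAssert segBytesFixed(9) == 2
```

With `-d:danger` Nim drops bounds checking, which is presumably why the out-of-bounds writes go unreported there while the equivalent D code traps them.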
| N | n = 8 | n = 6 | n = 4 |
|---------------------|----------|----------|----------|
| 100_000_000_000 | 4.628 | 4.624 | 4.684 |
| 500_000_000_000 | 24.976 | 23.711 | 23.308 |
| 1_000_000_000_000 | 50.713 | 48.298 | 47.981 |
| 5_000_000_000_000 | 291.095 | 284.336 | 265.746 |
| 10_000_000_000_000 | 621.321 | 618.032 | 575.923 |
| 50_000_000_000_000 | 3028.456 | 3009.485 | 3038.179 |
| 100_000_000_000_000 | 6307.521 | 6371.226 | |

Basically, n = 4 is fastest for N ~< 4e13, n = 6 for ~4e13 < N ~< 1e14, and n = 8 for N >~ 1e14, with n possibly increasing further as N increases. I've now incorporated this switching scenario in the current code:

```nim
let range = Kmax - Kmin + 1  # set number of range resgroups, min of 1
let n = if range < 37_500_000_000_000'u: 4
        elif range < 975_000_000_000_000'u: 6
        else: 8
let B = Bn * 1024 * n        # set seg size to optimize for selected PG
```

[https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e](https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e)

What I really need to do is recalibrate the settings in `selectPG`. Those settings are for the old (pre June 2019) implementation, and need to be fine-tuned for the current implementation. This coarse tweaking shows there are likely more speedups to be had from merely tweaking the PG cross-overs and segment sizes for given range values.

Another likely speedup is for P5, which now uses just 3 threads for its 3 twinpair residues. My i7 has 8 threads, so I can divide the range for each of P5's twinpairs into 2 equal sections and use 6 threads in parallel. This needs only a small code change to achieve.

More extensive speedups are achievable by eliminating all/most/many of the calculations done in `nextp_init`, similar to what `primesieve` does. However, `primesieve` actually runs 3 different algorithms, for `small`, `medium`, and `large` ranges, using (large) precomputed wheel (PG) tables of constants. The `large` algorithm is very fast doing this, at the cost of code size.
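The P5 idea above can be sketched with Nim's `threadpool`. This is a hypothetical illustration, not the actual program's code: `sieveChunk` stands in for `twins_sieve` (here it just returns the chunk's resgroup count so the sketch is runnable), and the splitting logic shows how 3 residues times 2 half-ranges yields 6 parallel tasks.

```nim
# Compile with: nim c --threads:on p5split.nim
import std/threadpool

proc sieveChunk(rhi, kmin, kmax: uint): uint =
  # Stand-in for twins_sieve over resgroups [kmin, kmax] for twinpair
  # residue rhi; returns the chunk size so the sketch is verifiable.
  kmax - kmin + 1

proc p5TwinsSplit(kmin, kmax: uint; residues: openArray[uint]): uint =
  # Split the full resgroup range into two equal halves and spawn one
  # task per (residue, half): 3 residues -> 6 threads instead of 3.
  let mid = kmin + (kmax - kmin) div 2
  var pending: seq[FlowVar[uint]]
  for rhi in residues:
    pending.add(spawn sieveChunk(rhi, kmin, mid))      # first half
    pending.add(spawn sieveChunk(rhi, mid + 1, kmax))  # second half
  for fv in pending:
    result += ^fv  # block until each task finishes, accumulate counts

# Each residue's two halves cover the whole range exactly once:
doAssert p5TwinsSplit(0, 99, [1'u, 2, 3]) == 300
```

In the real sieve the accumulated value would be the twin-prime counts per chunk, and the two halves of each residue are fully independent, so no synchronization beyond the final reduction is needed.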
(`primesieve` is a very large code base, with dozens of files totaling thousands of lines of C++ code.) Some people have noted their appreciation that I can perform so much better in 1 file of ~300 lines of Nim code (minus comments), at least up to around the 1e14+ range. Tests above that take so long to run that I'm resource-limited in fully verifying the optimizations over the full 64-bit range. However, with the small, simple changes presented here, the implementation is now even faster than `primesieve` over the tested ranges.
