A quick update. I did some simple, coarse optimizations that give appreciable speed increases. But first, some Nim issues to be aware of and correct.
**Errata and Nim bugs**

1) In proc `twins_sieve` (line 254), the `seg` array is initialized as

```nim
var seg = newSeq[uint8](KB shr 3)
```

This can cause a runtime error: the buffer can be one byte short, or effectively become `var seg = newSeq[uint8](0)` for small `KB`. Compiling with `-d:danger` produces no runtime errors and everything still works, but I don't know why (the D version produces runtime errors where expected; link below).

[https://gist.github.com/jzakiya/ae93bfa03dbc8b25ccc7f97ff8ad0f61](https://gist.github.com/jzakiya/ae93bfa03dbc8b25ccc7f97ff8ad0f61)

I've now coded it correctly (as originally done), so the buffer is always at least 1 byte long and has the correct length for all `KB` values:

```nim
var seg = newSeq[uint8]((KB shr 3) + 1)
```

2) Starting with Nim 0.19.6 through the present (0.20.0), the program sometimes hangs (requiring a hard abort) for some small values (< 1e10). At first I thought it had something to do with the progress indicator on line 341 in the main proc `twinprimes_ssoz`, but it still hangs even with that line commented out, so it must be a pure threading issue. Something must have changed since 0.19.4.

**Performance Increases**

As stated previously, a simple, straightforward way (no big code changes needed) to increase speed is to better fine-tune the segment sizes. Line 309

```nim
let B = Bn * 1024 * 8  # set seg size to optimize for selected PG
```

provides a very coarse way to experiment with segment sizes. Below are updated results of changing the factor `n` in `Bn * 1024 * n`. I found that using the smaller values 4 and 6 for `n` gives appreciably better times for lower values of N than using 8 as a constant factor. Some results below.
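The buffer sizing issue in 1) is a plain truncation error from integer shift. A minimal sketch of the two sizing expressions, pulled out into standalone procs (these names are illustrative, not from the actual program):

```nim
# KB is the number of resgroups per segment; each byte packs 8 resgroup flags,
# so the byte count must round UP, never down.

proc segBytesBuggy(KB: uint): uint =
  # Truncating shift: one byte short whenever KB is not a multiple of 8,
  # and a zero-length buffer when KB < 8.
  KB shr 3

proc segBytesFixed(KB: uint): uint =
  # The corrected sizing: always at least 1 byte, and enough bytes
  # to hold KB flags for every KB value.
  (KB shr 3) + 1

doAssert segBytesBuggy(7) == 0    # zero-length seq: any write is out of bounds
doAssert segBytesBuggy(9) == 1    # 9 flags need 2 bytes, got 1
doAssert segBytesFixed(7) == 1
doAssert segBytesFixed(9) == 2
```

With `-d:danger` Nim drops bounds checking, which is presumably why the out-of-bounds writes go unreported there while the equivalent D code traps them.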
| N | n = 8 | n = 6 | n = 4 |
|---------------------|----------|----------|----------|
| 100_000_000_000 | 4.628 | 4.624 | 4.684 |
| 500_000_000_000 | 24.976 | 23.711 | 23.308 |
| 1_000_000_000_000 | 50.713 | 48.298 | 47.981 |
| 5_000_000_000_000 | 291.095 | 284.336 | 265.746 |
| 10_000_000_000_000 | 621.321 | 618.032 | 575.923 |
| 50_000_000_000_000 | 3028.456 | 3009.485 | 3038.179 |
| 100_000_000_000_000 | 6307.521 | 6371.226 | |

Basically, n = 4 is fastest for N ~< 4e13, n = 6 for ~4e13 < N ~< 1e14, and n = 8 for N >~ 1e14, with n possibly increasing further as N increases. I've now incorporated this switching scenario in the current code:

```nim
let range = Kmax - Kmin + 1  # set number of range resgroups, min of 1
let n = if range < 37_500_000_000_000'u: 4
        elif range < 975_000_000_000_000'u: 6
        else: 8
let B = Bn * 1024 * n        # set seg size to optimize for selected PG
```

[https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e](https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e)

What I really need to do is recalibrate the settings in `selectPG`. Those settings are for the old (pre June 2019) implementation, and need to be fine-tuned for the current implementation. This coarse tweaking shows there are likely more speedups to be had from merely tweaking the PG cross-overs and segment sizes for given range values.

Another likely speedup is for P5, which now uses just 3 threads for its 3 twinpair residues. My i7 has 8 threads, so I can divide the range for each of P5's twinpairs into 2 equal sections and use 6 threads in parallel. This needs only a small code change to achieve.

More extensive speedups are achievable by eliminating all/most/many of the calculations done in `nextp_init`, similar to what `primesieve` does. However, `primesieve` actually runs 3 different algorithms, for `small`, `medium`, and `large` ranges, using (large) precomputed wheel (PG) tables of constants. The `large` algorithm is very fast doing this, at the cost of code size.
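The P5 idea above can be sketched with Nim's `threadpool`. This is a hypothetical illustration, not the actual program's code: `sieveChunk` stands in for `twins_sieve` (here it just returns the chunk's resgroup count so the sketch is runnable), and the splitting logic shows how 3 residues times 2 half-ranges yields 6 parallel tasks.

```nim
# Compile with: nim c --threads:on p5split.nim
import std/threadpool

proc sieveChunk(rhi, kmin, kmax: uint): uint =
  # Stand-in for twins_sieve over resgroups [kmin, kmax] for twinpair
  # residue rhi; returns the chunk size so the sketch is verifiable.
  kmax - kmin + 1

proc p5TwinsSplit(kmin, kmax: uint; residues: openArray[uint]): uint =
  # Split the full resgroup range into two equal halves and spawn one
  # task per (residue, half): 3 residues -> 6 threads instead of 3.
  let mid = kmin + (kmax - kmin) div 2
  var pending: seq[FlowVar[uint]]
  for rhi in residues:
    pending.add(spawn sieveChunk(rhi, kmin, mid))      # first half
    pending.add(spawn sieveChunk(rhi, mid + 1, kmax))  # second half
  for fv in pending:
    result += ^fv  # block until each task finishes, accumulate counts

# Each residue's two halves cover the whole range exactly once:
doAssert p5TwinsSplit(0, 99, [1'u, 2, 3]) == 300
```

In the real sieve the accumulated value would be the twin-prime counts per chunk, and the two halves of each residue are fully independent, so no synchronization beyond the final reduction is needed.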
(`primesieve` is a very large code base, with dozens of files totaling thousands of lines of C++ code.) Some people have noted their appreciation that I can perform so much better in 1 file of ~300 lines of Nim code (minus comments), at least up to around the 1e14+ range. Tests above that take so long to run that I'm resource-limited in fully verifying the optimizations over the full 64-bit range. However, with the small, simple changes presented here, the implementation is now even faster than `primesieve` over the tested ranges.
