I don't actually understand the underlying algorithm, but I at least understand the flow of the program and the structure. The algorithm utilized depends heavily on using shared memory access, which can be done in D, but I definitely wouldn't call it idiomatic. In D, message passing is preferred, but it really can't be well demonstrated on your algorithm without a deeper understanding of the algorithm itself.

A complete working version can be found at: https://gist.github.com/vnayar/79e2d0a9850833b8859dd9f08997b4d7

I modified both versions of the program to utilize the pgParameters13 for more of an apples-to-apples comparison.

The final results are as follows:
$ nim c --cc:gcc --d:release --threads:on twinprimes_ssoz.nim && echo "3000000000" | ./twinprimes_ssoz
Enter integer number: threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 0.000 secs
perform twinprimes ssoz sieve
sieve time = 0.222 secs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/-1
total time = 0.223 secs

$ dub build --compiler=ldc2 -b=release && echo "3000000000" | ./twinprimes
Enter integer number:
threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 1 ms, 864 μs, and 7 hnsecs
perform twinprimes ssoz sieve
sieve time = 196 ms, 566 μs, and 5 hnsecs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/- 1
total time = 198 ms, 431 μs, and 2 hnsecs

My understanding is that the difference in performance is largely due to slightly better optimization from the LLVM based ldc2 compiler, where I believe Nim is using gcc.

Hey Great!  I'll take some time and study your code.

Nim currently allows using gcc or clang: --cc:gcc or --cc:clang, and
compiles to C: nim c --cc:gcc... or C++: nim c++ --cc:gcc...

You can compile in Nim with/out using garbage collection (GC) and choose GC. This version needs GC to recover mem created in each thread, or it will eat it up. See operation of program using htop/top to see threads|mem at work.

If D has a fast bitarray structure, try it to create the data arrays "prms" in proc "sozpg" and "seg" arrays is "twin_sieves". This could allow for larger segment sizes|arrays that fit into cpu cache, reducing the number of segment slices to process. This would all have to be profiled and encapsulated in "selectPG", which adaptively selects to best PG to use.

Here is a table of output data and performance times (in seconds).

The results were produced on a System76 laptop with an Intel I7 6700HQ cpu, 2.6-3.5 GHz clock, with 8 threads, and 16GB of memory. I compiled the file "twinprimes_ssoz.nim" with Nim versions 0.18.0, and the just released 0.19.0, using gcc 8.2.1 with Manjaro (Linux) in Virtual Box. I then ran the executables on my host Linux system (which only has gcc-4.9.8) to use all 8 threads.

The output was produced on a "quiet" system, where I turned off the laptop and rebooted, and ran tests with only a terminal and spreedsheet open (to record
results). The times are the "best" times from multiple runs.

I also compare times from the program "primesieve" - https://primesieve.org/
which states on its site:

------------------------------------------------------------------------------------
primesieve is a free software program and C/C++ library that generates primes using a highly optimized sieve of Eratosthenes implementation. It counts the primes below 10^10 in just 0.45 seconds on an Intel Core i7-6700 CPU (4 x 3.4 GHz). primesieve can generate primes and prime k‑tuplets up to 2^64.
------------------------------------------------------------------------------------

(An optimized C/C++ versions of twinprimes_ssoz would be nice to compare against too.)


Number | Twimprimes | Last Twinprime |twinprime_ssoz.nim, gcc-8.2.1| primesieve | | | | 0.18.0 | 0.19.0 | 9.7.1 |
----------|------------|----------------|--------------|--------------|------------|
1 x 10^09 | 3424506 | 1999999872 | 0.043 | 0.041 | 0.049 |
----------|------------|----------------|--------------|--------------|------------|
5 x 10^09 | 14618166 | 4999999860 | 0.219 | 0.226 | 0.248 |
----------|------------|----------------|--------------|--------------|------------|
1 x 10^10 | 27412679 | 9999999702 | 0.398 | 0.401 | 0.533 |
----------|------------|----------------|--------------|--------------|------------|
5 x 10^10 | 118903682 | 49999999590 | 2.364 | 2.331 | 2.978 |
----------|------------|----------------|--------------|--------------|------------|
1 x 10^11 | 224376048 | 99999999762 | 4.586 | 4.476 | 6.189 |
----------|------------|----------------|--------------|--------------|------------|
5 x 10^11 | 986222314 | 499999999062 | 26.625 | 26.578 | 35.120 |
----------|------------|----------------|--------------|--------------|------------|
1 x 10^12 | 1870585220 | 999999999960 | 57.895 | 58.195 | 73.966 |
----------|------------|----------------|--------------|--------------|------------|
5 x 10^12 | 8312493003 | 4999999999878 | 385.562 | 389.347 | 433.320 |
----------|------------|----------------|--------------|--------------|------------|
1 x 10^13 |15834664872 | 9999999998490 | 855.705 | 857.771 | 934.253 |
----------|------------|----------------|--------------|--------------|------------|
2 x 10^13 |30198862775 | 19999999999830 | 1806.282 | 1818.766 | 1998.341 |
----------|------------|----------------|--------------|--------------|------------|

A typical output of the "twinprimes_ssoz.nim" files looks as below"

$ echo 123_345_789_000 | ./twinprimes_test7yc.0180.gcc821
Enter integer number: threads = 8
each thread segment is [1 x 98304] bytes array // 98,304 segment size (Kn) twinprime candidates = 6099517038; resgroups = 4107419 // max twins; max cols (KB) each 1485 threads has nextp[2 x 30062] array // 30,062, primes <= sqrt(N) setup time = 0.006 secs // time for primes|ssoz setup perform twinprimes ssoz sieve // ssoz starts now sieve time = 5.771 secs // time to do just the ssoz last segment = 76955 resgroups; segment slices = 42 // last seg cols; seg loops total twins = 272018910; last twin = 123345788640+/-1 // twinprimes, val for last total time = 5.778 secs // setup + ssoz time

You can determine the selected PG by the number of threads being used.

I'm writing up the paper, and will probably release the program operation description to provide a detailed understanding of how|why it works, which will allow for modification.

Reply via email to