I don't actually understand the underlying algorithm, but I at
least understand the flow of the program and the structure.
The algorithm utilized depends heavily on using shared memory
access, which can be done in D, but I definitely wouldn't call
it idiomatic. In D, message passing is preferred, but it
really can't be well demonstrated on your algorithm without a
deeper understanding of the algorithm itself.
A complete working version can be found at:
https://gist.github.com/vnayar/79e2d0a9850833b8859dd9f08997b4d7
I modified both versions of the program to utilize the
pgParameters13 for more of an apples-to-apples comparison.
The final results are as follows:
$ nim c --cc:gcc --d:release --threads:on twinprimes_ssoz.nim
&& echo "3000000000" | ./twinprimes_ssoz
Enter integer number: threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 0.000 secs
perform twinprimes ssoz sieve
sieve time = 0.222 secs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/-1
total time = 0.223 secs
$ dub build --compiler=ldc2 -b=release && echo "3000000000" |
./twinprimes
Enter integer number:
threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 1 ms, 864 μs, and 7 hnsecs
perform twinprimes ssoz sieve
sieve time = 196 ms, 566 μs, and 5 hnsecs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/- 1
total time = 198 ms, 431 μs, and 2 hnsecs
My understanding is that the difference in performance is
largely due to slightly better optimization from the LLVM based
ldc2 compiler, where I believe Nim is using gcc.
Hey Great! I'll take some time and study your code.
Nim currently allows using gcc or clang: --cc:gcc or --cc:clang,
and
compiles to C: nim c --cc:gcc... or C++: nim c++ --cc:gcc...
You can compile in Nim with/out using garbage collection (GC) and
choose GC.
This version needs GC to recover mem created in each thread, or
it will eat
it up. See operation of program using htop/top to see threads|mem
at work.
If D has a fast bitarray structure, try it to create the data
arrays
"prms" in proc "sozpg" and "seg" arrays is "twin_sieves". This
could allow
for larger segment sizes|arrays that fit into cpu cache, reducing
the number
of segment slices to process. This would all have to be profiled
and
encapsulated in "selectPG", which adaptively selects to best PG
to use.
Here is a table of output data and performance times (in seconds).
The results were produced on a System76 laptop with an Intel I7
6700HQ cpu,
2.6-3.5 GHz clock, with 8 threads, and 16GB of memory. I compiled
the file
"twinprimes_ssoz.nim" with Nim versions 0.18.0, and the just
released 0.19.0,
using gcc 8.2.1 with Manjaro (Linux) in Virtual Box. I then ran
the executables
on my host Linux system (which only has gcc-4.9.8) to use all 8
threads.
The output was produced on a "quiet" system, where I turned off
the laptop and
rebooted, and ran tests with only a terminal and spreedsheet open
(to record
results). The times are the "best" times from multiple runs.
I also compare times from the program "primesieve" -
https://primesieve.org/
which states on its site:
------------------------------------------------------------------------------------
primesieve is a free software program and C/C++ library that
generates primes using a highly optimized sieve of Eratosthenes
implementation. It counts the primes below 10^10 in just 0.45
seconds on an Intel Core i7-6700 CPU (4 x 3.4 GHz). primesieve
can generate primes and prime k‑tuplets up to 2^64.
------------------------------------------------------------------------------------
(An optimized C/C++ versions of twinprimes_ssoz would be nice to
compare against too.)
Number | Twimprimes | Last Twinprime |twinprime_ssoz.nim,
gcc-8.2.1| primesieve |
| | | 0.18.0 |
0.19.0 | 9.7.1 |
----------|------------|----------------|--------------|--------------|------------|
1 x 10^09 | 3424506 | 1999999872 | 0.043 |
0.041 | 0.049 |
----------|------------|----------------|--------------|--------------|------------|
5 x 10^09 | 14618166 | 4999999860 | 0.219 |
0.226 | 0.248 |
----------|------------|----------------|--------------|--------------|------------|
1 x 10^10 | 27412679 | 9999999702 | 0.398 |
0.401 | 0.533 |
----------|------------|----------------|--------------|--------------|------------|
5 x 10^10 | 118903682 | 49999999590 | 2.364 |
2.331 | 2.978 |
----------|------------|----------------|--------------|--------------|------------|
1 x 10^11 | 224376048 | 99999999762 | 4.586 |
4.476 | 6.189 |
----------|------------|----------------|--------------|--------------|------------|
5 x 10^11 | 986222314 | 499999999062 | 26.625 |
26.578 | 35.120 |
----------|------------|----------------|--------------|--------------|------------|
1 x 10^12 | 1870585220 | 999999999960 | 57.895 |
58.195 | 73.966 |
----------|------------|----------------|--------------|--------------|------------|
5 x 10^12 | 8312493003 | 4999999999878 | 385.562 |
389.347 | 433.320 |
----------|------------|----------------|--------------|--------------|------------|
1 x 10^13 |15834664872 | 9999999998490 | 855.705 |
857.771 | 934.253 |
----------|------------|----------------|--------------|--------------|------------|
2 x 10^13 |30198862775 | 19999999999830 | 1806.282 |
1818.766 | 1998.341 |
----------|------------|----------------|--------------|--------------|------------|
A typical output of the "twinprimes_ssoz.nim" files looks as
below"
$ echo 123_345_789_000 | ./twinprimes_test7yc.0180.gcc821
Enter integer number: threads = 8
each thread segment is [1 x 98304] bytes array // 98,304
segment size (Kn)
twinprime candidates = 6099517038; resgroups = 4107419 // max
twins; max cols (KB)
each 1485 threads has nextp[2 x 30062] array //
30,062, primes <= sqrt(N)
setup time = 0.006 secs // time
for primes|ssoz setup
perform twinprimes ssoz sieve // ssoz
starts now
sieve time = 5.771 secs // time
to do just the ssoz
last segment = 76955 resgroups; segment slices = 42 // last
seg cols; seg loops
total twins = 272018910; last twin = 123345788640+/-1 //
twinprimes, val for last
total time = 5.778 secs // setup
+ ssoz time
You can determine the selected PG by the number of threads being
used.
I'm writing up the paper, and will probably release the program
operation description
to provide a detailed understanding of how|why it works, which
will allow for modification.