http://d.puremagic.com/issues/show_bug.cgi?id=3738

           Summary: MinstdRand0 and MinstdRand very poor performance
           Product: D
           Version: 2.039
          Platform: x86
        OS/Version: Linux
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Phobos
        AssignedTo: nob...@puremagic.com
        ReportedBy: bary...@smp.if.uj.edu.pl


--- Comment #0 from Witold Baryluk <bary...@smp.if.uj.edu.pl> 2010-01-23 
17:52:24 PST ---
MinstdRand0 is instantiation of CongruentialGenerator with constants
  a = 16807
  c = 0
  m = 2^^31 - 1

most important method of CongruetialGenerator popFront, doesn't take into the
account special structure of c,
which makes this generator quite slow especially compared to Mersen Twister
Mt19937.

This is important especially on the 32bit machines, because modulo operation
(%) is done using external (and notinlined) method __ULDIV__ which computes
unsigned long division and reminder. But it is also a problem on 64bit
machines, becuase even in hardware modulo operation is slow.

Currently compiler uses known trick for x % c, when c=2^^k is constant.
It changes it to binary and: x & (c - 1)

This is done to uint directly. For signed x it compiler emits 2 additional
'sub's to compensate for sign. For smaller than int it also works. If c=2^^k is
bigger than x.max it should be noop (and maybe small warning), but it isn't
really important. For ulong on 32bit it even works smarter by just anding with
only upper register (or lower and zeroing upper). This is all done even without
-O switch.

Not many programer know that there is also a trick for constant c = 2^^k - 1.

You can find some information about it (special case c= 2^^32 - 1) in this
article on page 6.

G. Marsaglia, 'On the randomness of Pi and other decimal expansions',
Interstat, October 2005, #5,
http://interstat.statjournals.net/YEAR/2005/articles/0510005.pdf


What is important for us is that currently MinstdRand{,0} is very slow. Much
slower than MersenTwistter,
which is more complex than MinstdRand, which is quite supprising.

I propose changing current popFront of LinearCongruentialEngine:

     void popFront()
     {
         static if (m)
-            _x = cast(UIntType) ((cast(ulong) a * _x + c) % m);
+            static if (is(UIntType == uint) && m == 4294967295uL) {
+                 const ulong x = (cast(ulong) a * _x + c);
+                 const ulong v = x >> 32;
+                 const ulong w = (x & 0xffff_ffffuL);
+                 const uint y = cast(uint)(v+w);
+                 _x = (y < v || y >= 4294967295u) ? (y+1) : y;
+            } else static if (is(UIntType == uint) && m == 2147483647u) {
+                 const ulong x = (cast(ulong) a * _x + c);
+                 const ulong v = x >> 31;
+                 const ulong w = (x & 0x7fff_ffffuL);
+                 const uint y = cast(uint)(v+w);
+                 _x = (y < v || y >= 2147483647u) ? ((y+1) & 0x7fff_ffffu) :
y;
+            } else {
+                _x = cast(UIntType) ((cast(ulong) a * _x + c) % m);
+            }
         else
             _x = a * _x + c;
     }

According to my test first 10_000_000_000 random numbers (few periods) are
exactly the same with and without this change. I provided also 2^^32-1 version,
I tested it in different place, but there are also some generator with such
modulus.

It can by also easly changed to 2^^63-1 and 2^^64-1, but it will need ucent
support,
which we lack currently. But it will allow implementing one very good
generators
(read bellow).

I hope it can be made general and puted directly into backends of DVD and LLVM,
GCC to make it simpler
for programer. Currently non of D or C compiler which I tested uses this trick.

Time mesurments for original code summarizes table.
All test performed under Debian Gnu/Linux, unstable, dmd 2.039, Linux
2.6.33-rc4-sredniczarny-00399-g24bc734-dirty.
Cpu:
model name    : Intel(R) Pentium(R) M processor 1.73GHz
stepping    : 8
cpu MHz        : 1733.000
cache size    : 2048 KB

user root, realtime RR scheduler, priority 50 (this gives deviations smaller
than 1%).

compiled in "-O -inline -release" mode, with all generators in the same file as
main function.

1_000_000_000 sample points.

MinstdRand0, realtime rr 50:
   standard: 44.62 s
   inline: 41.04
   standard with tricks: 17.03 s
   inline with tricks: 17.00 s

MinstdRand, realtime rr 50
  standard: 44.88 s
  inline: 41.06
  standard with tricks: 17.05 s
  inline with tricks: 17.02 s

Mt19937, realtime rr 50:
  standard: 33.93 s



"Inline" is manually inlined version (interesingly it isn't always faster),
probably some problems with registers, because dmd2 isn't inlining popFront and
front functions unfortunetly (even that they are very simple). Tricks is
version with patch above.

As you can see after changes this method is about 2.6 times faster. And it is
faster than Mersen Twister.

This is quite interesting, becuase patched code have 2 branches, 3 binary ands,
and 2 additions. But as maybe few know, DIV/MOD operation is done using quite
big external (notinlined) assembler routing __ULDIV__, which is just slow for
us. So it can be usefull to also create utility function for our special mod
operation.

Unittests still passes.






MinstdRand0 and MinstdRand are pretty good generators, for this speed.
But unfortunetly they fail lots of tests (see TestU01 documentation). Mosly
becuase of small period.

I know that they are in Phobos primarly becuase of simplicity (which maybe my
patch destroys, if so please incorporate it directly into compiler, so also
others can be beneficial of this).


There are only few which are better, so simple and very fast.

One is m=2^^61-1,a=2^^30-2^^19. But unfortunetly it is slow on 32bit machines
(128bit intermidiates). It is fast on 64bit machines (about 10% slower than
MinstdRand0), but have much better statistical properties (see Wu 1997).

One of very interesting and extra simple generator is Marsa-LFIB4 by Marsaglia.
It is just:
   x_i = (x_{i-55} + x_{i-119} + x_{i-179} + x_{i-256}) mod 2^^32
no multiplications and no complicated modulo operations. It is very fast
(especially on 64-bit machines), have period of 2^^287 and passes practically
all tests (actually all). 2 lines of code.

Other is KISS99 also by Marsaglia. it passes also all tests. Period 2^^123.
It is combination of 3 very simple generators. 4 Lines of code.

There also few LaggedFibbonaci generators which are ultra fast. But only on
64bit machines. They passes all tests. One example is LFib(2^^64, 1279, 861, *)
with period 2^^1340.

I also know one fast random generator derived from cryptographic primitieves,
notably from Rijandle/AES cipher. It is known as ISAAC by Jenkins and is fvery
ast, and passes all tests. Code have about 130 lines.

There is also Matlab randn function which is simple (xor combination of simple
LCG with m=2^^32 and simple shifting generator. both by Marsaglia) and have
period 2^^64, is very fast and passes most tests. It is used there for Normal
varietes, but as integer generator it is very good.


All above presented generators are faster and better (according to tests) than
Mt19937.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Reply via email to