[jira] [Commented] (RNG-171) Reduce memory footprint of cached int and boolean source

Alex Herbert (Jira) Sun, 20 Mar 2022 00:21:06 -0700


    [ 
https://issues.apache.org/jira/browse/RNG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509387#comment-17509387
 ]


Alex Herbert commented on RNG-171:
----------------------------------

{quote}Break functional compatibility
{quote}
I agree we have done this before, for example when nextDouble was changed to 
output 2^53 dyadic rationals instead of 2^52 dyadic rationals. But I also do 
think it is documented. Perhaps a note in the user guide, release notes header, 
and package javadoc that the output of nextLong or nextInt (respectively) 
should match the reference implementation for the algorithm but the methods for 
derived types are not subject to compatibility restraints. Placing the same 
note in many places will make sure it has visibility.

Here are some methods I tried for nextBoolean with little-endian:

Method 6: Detect bits with a mask
{code:java}
@Override
public boolean nextBoolean() {
    int l = booleanSource;
    if (l == 1) {
        // Refill
        l = rng.next();
        // Store high 31 bits and a refill flag, return lowest bit
        booleanSource = (l >>> 1) | Integer.MIN_VALUE;
        return (l & 0x1) == 1;
    }
    // Shift down eventually resetting, return current lowest bit
    booleanSource = l >>> 1;
    return (l & 0x1) == 1;
}
{code}
Method 7: Detect bits with a sign test
{code:java}
@Override
public boolean nextBoolean() {
    int l = booleanSource;
    if (l == 1) {
        // Refill
        l = rng.next();
        // Store high 31 bits and a refill flag, return lowest bit
        booleanSource = (l >>> 1) | Integer.MIN_VALUE;
        return (l << 31) < 0;
    }
    // Shift down eventually resetting, return current lowest bit
    booleanSource = l >>> 1;
    return (l << 31) < 0;
}
{code}
h2. Results

I have included the errors this time.
h3. MacBook

This has the current caching method as method 0.
| | | |8|8|11|11|17|17| |
|Provider|Method|randomSourceName|Score|Error|Score|Error|Score|Error|Total|
|int|-1|KISS|6358.2|46.6|6331.3|78.9|6307.5|32.7|18997.0|
|int|-1|MWC_256|4597.2|38.1|4530.3|40.8|5002.6|45.1|14130.1|
|int|0|KISS|3092.2|13.4|3158.3|32.6|3517.7|85.3|9768.3|
|int|0|MWC_256|3098.9|32.0|3145.5|17.3|3517.8|18.0|9762.2|
|int|1|KISS|3270.8|46.3|3297.5|40.6|3507.0|15.2|10075.3|
|int|1|MWC_256|3352.0|35.6|3377.5|17.9|3854.3|14.2|10583.7|
|int|5|KISS|2917.0|12.9|3270.7|39.4|2935.1|17.9|9122.8|
|int|5|MWC_256|2894.3|33.5|2756.4|17.7|3192.6|14.6|8843.3|
|int|6|KISS|2698.7|185.9|2716.7|168.9|3480.8|15.6|8896.3|
|int|6|MWC_256|2654.4|34.6|2661.9|32.1|3503.8|34.5|8820.2|
|int|7|KISS|2936.6|15.2|2937.8|45.4|2969.2|116.1|8843.5|
|int|7|MWC_256|4163.1|21.7|4168.2|26.4|2919.6|14.8|11250.8|
|long|-1|SPLIT_MIX_64|4684.8|18.4|4704.2|62.7|4873.8|16.0|14262.8|
|long|-1|XOR_SHIFT_1024_S|5019.9|33.6|5021.0|64.7|5033.9|29.9|15074.8|
|long|0|SPLIT_MIX_64|3068.0|15.7|3315.0|22.1|3090.4|42.5|9473.4|
|long|0|XOR_SHIFT_1024_S|3130.7|45.6|3640.7|41.7|3097.0|65.4|9868.4|
|long|1|XOR_SHIFT_1024_S|3052.8|27.8|3145.4|315.5|3044.9|33.6|9243.0|
|long|1|SPLIT_MIX_64|3066.1|57.8|3012.5|115.6|3024.7|11.1|9103.3|
|long|5|SPLIT_MIX_64|3027.4|16.2|2955.7|30.4|3010.4|33.6|8993.5|
|long|5|XOR_SHIFT_1024_S|3047.6|47.5|2977.7|29.6|3009.1|16.6|9034.3|
|long|6|SPLIT_MIX_64|3042.1|24.2|3076.9|354.1|3009.6|36.8|9128.7|
|long|6|XOR_SHIFT_1024_S|3052.8|19.0|3061.5|38.9|3008.7|23.6|9123.1|
|long|7|SPLIT_MIX_64|3036.8|20.6|3034.7|14.6|3012.9|16.9|9084.4|
|long|7|XOR_SHIFT_1024_S|3046.9|17.6|3048.6|16.0|3023.5|18.0|9118.9|
 * Method 5 as before is faster than the current caching implementation
 * Method 6 is just as fast for the int provider although does have a higher 
error. It is a bit slower for the long provider, but still comparable to method 
1.
 * Method 7 is similar to method 6 except for MWC_256 where it is slower on JDK 
8 and 11.
 * A slowdown for int providers for JDK 11 to 17 is observed for methods 0, 1, 
6. Method 5 has timing jumps too: for MWC_256 it is a slowdown; for KISS the 
speed on JDK 11 is lower and JDK 8 and 17 around the same.

h3. Workstation

This does not have the caching method. So -1 vs 0 is a comparison of direct 
output of booleans from the Wrapper (-1) vs from the RNG with no class wrapping 
around it (0).
| | | |8|8|11|11|17|17| |
|Provider|Method|randomSourceName|Score|Error|Score|Error|Score|Error|Total|
|int|-1|KISS|5988.9|81.0|5938.3|69.5|6112.8|24.0|18040.1|
|int|-1|MWC_256|4653.5|14.6|4618.9|13.3|4661.2|12.6|13933.6|
|int|0|KISS|4703.8|21.7|4902.1|9.5|5019.9|61.3|14625.8|
|int|0|MWC_256|4105.3|57.8|3884.7|39.6|3888.9|7.0|11878.9|
|int|1|KISS|2908.0|11.6|2897.9|8.4|3025.5|78.0|8831.4|
|int|1|MWC_256|2914.3|11.4|2893.0|17.3|3016.4|109.1|8823.7|
|int|5|KISS|2771.0|9.8|2779.6|20.6|2776.9|4.6|8327.6|
|int|5|MWC_256|2984.6|75.0|3238.6|7.0|2772.0|10.5|8995.2|
|int|6|KISS|2720.8|37.9|2555.1|5.7|2539.5|12.4|7815.4|
|int|6|MWC_256|2501.3|37.9|2525.3|17.9|2505.7|5.2|7532.2|
|int|7|KISS|2812.5|42.0|2799.8|6.6|2821.6|10.4|8433.9|
|int|7|MWC_256|2992.5|7.3|2985.5|5.9|2761.1|5.8|8739.0|
|long|-1|SPLIT_MIX_64|4156.6|11.5|4196.0|21.4|4259.8|15.9|12612.4|
|long|-1|XOR_SHIFT_1024_S|5207.6|126.4|5341.9|88.2|5288.2|9.6|15837.8|
|long|0|SPLIT_MIX_64|3630.3|10.6|3661.4|8.8|3673.4|4.2|10965.0|
|long|0|XOR_SHIFT_1024_S|4566.9|21.5|4501.2|6.7|4624.7|5.4|13692.7|
|long|1|SPLIT_MIX_64|2980.7|8.5|3106.9|23.1|3186.1|3.6|9273.7|
|long|1|XOR_SHIFT_1024_S|2981.5|17.5|2985.2|5.4|3066.3|47.3|9032.9|
|long|5|SPLIT_MIX_64|3083.3|9.6|2822.9|9.2|2850.5|33.8|8756.7|
|long|5|XOR_SHIFT_1024_S|3088.3|5.9|3089.8|13.7|2894.0|85.7|9072.1|
|long|6|SPLIT_MIX_64|2837.2|13.4|2837.2|5.3|2741.1|14.1|8415.4|
|long|6|XOR_SHIFT_1024_S|2834.0|20.5|2841.3|21.7|2748.5|11.6|8423.9|
|long|7|SPLIT_MIX_64|2834.7|7.3|2830.1|5.4|2854.6|79.3|8519.4|
|long|7|XOR_SHIFT_1024_S|2816.6|9.6|2806.0|11.3|2822.9|5.0|8445.4|
 * Method 6 is faster than the other methods for the int provider, and equal 
fastest (with method 7) for the long provider
 * Method 7 is around the speed of method 6 for long provider, and around the 
speed of method 5 for the int provider

Note: Here the timings are stable across JDKs. So timing stability is only an 
issue on the laptop.
h2. Conclusion

These results show that the sign test is not required for the speed-up. The 
speed advantage of the new methods may be gained from having to read 1 class 
variable per cycle rather than 2.

If we drop functional compatibility restraints then the int method can be 
chosen as method 7 which was the most stable of the fast methods. This would 
change to a little endian output from the long. The boolean method can be 
method 6 which is also little endian output. This appears to be the same speed 
as method 5 (using the sign test in big endian order) except for one noteable 
exception on the workstation for long providers where it is clearly the fastest 
method.

I will make the change to the current core implementation and run the 
generation performance benchmark for all providers for boolean and int (long 
providers). I will add a benchmark for the classic implementation for 
comparison:
{code:java}
public boolean nextBoolean() {
    return nextInt() < 0;  // or nextLong()
}
public boolean nextInt() {
    return (int) (nextLong() >>> 32);
}
{code}
This will show the effect of the cache on the performance for any provider in 
the library.

> Reduce memory footprint of cached int and boolean source
> --------------------------------------------------------
>
>                 Key: RNG-171
>                 URL: https://issues.apache.org/jira/browse/RNG-171
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.4
>            Reporter: Alex Herbert
>            Priority: Major
>             Fix For: 1.5
>
>
> The base implementation for the IntProvider and LongProvider caches values 
> from the generation method to be used as the source for boolean bits, and in 
> the case of the LongProvider, for int bits. The nextBoolean and nextInt 
> methods thus use all the bits output by the generator. The memory footprint 
> of this cache can be reduced.
> h3. Boolean
> IntProvider
> Currently a 32-bit value and a 32-bit mask is stored.
> However when the cache is refilled a single bit is used for the boolean 
> return value. The cache need only store the unused 31-bits. This leaves room 
> to use a bit to indicate when to reset the cache. A similar change can be 
> made for the LongProvider by only storing 63-bit unused bits.
> In this example for a 64-bit cache the booleanSource is initialised as 
> MIN_VALUE (a single bit in the most significant position).
> {code:java}
> @Override
> public boolean nextBoolean() {
>     long l = booleanSource;
>     if (l == Long.MIN_VALUE) {
>         // Refill
>         l = next();
>         // Store unused 63 bits and a refill flag, return highest bit
>         booleanSource = (l << 1) | 1;
>         return l < 0;
>     }
>     // Shift up eventually resetting, return current high bit
>     booleanSource = l << 1;
>     return l < 0;
> }
> {code}
> h3. Int
> LongProvider only
> Stores a 64-bit value and a boolean flag. However only 32-bits of the stored 
> value are used. Thus the remaining 32-bits can be used as a flag. In this 
> example the sign bit is used as it can be easily checked with a < operator.
> {code:java}
> @Override
> public int nextInt() {
>     long l = intSource;
>     if (l < 0) {
>         // Refill
>         l =next();
>         // Store low 32 bits, return high 32 bits
>         intSource = l & 0xffff_ffffL;
>         // OR
>         // Store low 63 bits, return high 32 bits
>         //intSource = (l << 1) >>> 1;
>         return (int) (l >>> 32);
>     }
>     // Reset and return previous low bits
>     intSource = -1;
>     return (int) l;
> }{code}
> The example above maintains compatibility to return the upper then lower 
> parts of the long for the 2 int values. An alternative with fewer operations 
> is:
> {code:java}
> @Override
> public int nextInt() {
>     long l = intSource;
>     if (l < 0) {
>         // Refill
>         l = next();
>         // FUNCTIONALLY BREAKING CHANGE
>         // Store high 32 bits, return low 32 bits
>         intSource = (l >>> 32);
>         return (int) l;
>     }
>     // Reset and return previous low bits
>     intSource = -1;
>     return (int) l;
> }
> {code}
> This is arguably simpler but would be a functionally breaking change.
> Memory footprint would change:
> {noformat}
> IntProvider
> 8 bytes (int,int)                 -> 4 bytes (int)
> LongProvider
> 25 bytes (long,long,long,boolean) -> 16 bytes (long,long)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Commented] (RNG-171) Reduce memory footprint of cached int and boolean source

Reply via email to