[ https://issues.apache.org/jira/browse/RNG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509047#comment-17509047 ]

Alex Herbert commented on RNG-171:
----------------------------------

{quote}they trade code clarity for a modest memory gain
{quote}
I do not think the new methods are any more or less clear than the current
ones. The cached boolean implementation already in the codebase still requires
an understanding of masking and bit shifts.

The new boolean method uses half the memory and is faster. IMO this is a valid 
gain.

With regard to the int cache method, the current implementation stores 32 bits
for reuse in a 64-bit long. This is wasteful. The same method could be
implemented using an int and the boolean flag with basically the same code.
I have not benchmarked that, and cannot recall ever considering it even though
it is obvious now. Essentially all that needs to be cached for the next round
is 33 bits: 32 bits for the value and a flag to say whether the value is
present.
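A minimal sketch of that alternative, assuming a hypothetical cached int field
and flag (the names are illustrative, not the current LongProvider code):
{code:java}
// Hypothetical alternative: cache the unused 32 bits in an int plus a boolean flag.
private int cachedInt;
private boolean cachedIntPresent;

@Override
public int nextInt() {
    if (cachedIntPresent) {
        // Use the previously stored half of the last generated long
        cachedIntPresent = false;
        return cachedInt;
    }
    final long l = next();
    // Store the low 32 bits for the next call, return the high 32 bits
    cachedInt = (int) l;
    cachedIntPresent = true;
    return (int) (l >>> 32);
}
{code}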

However, as I understand it, using a boolean flag means that the offsets of
fields in child classes will not naturally be memory aligned. It would be up
to the JVM to decide whether to align the next field to a 4/8 byte boundary.
This affects all the state fields of an RNG based on the abstract class.
Ideally these should be memory aligned for fastest access. If the JVM does
align them then you gain no memory advantage from using a boolean over the
current implementation, which effectively uses 32 bits for a single-bit flag.

If the method is changed to use a long to store the 32 bits and a flag then
the fields will remain aligned. This would require a benchmark to verify
whether performance is affected for any child classes.
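Field offsets and any padding added by the JVM can be checked with the JOL
tool (an external dependency, not part of this change); a minimal sketch,
using a JDK class as a stand-in for a provider subclass:
{code:java}
// Requires org.openjdk.jol:jol-core on the classpath.
import org.openjdk.jol.info.ClassLayout;

public class LayoutCheck {
    public static void main(String[] args) {
        // Substitute the provider subclass of interest for the stand-in class here
        System.out.println(ClassLayout.parseClass(java.util.SplittableRandom.class).toPrintable());
    }
}
{code}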

Here are some more results on two platforms with JDK 8, 11, 17 for the int 
generation.

MacBook
|CachedGenerationPerformance| |8|11|17|
|cacheMethod|randomSourceName|Score|Score|Score|
|-1|SPLIT_MIX_64|3858.64|7185.44|4148.66|
|0|SPLIT_MIX_64|3122.25|3482.81|3908.80|
|4|SPLIT_MIX_64|3289.90|3411.38|4235.49|
|5|SPLIT_MIX_64|3418.00|3444.90|4087.53|
|6|SPLIT_MIX_64|3483.51|3531.09|4250.82|
|7|SPLIT_MIX_64|3332.12|3339.13|4063.81|
|8|SPLIT_MIX_64|3402.04|3526.50|4069.37|
|-1|XOR_SHIFT_1024_S|4622.16|4623.75|4787.81|
|0|XOR_SHIFT_1024_S|3592.54|3481.52|3932.36|
|4|XOR_SHIFT_1024_S|3925.31|4509.47|4068.30|
|5|XOR_SHIFT_1024_S|3879.26|3951.73|4226.37|
|6|XOR_SHIFT_1024_S|3872.55|3873.99|4252.48|
|7|XOR_SHIFT_1024_S|3802.10|3776.63|3920.08|
|8|XOR_SHIFT_1024_S|3888.22|4001.32|4217.29|

Workstation desktop
|CachedGenerationPerformance| |8|11|17|
|cacheMethod|randomSourceName|Score|Score|Score|
|-1|SPLIT_MIX_64|4020.14|4012.83|4016.14|
|0|SPLIT_MIX_64|3478.30|3481.57|3491.51|
|4|SPLIT_MIX_64|3333.82|3360.07|3419.41|
|5|SPLIT_MIX_64|3355.61|3358.00|3386.43|
|6|SPLIT_MIX_64|3390.77|3484.99|3484.83|
|7|SPLIT_MIX_64|3159.29|3277.54|3247.06|
|8|SPLIT_MIX_64|3343.22|3356.03|3395.30|
|-1|XOR_SHIFT_1024_S|4742.75|4753.28|5044.07|
|0|XOR_SHIFT_1024_S|4341.41|4550.42|4413.60|
|4|XOR_SHIFT_1024_S|3846.62|3860.63|3931.35|
|5|XOR_SHIFT_1024_S|3866.00|3901.33|3869.51|
|6|XOR_SHIFT_1024_S|3869.54|3915.91|3930.25|
|7|XOR_SHIFT_1024_S|3779.50|3814.34|3868.77|
|8|XOR_SHIFT_1024_S|3861.67|3888.29|3865.72|
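For reference, a stripped-down skeleton of this kind of JMH throughput
benchmark (illustrative only; this is not the actual CachedGenerationPerformance
class, and the parameter values and generator are stand-ins):
{code:java}
import java.util.SplittableRandom;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
public class IntCacheBenchmarkSketch {
    /** Hypothetical selector for the caching variant under test. */
    @Param({"-1", "0", "5", "7"})
    private int cacheMethod;

    /** Stand-in generator; the real benchmark parameterises the random source. */
    private SplittableRandom rng;

    @Setup
    public void setup() {
        rng = new SplittableRandom();
    }

    @Benchmark
    public int nextInt() {
        // A real benchmark would dispatch to the nextInt variant selected by cacheMethod
        return rng.nextInt();
    }
}
{code}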

Note:

I upgraded JMH from 1.16 to 1.34. Since updating I have not observed the
strange warm-up cost for method 5 in this round of tests. It may have been due
to JMH, or it may simply have been an artefact of benchmarking on my laptop
originally (other processes running).


On the workstation, method 0 is without the current caching implementation. It
is fast for SPLIT_MIX but not for XOR_SHIFT. There is no 1.2-SNAPSHOT in the
local Maven repository so I do not know how the class was included. I
extracted the LongProvider class from the shaded jar file and decompiled it:
it does not have a caching implementation. Doing the same on the MacBook
confirms that platform does have the caching implementation. So the results
from the workstation show that caching is not a performance gain for the very
fast generator. It is purely for using the entire 64-bit long output from the
RNG for the 32-bit int output.
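For comparison, a non-caching nextInt for a 64-bit source simply discards half
of each generated long; a minimal sketch (whether the high or low half is used
is an assumption here, not taken from the benchmarked code):
{code:java}
@Override
public int nextInt() {
    // No cache: use the high 32 bits and discard the rest
    return (int) (next() >>> 32);
}
{code}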

The most consistently fast method is 7, although the differences are marginal.
This is the cleanest code IMO: it has the fewest operations on the generated
long value. But it would switch the output to little-endian order (low, high)
rather than (high, low). Otherwise all methods appear to be the same speed. So
the simplest to understand without breaking functional compatibility would be
to mask the stored value to the low 32 bits (method 5). The upper bits act as
a state flag: either they are empty or they are not.
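To make the output-order difference concrete, a small standalone sketch (class
and values are illustrative only) printing the int pair produced from one
64-bit value in each order:
{code:java}
public class OutputOrder {
    public static void main(String[] args) {
        final long l = 0x0123456789abcdefL;
        // Current compatible order, high then low (as with masking to the low 32 bits)
        System.out.printf("(high, low): %08x %08x%n", (int) (l >>> 32), (int) l);
        // Reversed order, low then high (as with the fewest-operations variant)
        System.out.printf("(low, high): %08x %08x%n", (int) l, (int) (l >>> 32));
    }
}
{code}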

I would prefer to make the change to the code, especially for the boolean
cache, which is a performance gain.


> Reduce memory footprint of cached int and boolean source
> --------------------------------------------------------
>
>                 Key: RNG-171
>                 URL: https://issues.apache.org/jira/browse/RNG-171
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.4
>            Reporter: Alex Herbert
>            Priority: Major
>             Fix For: 1.5
>
>
> The base implementation for the IntProvider and LongProvider caches values 
> from the generation method to be used as the source for boolean bits, and in 
> the case of the LongProvider, for int bits. The nextBoolean and nextInt 
> methods thus use all the bits output by the generator. The memory footprint 
> of this cache can be reduced.
> h3. Boolean
> IntProvider
> Currently a 32-bit value and a 32-bit mask are stored.
> However, when the cache is refilled a single bit is used for the boolean
> return value. The cache need only store the unused 31 bits. This leaves room
> to use a bit to indicate when to reset the cache. A similar change can be
> made for the LongProvider by storing only the 63 unused bits.
> In this example for a 64-bit cache the booleanSource is initialised to
> Long.MIN_VALUE (a single bit in the most significant position).
> {code:java}
> @Override
> public boolean nextBoolean() {
>     long l = booleanSource;
>     if (l == Long.MIN_VALUE) {
>         // Refill
>         l = next();
>         // Store unused 63 bits and a refill flag, return highest bit
>         booleanSource = (l << 1) | 1;
>         return l < 0;
>     }
>     // Shift up eventually resetting, return current high bit
>     booleanSource = l << 1;
>     return l < 0;
> }
> {code}
> h3. Int
> LongProvider only
> Stores a 64-bit value and a boolean flag. However only 32 bits of the stored
> value are used. Thus the remaining 32 bits can be used as a flag. In this
> example the sign bit is used as it can easily be checked with a < operator.
> {code:java}
> @Override
> public int nextInt() {
>     long l = intSource;
>     if (l < 0) {
>         // Refill
>         l = next();
>         // Store low 32 bits, return high 32 bits
>         intSource = l & 0xffff_ffffL;
>         // OR
>         // Store low 63 bits, return high 32 bits
>         //intSource = (l << 1) >>> 1;
>         return (int) (l >>> 32);
>     }
>     // Reset and return previous low bits
>     intSource = -1;
>     return (int) l;
> }
> {code}
> The example above maintains compatibility by returning the upper then the
> lower part of the long for the two int values. An alternative with fewer
> operations is:
> {code:java}
> @Override
> public int nextInt() {
>     long l = intSource;
>     if (l < 0) {
>         // Refill
>         l = next();
>         // FUNCTIONALLY BREAKING CHANGE
>         // Store high 32 bits, return low 32 bits
>         intSource = (l >>> 32);
>         return (int) l;
>     }
>     // Reset and return the previously stored high bits
>     intSource = -1;
>     return (int) l;
> }
> {code}
> This is arguably simpler but would be a functionally breaking change.
> Memory footprint would change:
> {noformat}
> IntProvider
> 8 bytes (int,int)                 -> 4 bytes (int)
> LongProvider
> 25 bytes (long,long,long,boolean) -> 16 bytes (long,long)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
