[
https://issues.apache.org/jira/browse/RNG-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624863#comment-16624863
]
Alex D Herbert commented on RNG-54:
-----------------------------------
The suggested class in Commons Math is close to the {{NextBytesAndHexString}}
in the results table above, i.e. encoding bytes as a hex string.
But the way it generates the hex characters is non optimal.
This:
{code:java}
final Integer c = Integer.valueOf(randomBytes[i]);
// Add 128 to byte value to make interval 0-255 before
// conversion to hex.
// This guarantees <= 2 hex digits from "toHexString".
// "toHexString" would otherwise add 2^32 to negative arguments.
String hex = Integer.toHexString(c.intValue() + 128);
// Make sure we add 2 hex digits for each byte.
if (hex.length() == 1) {
hex = "0" + hex;
}
outBuffer.append(hex);
{code}
Should really be:
{code:java}
// Hex table
private static final char[] TABLE16 = {
'0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
'A', 'B', 'C', 'D', 'E', 'F'
};
// Write chars directly (no StringBuilder) as the size is fixed.
// Use the upper and lower 4 bits to generate 2 hex characters.
outBuffer[count++] = TABLE16[ (randomBytes[i] >>> 4) & 0x0F ];
outBuffer[count++] = TABLE16[ randomBytes[i] & 0x0F ];
{code}
Using {{Integer.valueOf}} and {{Integer.toHexString}} is slower. Also note
there is no need to use {{StringBuilder}} if writing a fixed size of chars, 1
char at a time. The array can be written directly avoiding the overhead of
capacity checks in {{StringBuilder}}. This also removes the concatenation of
strings using {{+}} when a {{StringBuilder}} is in use (which is sub-optimal).
I note the SHA1 option is to provide a cryptographically secure random hex
strings as per SHA1PRNG:
[https://docs.oracle.com/javase/8/docs/technotes/guides/security/StandardNames.html#SecureRandom]
This is a nice addition. However it is documented and implemented wrong. It
states it generates the hex string in 40 byte blocks, but this is actually 40
char blocks, i.e. 2 chars per byte from 20 random bytes. But then the algorithm
implementation generates 40 random bytes to pass to the SHA1 digest when it
only needs to pass 20 (to match what is documented).
This should really be patched. I can create a ticket for commons math if you
would like.
If fixed this algorithm would the match of the {{NextBytesAndHexString}} in the
results table above, i.e. encoding bytes as a hex string.
However there is still the method to use {{nextInt}} directly as the source for
random bytes, avoiding having to create a {{byte[]}} and fill it. This is the
implementation I currently have working for binary, octal, hex and base64
(ASCII armoured text) strings.
I would like to do a PR for my example string sampler for the RNG examples
module. I recently added the utility to generate strings with any radix from
2-64 by using a simple implementation for all other radix sizes. So it can
build for example radix 36 strings using the alphabet {{[0-9A-Z]}}.
Note that the fast algorithm for using the bytes to generate a character index
could use any power of 2 for the Character radix. I didn't add all radix sizes
but instead concentrated on radix strings that have special meaning. This
should be fine for an examples class.
> StringSampler
> -------------
>
> Key: RNG-54
> URL: https://issues.apache.org/jira/browse/RNG-54
> Project: Commons RNG
> Issue Type: Improvement
> Components: sampling
> Affects Versions: 1.1
> Reporter: Alex D Herbert
> Priority: Minor
>
> There is currently no equivalent for the function
> {{org.apache.commons.math3.random.RandomDataGenerator.nextHexString(int)}}.
> Here is the original version adapted to use the {{UniformRandomProvider:}}
> {code:java}
> public String nextHexStringOriginal(UniformRandomProvider ran, int len) {
> // Initialize output buffer
> StringBuilder outBuffer = new StringBuilder();
> // Get int(len/2)+1 random bytes
> byte[] randomBytes = new byte[(len / 2) + 1];
> ran.nextBytes(randomBytes);
> // Convert each byte to 2 hex digits
> for (int i = 0; i < randomBytes.length; i++) {
> Integer c = Integer.valueOf(randomBytes[i]);
> /*
> * Add 128 to byte value to make interval 0-255 before doing hex
> conversion.
> * This guarantees <= 2 hex digits from toHexString() toHexString
> would
> * otherwise add 2^32 to negative arguments.
> */
> String hex = Integer.toHexString(c.intValue() + 128);
> // Make sure we add 2 hex digits for each byte
> if (hex.length() == 1) {
> hex = "0" + hex;
> }
> outBuffer.append(hex);
> }
> return outBuffer.toString().substring(0, len);
> }
> {code}
> Note: I removed the length check to make the speed test (see below) fair.
> This makes use of {{StringBuider}} and is not very efficient. I have created
> a version based on the Hex encoding within
> {{org.apache.commons.codec.digest.DigestUtils}} and
> {{org.apache.commons.codec.binary.Hex}}. This uses a direct look-up of the
> hex character using the index from successive 4 bits of a byte array to form
> an index from 0-15.
> Here's the function without details of how the {{byte[]}} is correctly sized:
> {code:java}
> private static final char[] DIGITS_LOWER = {
> '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd',
> 'e', 'f' };
> private static String nextHexString(UniformRandomProvider rng, byte[] bytes,
> int length) {
> rng.nextBytes(bytes);
> // Use the upper and lower 4 bits of each byte as an
> // index in the range 0-15 for each hex character.
> final char[] out = new char[length];
> // Run the loop without checking index j by producing characters
> // up to the size below the desired length.
> final int loopLimit = length / 2;
> int i = 0, j = 0;
> while (i < loopLimit) {
> final byte b = bytes[i];
> // 0x0F == 0x01 | 0x02 | 0x04 | 0x08
> out[j++] = DIGITS_LOWER[(b >>> 4) & 0x0F];
> out[j++] = DIGITS_LOWER[b & 0x0F];
> i++;
> }
> // The final character
> if (j < length)
> out[j++] = DIGITS_LOWER[(bytes[i] >>> 4) & 0x0F];
> return new String(out);
> }
> {code}
> I've compared this to the original function and a modified one below that
> computes the exact same strings:
> {code:java}
> public String nextHexStringModified(UniformRandomProvider ran, int len) {
> // Initialize output buffer
> StringBuilder outBuffer = new StringBuilder();
> // byte[] randomBytes = new byte[(len/2) + 1]; // ORIGINAL
> byte[] randomBytes = new byte[(len + 1) / 2];
> ran.nextBytes(randomBytes);
> // Convert each byte to 2 hex digits
> for (int i = 0; i < randomBytes.length; i++) {
> // ORIGINAL
> // Integer c = Integer.valueOf(randomBytes[i]);
> // String hex = Integer.toHexString(c.intValue() + 128);
> String hex = Integer.toHexString(randomBytes[i] & 0xff);
> // Make sure we add 2 hex digits for each byte
> if (hex.length() == 1) {
> outBuffer.append('0');
> }
> outBuffer.append(hex);
> }
> return outBuffer.toString().substring(0, len);
> }
> {code}
> The timings are:
>
> ||Name||Time||Relative||
> |StringSampler|316103|0.073|
> |nextHexStringModified|3708104|0.853|
> |nextHexStringOriginal|4348063|1.000|
> This is not using JMH but the results show the method performs better.
> The full {{StringSampler}} class supports a radix of 2, 8, and 16 for binary,
> octal and hex strings.
> JUnit tests show: the sampler computes the same values as
> {{nextHexStringModified(int);}} edges cases are handled with exceptions; and
> the output strings are uniform for each of the supported character sets
> (using a Chi Squared test).
> Can I create a PR for a {{org.apache.commons.rng.sampling.StringSampler}}?
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)