[jira] [Commented] (RNG-54) StringSampler

Alex D Herbert (JIRA) Sat, 22 Sep 2018 16:37:40 -0700


    [ 
https://issues.apache.org/jira/browse/RNG-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624863#comment-16624863
 ]


Alex D Herbert commented on RNG-54:
-----------------------------------

The suggested class in Commons Math is close to the {{NextBytesAndHexString}} 
in the results table above, i.e. encoding bytes as a hex string.
  
 But the way it generates the hex characters is non optimal.
  
 This:
  
{code:java}
final Integer c = Integer.valueOf(randomBytes[i]);

// Add 128 to byte value to make interval 0-255 before
// conversion to hex.
// This guarantees <= 2 hex digits from "toHexString".
// "toHexString" would otherwise add 2^32 to negative arguments.
String hex = Integer.toHexString(c.intValue() + 128);

// Make sure we add 2 hex digits for each byte.
if (hex.length() == 1) {
    hex = "0" + hex;
}
outBuffer.append(hex);
{code}
Should really be:
{code:java}
// Hex table
private static final char[] TABLE16 = {
            '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
            'A', 'B', 'C', 'D', 'E', 'F'
};

// Write chars directly (no StringBuilder) as the size is fixed.
// Use the upper and lower 4 bits to generate 2 hex characters.
outBuffer[count++] = TABLE16[ (randomBytes[i] >>> 4) & 0x0F ];
outBuffer[count++] = TABLE16[ randomBytes[i] & 0x0F ];
{code}
Using {{Integer.valueOf}} and {{Integer.toHexString}} is slower. Also note 
there is no need to use {{StringBuilder}} if writing a fixed size of chars, 1 
char at a time. The array can be written directly avoiding the overhead of 
capacity checks in {{StringBuilder}}. This also removes the concatenation of 
strings using {{+}} when a {{StringBuilder}} is in use (which is sub-optimal).

 

I note the SHA1 option is to provide a cryptographically secure random hex 
strings as per SHA1PRNG:
 
[https://docs.oracle.com/javase/8/docs/technotes/guides/security/StandardNames.html#SecureRandom]

This is a nice addition. However it is documented and implemented wrong. It 
states it generates the hex string in 40 byte blocks, but this is actually 40 
char blocks, i.e. 2 chars per byte from 20 random bytes. But then the algorithm 
implementation generates 40 random bytes to pass to the SHA1 digest when it 
only needs to pass 20 (to match what is documented).

This should really be patched. I can create a ticket for commons math if you 
would like.

If fixed this algorithm would the match of the {{NextBytesAndHexString}} in the 
results table above, i.e. encoding bytes as a hex string.

 

However there is still the method to use {{nextInt}} directly as the source for 
random bytes, avoiding having to create a {{byte[]}} and fill it. This is the 
implementation I currently have working for binary, octal, hex and base64 
(ASCII armoured text) strings.

I would like to do a PR for my example string sampler for the RNG examples 
module. I recently added the utility to generate strings with any radix from 
2-64 by using a simple implementation for all other radix sizes. So it can 
build for example radix 36 strings using the alphabet {{[0-9A-Z]}}.

Note that the fast algorithm for using the bytes to generate a character index 
could use any power of 2 for the Character radix. I didn't add all radix sizes 
but instead concentrated on radix strings that have special meaning. This 
should be fine for an examples class.

 

> StringSampler
> -------------
>
>                 Key: RNG-54
>                 URL: https://issues.apache.org/jira/browse/RNG-54
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: sampling
>    Affects Versions: 1.1
>            Reporter: Alex D Herbert
>            Priority: Minor
>
> There is currently no equivalent for the function 
> {{org.apache.commons.math3.random.RandomDataGenerator.nextHexString(int)}}.
> Here is the original version adapted to use the {{UniformRandomProvider:}}
> {code:java}
> public String nextHexStringOriginal(UniformRandomProvider ran, int len) {
>     // Initialize output buffer
>     StringBuilder outBuffer = new StringBuilder();
>     // Get int(len/2)+1 random bytes
>     byte[] randomBytes = new byte[(len / 2) + 1];
>     ran.nextBytes(randomBytes);
>     // Convert each byte to 2 hex digits
>     for (int i = 0; i < randomBytes.length; i++) {
>         Integer c = Integer.valueOf(randomBytes[i]);
>         /*
>          * Add 128 to byte value to make interval 0-255 before doing hex 
> conversion.
>          * This guarantees <= 2 hex digits from toHexString() toHexString 
> would
>          * otherwise add 2^32 to negative arguments.
>          */
>         String hex = Integer.toHexString(c.intValue() + 128);
>         // Make sure we add 2 hex digits for each byte
>         if (hex.length() == 1) {
>             hex = "0" + hex;
>         }
>         outBuffer.append(hex);
>     }
>     return outBuffer.toString().substring(0, len);
> }
> {code}
> Note: I removed the length check to make the speed test (see below) fair.
> This makes use of {{StringBuider}} and is not very efficient. I have created 
> a version based on the Hex encoding within 
> {{org.apache.commons.codec.digest.DigestUtils}} and 
> {{org.apache.commons.codec.binary.Hex}}. This uses a direct look-up of the 
> hex character using the index from successive 4 bits of a byte array to form 
> an index from 0-15.
> Here's the function without details of how the {{byte[]}} is correctly sized:
> {code:java}
> private static final char[] DIGITS_LOWER = { 
>     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 
> 'e', 'f' };
> private static String nextHexString(UniformRandomProvider rng, byte[] bytes, 
> int length) {
>     rng.nextBytes(bytes);
>     // Use the upper and lower 4 bits of each byte as an
>     // index in the range 0-15 for each hex character.
>     final char[] out = new char[length];
>     // Run the loop without checking index j by producing characters
>     // up to the size below the desired length.
>     final int loopLimit = length / 2;
>     int i = 0, j = 0;
>     while (i < loopLimit) {
>         final byte b = bytes[i];
>         // 0x0F == 0x01 | 0x02 | 0x04 | 0x08
>         out[j++] = DIGITS_LOWER[(b >>> 4) & 0x0F];
>         out[j++] = DIGITS_LOWER[b & 0x0F];
>         i++;
>     }
>     // The final character
>     if (j < length)
>         out[j++] = DIGITS_LOWER[(bytes[i] >>> 4) & 0x0F];
>     return new String(out);
> }
> {code}
> I've compared this to the original function and a modified one below that 
> computes the exact same strings:
> {code:java}
> public String nextHexStringModified(UniformRandomProvider ran, int len) {
>     // Initialize output buffer
>     StringBuilder outBuffer = new StringBuilder();
>     // byte[] randomBytes = new byte[(len/2) + 1]; // ORIGINAL
>     byte[] randomBytes = new byte[(len + 1) / 2];
>     ran.nextBytes(randomBytes);
>     // Convert each byte to 2 hex digits
>     for (int i = 0; i < randomBytes.length; i++) {
>         // ORIGINAL
>         // Integer c = Integer.valueOf(randomBytes[i]);
>         // String hex = Integer.toHexString(c.intValue() + 128);
>         String hex = Integer.toHexString(randomBytes[i] & 0xff);
>         // Make sure we add 2 hex digits for each byte
>         if (hex.length() == 1) {
>             outBuffer.append('0');
>         }
>         outBuffer.append(hex);
>     }
>     return outBuffer.toString().substring(0, len);
> }
> {code}
> The timings are:
>  
> ||Name||Time||Relative||
> |StringSampler|316103|0.073|
> |nextHexStringModified|3708104|0.853|
> |nextHexStringOriginal|4348063|1.000|
> This is not using JMH but the results show the method performs better.
> The full {{StringSampler}} class supports a radix of 2, 8, and 16 for binary, 
> octal and hex strings.
> JUnit tests show: the sampler computes the same values as 
> {{nextHexStringModified(int);}} edges cases are handled with exceptions; and 
> the output strings are uniform for each of the supported character sets 
> (using a Chi Squared test).
> Can I create a PR for a {{org.apache.commons.rng.sampling.StringSampler}}?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (RNG-54) StringSampler

Reply via email to