[jira] [Commented] (RNG-54) StringSampler

Alex D Herbert (JIRA) Mon, 24 Sep 2018 07:41:14 -0700


    [ 
https://issues.apache.org/jira/browse/RNG-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625913#comment-16625913
 ]


Alex D Herbert commented on RNG-54:
-----------------------------------

{quote}I meant "core implementation of the generation of bytes that can be 
interpreted as a sequence of hexadecimal digits".
{quote}
Sorry, I missed that.

In its basic form, each byte yields 2 hex chars. So the basic implementation 
is to get a byte array of half the desired string length (rounded up):
{code:java}
int length  = ...; // Desired String length
byte[] bytes = new byte[(length+1) / 2];
rng.nextBytes(bytes);
// Then convert 'length' 4-bit blocks to hex chars
char[] chars = new char[length];
// ...{code}
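To make the elided conversion step concrete, here is a minimal sketch. The class and method names are hypothetical, and {{java.util.Random}} stands in for {{UniformRandomProvider}} only so that the example runs without Commons RNG on the classpath:

```java
import java.util.Random;

/** Hypothetical helper for illustration; not part of Commons RNG. */
final class HexStringSketch {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    /** Converts the first 'length' 4-bit blocks of 'bytes' to hex chars. */
    static String toHex(byte[] bytes, int length) {
        if (length < 0 || bytes.length < (length + 1) / 2) {
            throw new IllegalArgumentException("Not enough bytes for length " + length);
        }
        final char[] chars = new char[length];
        for (int j = 0; j < length; j++) {
            final int b = bytes[j >> 1];
            // Even positions use the upper 4 bits, odd positions the lower 4 bits.
            final int nibble = (j & 1) == 0 ? (b >>> 4) & 0x0F : b & 0x0F;
            chars[j] = DIGITS[nibble];
        }
        return new String(chars);
    }

    /** Assumption: java.util.Random replaces UniformRandomProvider in this sketch. */
    static String nextHexString(Random rng, int length) {
        final byte[] bytes = new byte[(length + 1) / 2];
        rng.nextBytes(bytes);
        return toHex(bytes, length);
    }
}
```

With an odd length the lower 4 bits of the last byte are simply unused.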
I can see that the SHA1 method generates its bytes differently. But perhaps 
that is better put into a separate method in a different class:
{code:java}
byte[] nextSHA1Bytes(UniformRandomProvider rng, int length);
{code}
This could even be a wrapper around a {{UniformRandomProvider}} to re-implement 
{{UniformRandomProvider.nextBytes(byte[])}}.
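
As a rough sketch of what such a method might look like: I am assuming here that the SHA1 method means hashing blocks of random bytes and using the digests as the output. The class name, the 40-byte block size, and the use of {{java.util.Random}} in place of {{UniformRandomProvider}} are all assumptions made to keep the example self-contained:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Random;

/** Hypothetical helper for illustration; not part of Commons RNG. */
final class Sha1BytesSketch {
    /** Fills an array of the given length with SHA-1 digests of random blocks. */
    static byte[] nextSHA1Bytes(Random rng, int length) {
        final MessageDigest sha1;
        try {
            sha1 = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            // Not expected: SHA-1 is a required algorithm on the Java platform.
            throw new IllegalStateException(e);
        }
        final byte[] out = new byte[length];
        // Assumption: the block size is an arbitrary choice for this sketch.
        final byte[] block = new byte[40];
        int filled = 0;
        while (filled < length) {
            rng.nextBytes(block);
            // digest() returns 20 bytes and resets the MessageDigest.
            final byte[] hash = sha1.digest(block);
            final int n = Math.min(hash.length, length - filled);
            System.arraycopy(hash, 0, out, filled, n);
            filled += n;
        }
        return out;
    }
}
```

The output depends only on the bytes drawn from the underlying generator, so the same seed gives the same result.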

So if you wanted to emulate the method from CM you could do:
{code:java}
UniformRandomProvider rng = ...;
int length  = ...; // Desired String length
// This class will override nextBytes(byte[]) and 
// delegate (or throw) for the other interface methods
UniformRandomProvider sha1Rng = new SHA1UniformRandomProvider(rng);
String s = RadixStringSampler.nextString(sha1Rng, length, 16);
{code}

Sampling strings then becomes an example of encoding.
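
To illustrate the encoding view, here is a minimal sketch of turning a byte array into a string for a radix of 2, 8 or 16 by reading 1, 3 or 4 bits per character. The class and method names are hypothetical and this is not how {{RadixStringSampler}} is necessarily implemented; it only shows that, once the bytes exist, the rest is a pure encoding step:

```java
/** Hypothetical encoder for illustration; the names are not from Commons RNG. */
final class RadixEncodingSketch {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    /** Encodes the leading bits of 'bytes' as 'length' digits in the given radix. */
    static String encode(byte[] bytes, int length, int radix) {
        final int bits;
        switch (radix) {
            case 2:  bits = 1; break;
            case 8:  bits = 3; break;
            case 16: bits = 4; break;
            default: throw new IllegalArgumentException("Unsupported radix: " + radix);
        }
        if (length < 0 || (long) length * bits > (long) bytes.length * 8) {
            throw new IllegalArgumentException("Not enough bytes for length " + length);
        }
        final char[] out = new char[length];
        long bitIndex = 0;
        for (int j = 0; j < length; j++) {
            int value = 0;
            // Read bits most-significant first; a digit may span two bytes (radix 8).
            for (int k = 0; k < bits; k++, bitIndex++) {
                final int b = bytes[(int) (bitIndex >>> 3)];
                value = (value << 1) | ((b >>> (7 - (int) (bitIndex & 7))) & 1);
            }
            out[j] = DIGITS[value];
        }
        return new String(out);
    }
}
```

For example, the bytes 0xAB 0xCD encode to "abcd" in radix 16, and their first 15 bits encode to "52746" in radix 8.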


> StringSampler
> -------------
>
>                 Key: RNG-54
>                 URL: https://issues.apache.org/jira/browse/RNG-54
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: sampling
>    Affects Versions: 1.1
>            Reporter: Alex D Herbert
>            Priority: Minor
>
> There is currently no equivalent for the function 
> {{org.apache.commons.math3.random.RandomDataGenerator.nextHexString(int)}}.
> Here is the original version adapted to use the {{UniformRandomProvider}}:
> {code:java}
> public String nextHexStringOriginal(UniformRandomProvider ran, int len) {
>     // Initialize output buffer
>     StringBuilder outBuffer = new StringBuilder();
>     // Get int(len/2)+1 random bytes
>     byte[] randomBytes = new byte[(len / 2) + 1];
>     ran.nextBytes(randomBytes);
>     // Convert each byte to 2 hex digits
>     for (int i = 0; i < randomBytes.length; i++) {
>         Integer c = Integer.valueOf(randomBytes[i]);
>         /*
>          * Add 128 to byte value to make interval 0-255 before doing hex conversion.
>          * This guarantees <= 2 hex digits from toHexString(). toHexString would
>          * otherwise add 2^32 to negative arguments.
>          */
>         String hex = Integer.toHexString(c.intValue() + 128);
>         // Make sure we add 2 hex digits for each byte
>         if (hex.length() == 1) {
>             hex = "0" + hex;
>         }
>         outBuffer.append(hex);
>     }
>     return outBuffer.toString().substring(0, len);
> }
> {code}
> Note: I removed the length check to make the speed test (see below) fair.
> This makes use of {{StringBuilder}} and is not very efficient. I have created 
> a version based on the Hex encoding within 
> {{org.apache.commons.codec.digest.DigestUtils}} and 
> {{org.apache.commons.codec.binary.Hex}}. This looks up each hex character 
> directly, using successive 4-bit blocks of the byte array as an index in the 
> range 0-15.
> Here's the function without details of how the {{byte[]}} is correctly sized:
> {code:java}
> private static final char[] DIGITS_LOWER = {
>     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f' };
> private static String nextHexString(UniformRandomProvider rng, byte[] bytes, int length) {
>     rng.nextBytes(bytes);
>     // Use the upper and lower 4 bits of each byte as an
>     // index in the range 0-15 for each hex character.
>     final char[] out = new char[length];
>     // Run the loop without checking index j by producing characters
>     // up to the size below the desired length.
>     final int loopLimit = length / 2;
>     int i = 0, j = 0;
>     while (i < loopLimit) {
>         final byte b = bytes[i];
>         // 0x0F == 0x01 | 0x02 | 0x04 | 0x08
>         out[j++] = DIGITS_LOWER[(b >>> 4) & 0x0F];
>         out[j++] = DIGITS_LOWER[b & 0x0F];
>         i++;
>     }
>     // The final character
>     if (j < length)
>         out[j++] = DIGITS_LOWER[(bytes[i] >>> 4) & 0x0F];
>     return new String(out);
> }
> {code}
> I've compared this to the original function and to a modified version 
> (below) that computes exactly the same strings as the new function:
> {code:java}
> public String nextHexStringModified(UniformRandomProvider ran, int len) {
>     // Initialize output buffer
>     StringBuilder outBuffer = new StringBuilder();
>     // byte[] randomBytes = new byte[(len/2) + 1]; // ORIGINAL
>     byte[] randomBytes = new byte[(len + 1) / 2];
>     ran.nextBytes(randomBytes);
>     // Convert each byte to 2 hex digits
>     for (int i = 0; i < randomBytes.length; i++) {
>         // ORIGINAL
>         // Integer c = Integer.valueOf(randomBytes[i]);
>         // String hex = Integer.toHexString(c.intValue() + 128);
>         String hex = Integer.toHexString(randomBytes[i] & 0xff);
>         // Make sure we add 2 hex digits for each byte
>         if (hex.length() == 1) {
>             outBuffer.append('0');
>         }
>         outBuffer.append(hex);
>     }
>     return outBuffer.toString().substring(0, len);
> }
> {code}
> The timings are:
>  
> ||Name||Time||Relative||
> |StringSampler|316103|0.073|
> |nextHexStringModified|3708104|0.853|
> |nextHexStringOriginal|4348063|1.000|
> These timings were not measured with JMH, but they show that the new method 
> performs better (about 7% of the original's time).
> The full {{StringSampler}} class supports a radix of 2, 8, and 16 for binary, 
> octal and hex strings.
> JUnit tests show: the sampler computes the same values as 
> {{nextHexStringModified(int)}}; edge cases are handled with exceptions; and 
> the output strings are uniform for each of the supported character sets 
> (using a chi-squared test).
> Can I create a PR for a {{org.apache.commons.rng.sampling.StringSampler}}?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
