[
https://issues.apache.org/jira/browse/RNG-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16582795#comment-16582795
]
Alex D Herbert commented on RNG-54:
-----------------------------------
I just had a look at commons Text and that currently has no dependencies on
commons RNG.
However the class {{org.apache.commons.text.RandomStringGenerator}} can create
a random string using any source of randomness that can create nextInt(int
max). The Javadoc recommends using commons RNG to provide this function.
However it will fall back to {{java.util.concurrent.ThreadLocalRandom}} if
missing.
The class actually provides random functionality for strings with any character
set (passed as a list), or using a range of character code points.
There is also {{org.apache.commons.lang3.RandomStringUtils}} which also
supports picking from an array of char. So I've added these to a benchmark
using JMH.
There is also the options of using Commons Codec to encode the bytes array into
something. So I've added this to the test.
All methods were called with the same RNG (wrapped if appropriate) and restored
to the same sate before each run. The table shows the relative speed using the
average of 10 runs generating 10000 random strings of the given length:
||Benchmark||Benchmark||Length||Relative||Score||Score Error (99.9%)||Source||
|Codec|NextBytesAndHexString|32|0.070|655.422|134.936|MWC_256|
|Math|RandomDataGenerator_nextHexStringModified|32|0.493|4580.968|593.928|MWC_256|
|Math|RandomDataGenerator_nextHexStringOriginal|32|0.644|5989.369|1157.043|MWC_256|
|Text|RandomStringGeneratorWith16CodePoints|32|0.391|3635.981|492.750|MWC_256|
|Text|RandomStringGeneratorWithCharArray|32|1.000|9300.704|1807.697|MWC_256|
|Lang|RandomStringUtilsWith16CodePoints|32|0.242|2247.133|352.087|MWC_256|
|Lang|RandomStringUtilsWithCharArray|32|0.217|2015.916|424.406|MWC_256|
|new|StringSampler|32|0.064|595.156|3.724|MWC_256|
|Codec|NextBytesAndHexString|1024|0.063|17660.668|2972.286|MWC_256|
|Math|RandomDataGenerator_nextHexStringModified|1024|0.575|162216.015|42357.048|MWC_256|
|Math|RandomDataGenerator_nextHexStringOriginal|1024|0.725|204565.773|27611.786|MWC_256|
|Text|RandomStringGeneratorWith16CodePoints|1024|0.396|111745.938|8382.560|MWC_256|
|Text|RandomStringGeneratorWithCharArray|1024|1.000|282045.389|51805.233|MWC_256|
|Lang|RandomStringUtilsWith16CodePoints|1024|0.268|75699.276|10016.061|MWC_256|
|Lang|RandomStringUtilsWithCharArray|1024|0.198|55732.681|195.122|MWC_256|
|new|StringSampler|1024|0.062|17350.684|3769.534|MWC_256|
So:
* RandomStringGenerator with a {{char[]}} is slower than the other
implementations in other commons packages. This is because the char is
generated by conversion of a character to a string then to a code point. The
method that uses a range of code points directly is much faster.
* The old Commons Math implementation is the next slowest. It converts to a
Hex character using {{Integer.toHexString}} on each byte which generates 1 or 2
hex characters.
* Commons Lang RandomStringUtils is fastest of the current implementations
that support flexible charsets.
* StringSampler is fastest. 2x faster than Commons Lang for small strings and
4x faster than Commons Lang for long strings.
* However using Commons Codec to encode the byte array is basically the same
speed.
All the current implementations use a {{StringBuilder}} to create the string
and use {{nextInt(int)}} once per character.
{{StringSampler}} directly writes to a {{char[]}} and uses the
{{nextBytes(byte[] bytes)}} method of {{UniformRandomProvider}}. The
implementation is actually a reverse of the packing of int or long values into
a byte array performed by the base
{{org.apache.commons.rng.core.source32.IntProvider}} and
{{org.apache.commons.rng.core.source64.LongProvider}}.
This is basically matched by using Commons Codec and so if anything the
nextHexString method should probably be added to that library.
> StringSampler
> -------------
>
> Key: RNG-54
> URL: https://issues.apache.org/jira/browse/RNG-54
> Project: Commons RNG
> Issue Type: Improvement
> Components: sampling
> Affects Versions: 1.1
> Reporter: Alex D Herbert
> Priority: Minor
>
> There is currently no equivalent for the function
> {{org.apache.commons.math3.random.RandomDataGenerator.nextHexString(int)}}.
> Here is the original version adapted to use the {{UniformRandomProvider:}}
> {code:java}
> public String nextHexStringOriginal(UniformRandomProvider ran, int len) {
> // Initialize output buffer
> StringBuilder outBuffer = new StringBuilder();
> // Get int(len/2)+1 random bytes
> byte[] randomBytes = new byte[(len / 2) + 1];
> ran.nextBytes(randomBytes);
> // Convert each byte to 2 hex digits
> for (int i = 0; i < randomBytes.length; i++) {
> Integer c = Integer.valueOf(randomBytes[i]);
> /*
> * Add 128 to byte value to make interval 0-255 before doing hex
> conversion.
> * This guarantees <= 2 hex digits from toHexString() toHexString
> would
> * otherwise add 2^32 to negative arguments.
> */
> String hex = Integer.toHexString(c.intValue() + 128);
> // Make sure we add 2 hex digits for each byte
> if (hex.length() == 1) {
> hex = "0" + hex;
> }
> outBuffer.append(hex);
> }
> return outBuffer.toString().substring(0, len);
> }
> {code}
> Note: I removed the length check to make the speed test (see below) fair.
> This makes use of {{StringBuider}} and is not very efficient. I have created
> a version based on the Hex encoding within
> {{org.apache.commons.codec.digest.DigestUtils}} and
> {{org.apache.commons.codec.binary.Hex}}. This uses a direct look-up of the
> hex character using the index from successive 4 bits of a byte array to form
> an index from 0-15.
> Here's the function without details of how the {{byte[]}} is correctly sized:
> {code:java}
> private static final char[] DIGITS_LOWER = {
> '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd',
> 'e', 'f' };
> private static String nextHexString(UniformRandomProvider rng, byte[] bytes,
> int length) {
> rng.nextBytes(bytes);
> // Use the upper and lower 4 bits of each byte as an
> // index in the range 0-15 for each hex character.
> final char[] out = new char[length];
> // Run the loop without checking index j by producing characters
> // up to the size below the desired length.
> final int loopLimit = length / 2;
> int i = 0, j = 0;
> while (i < loopLimit) {
> final byte b = bytes[i];
> // 0x0F == 0x01 | 0x02 | 0x04 | 0x08
> out[j++] = DIGITS_LOWER[(b >>> 4) & 0x0F];
> out[j++] = DIGITS_LOWER[b & 0x0F];
> i++;
> }
> // The final character
> if (j < length)
> out[j++] = DIGITS_LOWER[(bytes[i] >>> 4) & 0x0F];
> return new String(out);
> }
> {code}
> I've compared this to the original function and a modified one below that
> computes the exact same strings:
> {code:java}
> public String nextHexStringModified(UniformRandomProvider ran, int len) {
> // Initialize output buffer
> StringBuilder outBuffer = new StringBuilder();
> // byte[] randomBytes = new byte[(len/2) + 1]; // ORIGINAL
> byte[] randomBytes = new byte[(len + 1) / 2];
> ran.nextBytes(randomBytes);
> // Convert each byte to 2 hex digits
> for (int i = 0; i < randomBytes.length; i++) {
> // ORIGINAL
> // Integer c = Integer.valueOf(randomBytes[i]);
> // String hex = Integer.toHexString(c.intValue() + 128);
> String hex = Integer.toHexString(randomBytes[i] & 0xff);
> // Make sure we add 2 hex digits for each byte
> if (hex.length() == 1) {
> outBuffer.append('0');
> }
> outBuffer.append(hex);
> }
> return outBuffer.toString().substring(0, len);
> }
> {code}
> The timings are:
>
> ||Name||Time||Relative||
> |StringSampler|316103|0.073|
> |nextHexStringModified|3708104|0.853|
> |nextHexStringOriginal|4348063|1.000|
> This is not using JMH but the results show the method performs better.
> The full {{StringSampler}} class supports a radix of 2, 8, and 16 for binary,
> octal and hex strings.
> JUnit tests show: the sampler computes the same values as
> {{nextHexStringModified(int);}} edges cases are handled with exceptions; and
> the output strings are uniform for each of the supported character sets
> (using a Chi Squared test).
> Can I create a PR for a {{org.apache.commons.rng.sampling.StringSampler}}?
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)