Matti Nykyri <matti.nyk...@iki.fi> schrieb:

>> On Jun 27, 2014, at 0:00, Kai Krakow <hurikha...@gmail.com> wrote:
>> 
>> Matti Nykyri <matti.nyk...@iki.fi> schrieb:
>> 
>>> If you are looking for a mathematically perfect solution there is a
>>> simple one even if your list size is not a power of 2! Take 6 bits at a
>>> time of the random data. If the result is 62 or 63 you discard the data
>>> and get the next 6 bits. This selectively modifies the random data but
>>> keeps the probabilities in correct balance. Now the probability for an
>>> index of 0-61 is 1/62 because the probability to get 62-63 out of 64 is 0.
>> 
>> Why not do just something like this?
>> 
>> index = 0;
>> while (true) {
>>  index = (index + get_6bit_random()) % 62;
>>  output << char_array[index];
>> }
>> 
>> Done, no bits wasted. Should have perfect distribution also. We also
>> don't have to throw away random data just to stay within unaligned
>> boundaries. The unalignment is being taken over into the next loop so the
>> "error" corrects itself over time (it becomes distributed over the whole
>> set).
> 
> Distribution will not be perfect. The same original problem persists.
> The probability for indices 0 and 1 will be 2/64, and for 2 to 61 it will
> be 1/64. The addition only changes this so that the bias attaches to an
> offset from the previous character instead of to a fixed index.
> 
> The distribution over something like 10GB of data should be quite even,
> but not on a small scale. The next char will depend on the previous char.
> It is 100% more likely that the next char is the same as, or one index
> above, the previous char than any of the other ones in the series. So it
> is likely that you will have long runs of the same character.

I cannot follow your reasoning here - but I'd like to learn. Actually, I ran 
this multiple times and never saw long runs of the same character, not even 
short runs. The 0 or 1 is always rolled over into the next random addition. I 
would only get runs of the same character if the 6-bit source returned 0 (or 
62, which reduces to 0 mod 62) multiple times in a row - which wouldn't be 
really random. ;-)

Keep in mind: The last index will be reused whenever you'd enter the 
function - it won't reset to zero. But still that primitive implementation 
had a flaw: It will tend to select characters beyond the current offset, if 
it is >= 1/2 into the complete set, otherwise it will prefer selecting 
characters before the offset.

In my tests I counted how often new_index > index and how often new_index < 
index, and there was a clear bias toward the first. So I added swapping of 
the selected character with the one at offset 0 in the set. Now the 
characters get swapped around and that flaw starts to get distributed. The 
distribution itself, however, didn't change.

Of course I'm no mathematician, I don't know how I'd calculate the 
probabilities for my implementation because it is sort of a recursive 
function (for get_rand()) when looking at it over time:

/* Rolling index: each call adds fresh 6-bit randomness to the
   previous index, modulo the set size of 62. */
int get_rand() {
  static int index = 0;
  return (index = (index + get_6bit_rand()) % 62);
}

/* Pick a character, then swap it with the character at offset 0
   so the flaw gets distributed over the whole set. */
char get_char() {
  int index = get_rand();
  char tmp = chars[index];
  chars[index] = chars[0];
  return (chars[0] = tmp);
}

However, get_char() should return evenly distributed results.

What this shows is that while the distribution is even across the result 
set, the implementation may still be flawed because results could be 
predictable for a subset of results. In other words: an even distribution 
of results alone is no indicator of randomness. I could change get_rand() 
in the following way:

int get_rand() {
  static int index = 0;
  return (index = (index + 1) % 62);
}

The results would be evenly distributed, but clearly they are not random.

-- 
Replies to list only preferred.
