Matti Nykyri <[email protected]> schrieb:
> On Jun 27, 2014, at 0:00, Kai Krakow <[email protected]> wrote:
>
>> Matti Nykyri <[email protected]> wrote:
>>
>>> If you are looking for a mathematically perfect solution, there is a
>>> simple one even if the size of your list is not a power of 2! Take 6
>>> bits of the random data at a time. If the result is 62 or 63, discard
>>> it and take the next 6 bits. This selectively modifies the random data
>>> but keeps the probabilities in correct balance. Now the probability for
>>> each index 0-61 is 1/62, because the probability to get 62-63 out of 64
>>> is 0.
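(For reference, that rejection approach could look roughly like this in C -
just a sketch, with get_6bit_random() as a placeholder for a source of
uniform 6-bit values:)

int get_index_by_rejection() {
    int r;
    do {
        r = get_6bit_random(); /* assumed: uniform value in 0..63 */
    } while (r > 61);          /* discard 62 and 63, redraw */
    return r;                  /* each index 0..61 keeps probability 1/62 */
}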
>>
>> Why not do just something like this?
>>
>> index = 0;
>> while (true) {
>>     index = (index + get_6bit_random()) % 62;
>>     output << char_array[index];
>> }
>>
>> Done, no bits wasted. It should have a perfect distribution, too. We also
>> don't have to throw away random data just to stay within unaligned
>> boundaries. The misalignment is carried over into the next loop, so the
>> "error" corrects itself over time (it becomes distributed over the whole
>> set).
>
> Distribution will not be perfect. The same original problem persists:
> the probability for indices 0 and 1 will be 2/64, while for 2 to 61 it
> will be 1/64. The addition merely shifts this, so the doubled probability
> now applies relative to the previous character instead of to a fixed
> index.
>
> The distribution over, say, 10GB of data should be quite even, but not on
> a small scale. The next char will depend on the previous char. It is 100%
> more likely that the next char is the same as, or one index above, the
> previous char than any other one in the series. So it is likely that you
> will see long runs of the same character.
I cannot follow your reasoning here - but I'd like to learn. Actually, I ran
this multiple times and never saw long runs of the same character, nor even
short ones. The 0 or 1 is always rolled over into the next random addition.
I would only get runs of the same character if rand() returned zero several
times in a row - which wouldn't be really random. ;-)
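To check the claimed bias exhaustively rather than by sampling, one could
tabulate what (index + r) % 62 produces for a fixed previous index over all
64 possible 6-bit values - a quick sketch:

#include <stdio.h>

int main(void) {
    int counts[62] = {0};
    int prev = 17; /* arbitrary previous index; the shape is the same for any */
    for (int r = 0; r < 64; r++)
        counts[(prev + r) % 62]++; /* every 6-bit value exactly once */
    for (int i = 0; i < 62; i++)
        printf("index %2d: %d/64\n", i, counts[i]);
    /* prints 2/64 for prev and prev+1, 1/64 for all other indices */
    return 0;
}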
Keep in mind: the last index is reused whenever you enter the function - it
is not reset to zero. Still, that primitive implementation had a flaw: it
tends to select characters beyond the current offset if the offset is >= 1/2
into the complete set, and otherwise prefers characters before the offset.
In my tests I counted how often new_index > index and how often new_index <
index, and there was a clear bias toward the former. So I added swapping of
the selected character with the one at offset 0 in the set. Now the
characters get swapped around, which starts to spread that flaw across the
set. The distribution, however, didn't change.
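A test along those lines is simple enough (a sketch, not my exact code;
rand() stands in for the 6-bit source and is good enough for counting):

#include <stdio.h>
#include <stdlib.h>

/* crude stand-in for the 6-bit random source */
static int get_6bit_rand(void) { return rand() & 63; }

int main(void) {
    int index = 0;
    long above = 0, below = 0, same = 0;
    for (long i = 0; i < 10000000; i++) {
        int new_index = (index + get_6bit_rand()) % 62;
        if (new_index > index) above++;
        else if (new_index < index) below++;
        else same++;
        index = new_index;
    }
    printf("above: %ld  below: %ld  same: %ld\n", above, below, same);
    return 0;
}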
Of course I'm no mathematician, so I don't know how I'd calculate the
probabilities for my implementation, because get_rand() behaves like a
recursive function when you look at it over time:
int get_rand() {
    static int index = 0;
    /* Roll the previous index forward by a fresh 6-bit random value. */
    return (index = (index + get_6bit_rand()) % 62);
}

char get_char() {
    int index = get_rand();
    /* Swap the selected character with the one at offset 0 to spread
     * the positional bias over the whole set. */
    char tmp = chars[index];
    chars[index] = chars[0];
    return (chars[0] = tmp);
}
However, get_char() should return evenly distributed results.
What this shows is that while the distribution is even across the result
set, the implementation may still be flawed, because parts of the output can
be predictable. Or in other words: an even distribution of results alone is
no indicator of randomness. I could change get_rand() in the following way:
int get_rand() {
    static int index = 0;
    /* Just a counter - perfectly even distribution, zero randomness. */
    return (index = (index + 1) % 62);
}
Results would be evenly distributed, but clearly they are not random.
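A histogram check makes that concrete: the counter version above passes it
perfectly. A minimal sketch:

#include <stdio.h>

static int get_rand(void) {
    static int index = 0;
    return (index = (index + 1) % 62); /* plain counter, not random at all */
}

int main(void) {
    long counts[62] = {0};
    for (long i = 0; i < 6200000; i++)
        counts[get_rand()]++;
    for (int i = 0; i < 62; i++)
        printf("%2d: %ld\n", i, counts[i]); /* every index: exactly 100000 */
    return 0;
}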
--
Replies to list only preferred.