Matti Nykyri <matti.nyk...@iki.fi> schrieb:

>> On Jun 27, 2014, at 0:00, Kai Krakow <hurikha...@gmail.com> wrote:
>>
>> Matti Nykyri <matti.nyk...@iki.fi> schrieb:
>>
>>> If you are looking for a mathematically perfect solution, there is a
>>> simple one even if your list size is not a power of 2! Take 6 bits at
>>> a time of the random data. If the result is 62 or 63, discard it and
>>> take the next 6 bits. This selectively modifies the random data but
>>> keeps the probabilities in correct balance. Now the probability for
>>> each index 0-61 is 1/62, because the probability of getting 62-63 out
>>> of 64 is 0.
>>
>> Why not just do something like this?
>>
>> index = 0;
>> while (true) {
>>   index = (index + get_6bit_random()) % 62;
>>   output << char_array[index];
>> }
>>
>> Done, no bits wasted. It should also have a perfect distribution. We
>> also don't have to throw away random data just to stay within
>> unaligned boundaries. The unalignment is carried over into the next
>> loop iteration, so the "error" corrects itself over time (it becomes
>> distributed over the whole set).
>
> The distribution will not be perfect. The same original problem
> persists. The probability for indices 0 and 1 will be 2/64, and for 2
> to 61 it will be 1/64. The addition changes this so that indices 0 and
> 1 are relative to the previous character rather than the original
> index.
>
> The distribution over, say, 10GB of data should be quite even, but not
> on a small scale. The next char will depend on the previous char. It
> is 100% more likely that the next char is the same as, or one index
> above, the previous char than any of the other ones in the series. So
> it is likely that you will get long runs of the same character.
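For reference, the rejection-sampling approach quoted above can be sketched like this (just a sketch; get_6bit_rand() is a placeholder for a real source of random bits, here backed by rand()):

```c
#include <stdlib.h>

/* Placeholder 6-bit random source; a real implementation would take
 * 6 bits at a time from the actual random data. */
static unsigned get_6bit_rand(void) {
    return (unsigned)rand() & 0x3f;  /* 0..63 */
}

/* Rejection sampling: the values 62 and 63 are discarded and fresh
 * bits are drawn, so each index 0..61 keeps an exact probability
 * of 1/62. */
static unsigned get_index(void) {
    unsigned v;
    do {
        v = get_6bit_rand();
    } while (v >= 62);
    return v;
}
```

On average 2 out of every 64 draws are thrown away, which is the "wasted bits" cost the follow-up tries to avoid.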
I cannot follow your reasoning here - but I'd like to learn. Actually, I
ran this multiple times and never saw long runs of the same character -
not even short runs. The 0 or 1 is always rolled over into the next
random addition. I would only get runs of the same character if rand()
returned zero multiple times in a row - which wouldn't be really
random. ;-)

Keep in mind: the last index is reused whenever you enter the function -
it is not reset to zero.

But that primitive implementation still had a flaw: it tends to select
characters beyond the current offset if the offset is >= 1/2 into the
complete set, and otherwise prefers characters before the offset. In my
tests I counted how often new_index > index and new_index < index, and
there was a clear bias towards the first. So I added swapping of the
selected index with offset 0 in the set. Now the characters are swapped
around, which starts to distribute that flaw. The distribution, however,
didn't change.

Of course I'm no mathematician, and I don't know how I'd calculate the
probabilities for my implementation, because get_rand() is in effect a
recursive function when looking at it over time:

int get_rand() {
  static int index = 0;
  return (index = (index + get_6bit_rand()) % 62);
}

char get_char() {
  int index = get_rand();
  char tmp = chars[index];
  chars[index] = chars[0];
  return (chars[0] = tmp);
}

However, get_char() should return evenly distributed results. What this
shows is that while the distribution is even across the result set, the
implementation may still be flawed because results could be predictable
for a subset of results. Or in other words: simply looking at the
distribution of results is not an indicator of randomness.

I could change get_rand() in the following way:

int get_rand() {
  static int index = 0;
  return (index = (index + 1) % 62);
}

The results would be evenly distributed, but clearly not random.

-- 
Replies to list only preferred.
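P.S.: The even-distribution-versus-dependence point can be checked empirically. A quick simulation of the accumulator (a sketch; rand() stands in for a real random source, and next_index()/simulate() are names I made up for illustration) shows each index occurring about equally often, while consecutive repeats occur at roughly 2/64 instead of the 1/62 that independent uniform draws would give:

```c
#include <stdlib.h>

#define SET_SIZE 62

/* The accumulator from the discussion: roll a fresh 6-bit value into
 * the previous index, mod 62. rand() is a stand-in for real random
 * data. */
static int next_index(int prev) {
    return (prev + (rand() & 0x3f)) % SET_SIZE;
}

/* Run n steps; fill counts[] with how often each index occurs and
 * return how often two consecutive indices were equal. */
static long simulate(long n, long counts[SET_SIZE]) {
    long repeats = 0, i;
    int index = 0;
    for (i = 0; i < n; i++) {
        int next = next_index(index);
        counts[next]++;
        if (next == index)
            repeats++;
        index = next;
    }
    return repeats;
}
```

With 10^6 samples, each of the 62 buckets lands near 10^6/62 (about 16129) hits, but the repeat rate comes out near 2/64 = 0.031 rather than 1/62 = 0.016 - an even marginal distribution, yet the draws are not independent.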