This message derives an integer interval mapping function that is not only provably within the given range but also "uniform". This is the function to use when mapping random integers from one interval to a smaller one.
Problem: given an integer n within [0..maxn], design a scaling function sc(n) that returns an integer within [s..t], t>=s. The function sc(n) must be 'monotonic': 0<=a < b ==> sc(a) <= sc(b) and it must map the ends of the interval: sc(0) -> s and sc(maxn) -> t. One such function has been designed by Simon Thompson in his book: sct(n) = n `div` ((maxn+1) `div` (t-s+1)) + s Another function has been derived in the previous message on this list: sc1(n) = (n*(t-s)) `div` maxn + s The sct() function has a serious drawback: it is not guaranteed that sct(n) <= t (more about it below). The sc1() function is provably within the given range, and monotonic. Alas, it is not "uniform." When it comes to mapping of random integers, the mapping function must possess an extra property: it must "map uniformly." That is, given random numbers within [0..maxn] such that Prob(n==i) - const, we must aim at Prob(sc(n)==i') - const forall i' in [s..t]. The mapping function sc1(n) does not have this property. Any mapping function sc(n) from [0..maxn] to [s..t] for (t-s+1) < (maxn+1) maps several integers from the source interval to one integer of the target interval. An integer of the target interval therefore denotes an equivalence class of the integers from the source interval. The uniformity property will be satisfied only if the equivalence classes have the same size. This is only possible if (maxn+1) `rem` (t-s+1) == 0. Given this condition, the original Thompson's algorithm works. If (t-s+1) divides (maxn+1) exactly (i.e., exists denom integer. denom*(t-s+1) == maxn+1), we have j*denom <= n <= j*denom+denom-1 ==> sct(n) = j+s, all j=0..(t-s) Thus each equivalence class has the size denom, and mapping is perfectly uniform. As a consequence, s <= sct(n) <= t, given our assumption (maxn+1) `rem` (t-s+1) == 0. If this assumption does not hold, sct(n) may be greater than t (which Dimitre Novatchev has demonstrated). Alas, in this case there is no perfectly uniform mapping. The Thompson's algorithm misbehaves. The worst case for the algorithm is when t = s + (maxn+1)/2. For example, this occurs when we try to map [0..65535] to [0..32768]. In this case, t-s+1 = 32769, and denom = 65536 `div` 32769 == 1, thus the mapping becomes sct(n) = n. In almost half of the time, this mapping exceeds the upper limit 32768. OTH, the mapping sc1(n) = n*(maxn+1)/2 `div` maxn + s behaves in this pathological case rather well. Not only sc1(n) <= t (this is always guaranteed) but also the mapping is uniform: all equivalence classes of the source interval [0..65535] except the last two classes have the size of two. Thus the best solution should be to "average" sc1(n) and sct(n). This "averaging" is done by the formula sca = (n*(t-s) + (n*(denom-1) `div` denom)) `div` maxn + s where denom = (maxn+1) `div` (t-s+1) It is easy to see that sca(0) -> s. sca(maxn) = (maxn*(t-s) + (maxn*(denom-1) `div` denom)) `div` maxn + s = (maxn*(t-s) + p) `div` maxn + s {where 0<= p = (maxn*(denom-1) `div` denom) <= maxn-1 } = t-s+s = t It is easy to show that sca(n) is monotonic (as defined above). If maxn+1 is evenly divisible by (t-s+1) (that is, denom*(t-s+1) == maxn+1), we have j*denom <= n <= j*denom+denom-1 ==> sca(n) = j+s, all j=0..(t-s) This requires a proof. We have: sca(j*denom) = (j*denom*(t-s) + (j*denom*(denom-1) `div` denom)) `div` maxn + s = (j*denom*(t-s) + j*(denom-1)) `div` maxn + s = j*(denom*(t-s) + denom-1) `div` maxn + s = j*maxn `div` maxn + s = j+s sca(j*denom+denom-1) = ((j*denom+denom-1)*(t-s) + ((j*denom+denom-1)*(denom-1) `div` denom)) `div` maxn + s = ((j*denom+denom-1)*(t-s) + (j+1)*(denom-1)-1) `div` maxn + s = (j*maxn + (denom-1)*(t-s+1)) `div` maxn + s = j+s The required result then follows by the monotonicity property. In the most pathological case of maxn= 65535, s=0, t= 32768. We have denom = 1 and sca(n) becomes sc1(n), which, in this case, behaves very well. Thus the sca mapping is the optimal one. It is guaranteed to keep the result within the specified bounds at all times. When the perfect uniformity of mapping is possible, it is always achieved. The implementation isn't too complex either. > scaleSequence :: Int -> Int -> [Int] -> [Int] > scaleSequence s t > = map scale > where > scale n = (n*range + ((n*(denom-1)) `div` denom)) `div` maxn + s > range = t - s > maxn = modulus - 1 > modulus = 65536 > denom = (maxn+1) `div` (t-s+1) Main> any(>966) (scaleSequence 1 966 [0..65535]) False Main> any(==966) (scaleSequence 1 966 [0..65535]) True Main> any(>32768) (scaleSequence 0 32768 [0..65535]) False > test = let sq = [32768-1..32768+1600] in > let result = > map (\x -> let t = length x in (t `seq` t)) $! groupBy (\(a,_) (b,_) -> a == >b) $! > sortBy (\(a,_) (b,_) -> compare a b) $! > zip (scaleSequence 0 32768 sq) $! sq > in (maximum $! result, minimum $! result) ==> (2,1) -- e.g., the mapping is roughly uniform _______________________________________________ Haskell mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell