Jason Stover <[email protected]> writes: > Very creative, Ben. It's definitely worth a paper if it isn't already > published.
I'm not aware of a publication. I invented it myself by scribbling lots of stuff on paper, but that doesn't mean much. The page at http://comp9.psych.cornell.edu/Darlington/wilcoxon/wilcox0.htm seems to be by someone who has thought a bit about this, but he doesn't seem to have a good solution. I don't fully understand what it's doing, but the implementation in GNU R seems to be slow also. Where would you publish it, by the way? I don't know the statistics field. > I've read through the code of ranksum6 below, and it passed the few > tests I ran. Walking through the algorithm, it always seems to give > the correct value. Here's an explanation. For integers N and W, with N >= 1, let the function S(N,W) be the number of subsets of 1, 2, 3, ..., N that sum to at least W. There are 2**N subsets of N items, so S(N,W) is in the range 0...2**N. There are a few trivial cases: * For W <= 0, S(N,W) = 2**N. * For W > N*(N+1)/2, S(N,W) = 0. * S(1,1) = 1. Notably, these trivial cases include all values of W for N = 1. Now consider the remaining, nontrivial cases, that is, N > 1 and 1 <= W <= N*(N+1)/2. In this case, apply the following identity: S(N,W) = S(N-1, W) + S(N-1, W-N). The first term on the right hand is the number of subsets that do not include N that sum to at least W; the second term is the number of subsets that do include N that sum to at least W. Then we repeatedly apply the identity to the result, reducing the value of N by 1 each time until we reach N=1. Some expansions yield trivial cases, e.g. if W - N <= 0 (in which case we add a 2**N term to the final result) or if W is greater than the new N. Here is an example: S(7,7) = S(6,7) + S(6,0) = S(6,7) + 64 = (S(5,7) + S(5,1)) + 64 = (S(4,7) + S(4,2)) + (S(4,1) + S(4,0)) + 64 = S(4,7) + S(4,2) + S(4,1) + 80 = (S(3,7) + S(3,3)) + (S(3,2) + S(3,2)) + (S(3,1) + S(3,0)) + 80 = S(3,3) + 2*S(3,2) + S(3,1) + 88 = (S(2,3) + S(2,0)) + 2*(S(2,2) + S(2,0)) + (S(2,1) + S(2,0)) + 88 = S(2,3) + 2*S(2,2) + S(2,1) + 104 = (S(1,3) + S(1,1)) + 2*(S(1,2) + S(1,0)) + (S(1,1) + S(2,0)) + 104 = 2*S(1,1) + 112 = 114 Does this make the technique clear? The actual implementation is mildly clever, but it should be possible to puzzle it out after the explanation above. > But I would have to think more about the algorithm to convince myself > it will always be correct. Or see a written explanation. ('sorry Ben, > I know how you feel about being able figure out what code is supposed > to do just by reading it.) Have I said that? Code can be very obscure, bad enough that it is unintelligible. Maybe I meant that *good* code should be *written* so that it can figured out just from reading. This code, on the other hand, is the programming equivalent of hastily scribbled notes, which is mostly because I wrote it at 5:30 am after the baby woke me up and I had trouble getting back to sleep. -- "...dans ce pays-ci il est bon de tuer de temps en temps un amiral pour encourager les autres." --Voltaire, _Candide_ _______________________________________________ pspp-dev mailing list [email protected] http://lists.gnu.org/mailman/listinfo/pspp-dev
