Travis H. wrote: > Hi, > > I've been wondering about the proper application of statistics with > regard to comparing PRNGs and encrypted text to truly random sources. > > As I understand it, when looking at output, one can take a > hypothetical source model (e.g. "P(0) = 0.3, P(1) = 0.7, all bits > independent") and come up with a probability that the source may have > generated that output. One cannot, however, say what probability such > a source had generated the output, because there is an infinite number > of sources (e.g. "P(0) = 0.29999.., P(1) = 7.000..."). Can one say > that, if the source must be A or B, what probability it actually was A > (and if so, how)?
Short answer: you can say what evidence the observed output provides in favor of A vs B, and if you're willing to start with a prior belief (that A and B are equally likely before the observation, say), then you can calculate the probability after the observation: P(A) / P(B) = P(D | A) / P(D | B) (given that they were equally likely before seeing the data D) Long answer: There are a few philosophies on the correct way to think about such issues. The Bayesian inference approach would simply be to write the probability that a source model is correct (given some data) in terms of the probability that one would observe the data (given that source model) and the probability of the source model being correct, independent of the data (aka the prior), and the probability of the data given any model: P(M1 | D) = P(D | M1) * P(M1) / P(D) M1 = model 1, D = observed data. This equation is Bayes' Theorem, and is trivially derived from from the fact that P(M1 & D) = P(M1 | D) * P(D) = P(D | M1) * P(M1) Depending on the type of data, determining P(D) can be an exercise in philosophical futility, but note that in the ratio of two "posterior" probabilities (the terms on the LHS), the P(D) terms will cancel: P(M1 | D) / P(M2 | D) = P(D | M1) / P(D | M2) * P(M1) / P(M2) Now, if (say) you are using this as a spam filter, you might determine that the prior probability (P(M1)) for one model is different from the prior probability of a different model... for me, the odds that a given email is spam seems to be about 99%, so P(M1)/P(M2) might be about 0.01. For a cryptographic application, well, you'd want to figure out something sensible given the two models. People who dislike Bayesian inference tend to have an issue with the "arbitrary" choice of prior probabilities. Note that the ratio of the likelihoods (P(D | M1) is the likelihood of M1 given the data D) is a measure of the evidence in favor of M1 vs M2, provided the data D. Non-Bayesian approaches tend to use the log of this ratio as a statistic with a chi^2 distribution. Oh, and do beware the number of degrees of freedom in your two models- if they are not equal, you need to do a bit more work that outlined here. -Bill --------------------------------------------------------------------- The Cryptography Mailing List Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]
