On Thu, 6 Sep 2007 09:28:40 -0400 (EDT) "Leichter, Jerry" <[EMAIL PROTECTED]> wrote:
> | Hi Martin,
> |
> | I did forget to say that it would be salted so that throws it off by
> | 2^12
> |
> | A couple of questions.  How did you come up with the ~2.5 bits per
> | word?  Would a longer word have more bits?
>
> He misapplied an incorrect estimate! :-)  The usual estimate - going
> back to Shannon's original papers on information theory, actually - is
> that natural English text has about 2.5 (I think it's usually given as
> 2.4) bits of entropy per *character*.  There are several problems
> here:

It's less than that.  See, for example, the bottom of the first page of
http://www.cs.brown.edu/courses/cs195-5/extras/shannon-1951.pdf :

	From this analysis it appears that, in ordinary literary
	English, the long range statistical effects (up to 100 letters)
	reduce the entropy to something of the order of one bit per
	letter, with a corresponding redundancy of roughly 75%.  The
	redundancy may be still higher when structure extending over
	paragraphs, chapters, etc. is included.

>	- The major one is that the estimate should be for
>	  *characters*, not *words*.  So the number of bits of entropy
>	  in a 55-character phrase is about 137 (132, if you use
>	  2.4 bits/character), not 30.
>
>	- The minor one is that the English entropy estimate looks
>	  just at letters and spaces, not punctuation and
>	  capitalization.  So it's probably low anyway.  However, this
>	  is a much smaller effect.

The interesting question is whether or not one can effectively
enumerate candidate phrases for a guessing program.  For that problem,
punctuation and capitalization are important.


		--Steve Bellovin, http://www.cs.columbia.edu/~smb

---------------------------------------------------------------------
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]
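The arithmetic in the thread can be sketched quickly.  This is a minimal
Python illustration (the function name `passphrase_bits` is my own, not
from the thread); it just multiplies phrase length by the assumed
per-character entropy rate, using the three rates discussed above: the
textbook 2.5 and 2.4 bits/character figures and Shannon's long-range
estimate of roughly 1 bit/letter.

```python
# Rough passphrase-entropy estimates under different per-character rates.
# Rates taken from the thread: 2.5 and 2.4 bits/char (the usual textbook
# figures) and ~1.0 bit/char (Shannon's 1951 long-range estimate).

def passphrase_bits(length: int, bits_per_char: float) -> float:
    """Estimated entropy (bits) of a natural-language phrase of `length`
    characters, assuming a uniform per-character entropy rate."""
    return length * bits_per_char

if __name__ == "__main__":
    for rate in (2.5, 2.4, 1.0):
        print(f"{rate} bits/char x 55 chars = "
              f"{passphrase_bits(55, rate):.1f} bits")
```

At 2.5 bits/character a 55-character phrase gives 137.5 bits, matching
the "about 137" figure quoted above; at Shannon's ~1 bit/letter the same
phrase carries only about 55 bits, which is why the long-range estimate
matters so much for guessing attacks.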
