--- "J Storrs Hall, PhD" <[EMAIL PROTECTED]> wrote:
> On Tuesday 10 July 2007 03:07:40 pm Matt Mahoney wrote:
>
> > ... I wanted to use 1 GB of text because it is the amount of language
> > that the average human is exposed to by adulthood.
> > ... My main objection is using a 100 MB data
> > set, which is equivalent to the language model of a 2-3 year old child.
>
> Hmmph. I could do a hell of a lot better than 1 bpc on what I heard up to the
> age of 3... "Say 'Thank you!'" 25,000 times, "Go potty?" 10,000 times, "Wash
> your hands and come to dinner!" 3000 times, "Brush your teeth!" another 3000,
> etc ad nauseam...

Shannon's estimate of 1 bit per character was for adult-level written English,
and was about the same for books like "Jefferson the Virginian" and for
aircraft technical manuals. When I researched this, I found that models based
on normal conversational speech had about half the entropy of either written
English or prepared broadcast speech (e.g. news).

-- Matt Mahoney, [EMAIL PROTECTED]
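
To make the "bits per character" figures above concrete, here is a minimal sketch that measures the compressed size of a toy corpus built from the repeated phrases quoted earlier. Everything in it is an assumption for illustration: zlib stands in for a real language model, the function name bits_per_character is made up, and the phrases are laid out in blocks rather than interleaved as a child would hear them. It only shows that heavily repeated text compresses far below 1 bpc; it says nothing about natural conversational speech.

```python
import zlib


def bits_per_character(text: str, level: int = 9) -> float:
    """Compressed size in bits divided by the number of characters.

    This is only an upper bound on the entropy rate as seen by zlib,
    which is a much weaker model than a purpose-built text compressor.
    """
    compressed = zlib.compress(text.encode("utf-8"), level)
    return 8 * len(compressed) / len(text)


# Phrase counts taken from the quoted message; the ordering (all copies of
# one phrase, then the next) is an illustrative simplification.
phrases = {
    "Say 'Thank you!'": 25000,
    "Go potty?": 10000,
    "Wash your hands and come to dinner!": 3000,
    "Brush your teeth!": 3000,
}

corpus = "\n".join(
    phrase for phrase, count in phrases.items() for _ in range(count)
)

bpc = bits_per_character(corpus)
print(f"{len(corpus)} characters, {bpc:.4f} bits per character")
```

A general-purpose compressor like zlib would typically land well above 1 bpc on ordinary written English, so the comparison is between kinds of input, not a claim about the best achievable rate.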
