--- "J Storrs Hall, PhD" <[EMAIL PROTECTED]> wrote:

> On Tuesday 10 July 2007 03:07:40 pm Matt Mahoney wrote:
> 
> > ...  I wanted to use 1 GB of text because it is the amount of language
> > that the average human is exposed to by adulthood.  
> >... .  My main objection is using a 100 MB data 
> > set, which is equivalent to the language model of a 2-3 year old child.
> 
> Hmmph. I could do a hell of a lot better than 1 bpc on what I heard up to the
> age of 3... "Say 'Thank you!'" 25,000 times, "Go potty?" 10,000 times, "Wash
> your hands and come to dinner!" 3000 times, "Brush your teeth!" another 3000,
> etc. ad nauseam...

Shannon's estimate of 1 bit per character was for adult-level written English,
and was about the same for books like "Jefferson the Virginian" and for aircraft
technical manuals.  When I researched this I found that models based on normal
conversational speech had about half the entropy of either written English or
prepared, broadcast speech (e.g. news).
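
As a rough illustration of how much repetition lowers bits per character, here
is a minimal sketch using a general-purpose compressor as an upper bound on
entropy.  It is not the method behind any figure quoted above: zlib is a far
weaker model than the ones under discussion, and the helper name bits_per_char
and both sample strings are made up for the example.

```python
import zlib


def bits_per_char(text: str) -> float:
    """Compressed size in bits divided by character count.

    This is only an upper bound on the entropy of the text, and a loose one,
    since zlib is a simple general-purpose compressor.
    """
    data = text.encode("utf-8")
    return 8 * len(zlib.compress(data, 9)) / len(text)


# Hypothetical sample: a few phrases repeated many times (counts are arbitrary).
repetitive = (
    "Say 'Thank you!' " * 250
    + "Go potty? " * 100
    + "Brush your teeth! " * 30
)

# Hypothetical sample: a short passage of ordinary, non-repeating prose.
varied = (
    "The committee met on Thursday to review the proposed changes to the "
    "maintenance schedule, but several members objected that the estimates "
    "had been prepared without consulting the engineers who would carry out "
    "the work, and the vote was postponed until the following month."
)

print(f"repetitive: {bits_per_char(repetitive):.2f} bpc")
print(f"varied:     {bits_per_char(varied):.2f} bpc")
```

The repeated phrases should come out well under 1 bpc even with this crude
compressor, while the short varied passage comes out far higher, mostly
because zlib has so little context to work with on a sample that small.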



-- Matt Mahoney, [EMAIL PROTECTED]
