----- Original Message ----
From: Richard Loosemore <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, October 24, 2006 12:37:16 PM
Subject: Re: [agi] Language modeling

>Matt Mahoney wrote:
>> Converting natural language to a formal representation requires language 
>> modeling at the highest level.  The levels from lowest to highest are: 
>> phonemes, word segmentation rules, semantics, simple sentences, compound 
>> sentences.  Regardless of whether your child learned to read at age 3 or 
>> not at all, children always learn language in this order.

>And the evidence for this would be what?

Um, any textbook on psycholinguistics or developmental psychology, also the 
paper by Jusczyk I cited earlier.  Ben pointed me to a book by Tomasello which 
I haven't read, but here is a good summary of his work on language acquisition 
in children.
http://email.eva.mpg.de/~tomas/pdf/Mussen_chap_proofs.pdf

I realize that the stages of language learning overlap, but they do not all 
start at the same time.  It is a simple fact that children learn words with 
semantic content like "ball" or "milk" before they learn function words like 
"the" or "of", in spite of the higher frequency of the latter.  Likewise, 
successful language models used for information retrieval ignore function words 
and word order. Furthermore, children learn word segmentation rules before they 
learn words, again consistent with statistical language models.  (The fact that 
children can learn sign language at 6 months is not inconsistent with these 
models: sign language does not have the word segmentation problem.)
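To make the information-retrieval point concrete, here is a minimal sketch of an order-free, bag-of-words matcher that drops function words before scoring.  The stopword list, documents, and query are made up for illustration; real IR systems use larger stopword lists and weighting schemes such as TF-IDF.

```python
from collections import Counter

# Hypothetical stopword list of high-frequency function words.
STOPWORDS = {"the", "of", "a", "to", "and", "in", "is", "on"}

def bag_of_words(text):
    """Tokenize, drop function words, and ignore word order:
    keep only counts of content words."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def score(query, doc):
    """Overlap of content-word counts between query and document."""
    q, d = bag_of_words(query), bag_of_words(doc)
    return sum(min(q[w], d[w]) for w in q)

docs = ["The ball is in the box",
        "A glass of milk on the table"]
query = "milk in a glass"
best = max(docs, key=lambda d: score(query, d))
# best is the milk document: only "milk" and "glass" survive stopword
# removal, and their order relative to the query is irrelevant.
```

Note that the function words "in" and "a" contribute nothing to the match, despite being among the most frequent words in the query.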

We can learn from these observations.  One conclusion I draw is that you can't 
build an AGI and tack on language modeling later.  You have to integrate 
language modeling and train it in parallel with nonverbal skills such as vision 
and motor control, much as a child is trained.  Of course, we won't know 
whether this conclusion holds until it is tried.

Another important question is: how much will this cost?  How much CPU, memory, 
and training data do you need?  Again we can use cognitive models to help 
answer these questions.  According to Tomasello, children are exposed to about 
5000 to 7000 utterances per day, or about 20,000 words.  This is equivalent to 
about 100 MB of text in 3 years.  Children learn to use simple sentences of the 
form (subject-verb-object) and recognize word order in these sentences at about 
22-24 months.  For example, they respond correctly to "make the bunny push the 
horse".  However, such models are word specific.  At about age 3 1/2, children 
can generalize a novel word used in context as a verb to other syntactic 
constructs, e.g. producing transitive sentences after hearing the verb used 
only intransitively.  This is about the state of the art with 
statistical models trained on hundreds of megabytes of text.  Such experiments 
suggest that adult level modeling, which will be needed to interface with 
structured knowledge bases, will require about a gigabyte of training data.
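The 100 MB figure above can be checked with back-of-the-envelope arithmetic.  The 5-bytes-per-word average (including a separating space) is my assumption; the 20,000 words per day comes from Tomasello's figure cited above.

```python
# Rough check of the training-data estimate.
WORDS_PER_DAY = 20_000      # Tomasello: ~5000-7000 utterances/day
DAYS = 3 * 365              # three years of exposure
BYTES_PER_WORD = 5          # assumed average, including a space

total_mb = WORDS_PER_DAY * DAYS * BYTES_PER_WORD / 1e6
# total_mb comes to about 110, i.e. on the order of 100 MB in 3 years
```

The same arithmetic scales the estimate up: roughly an order of magnitude more text, about a gigabyte, corresponds to the longer exposure behind adult-level competence.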

-- Matt Mahoney, [EMAIL PROTECTED]




-----
This list is sponsored by AGIRI: http://www.agiri.org/email
To unsubscribe or change your options, please go to:
http://v2.listbox.com/member/[EMAIL PROTECTED]