--- "YKY (Yan King Yin)" <[EMAIL PROTECTED]> wrote:
> Headline: "Employees of a new plan to get Dell back on the road to growth,
> including streamlining management and looking at new methods of distribution
> beyond the computer company's direct-selling model."
> Can a baby really learn from THIS ^^^ ?

Yes, but not in the sense of what an adult would learn.  Jusczyk showed that 
babies learn the rules for segmenting adult-level continuous speech into words
by 7-10 months of age.  Hutchens and Alder showed that the rules for segmenting
text without spaces can be learned from simple n-gram statistics: the
conditional entropy of a letter (or phoneme) is higher when the context
straddles a word boundary.  I did some experiments that improved on their work
for my dissertation proposal in 1999. 
http://cs.fit.edu/~mmahoney/dissertation/lex1.html
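The idea can be sketched in a few lines of Python (a toy illustration of the
entropy criterion, not the actual algorithm from either paper): train letter
bigram counts, then place a word boundary after any letter whose successor
distribution has high entropy.  The threshold here is a made-up parameter.

```python
import math
from collections import defaultdict

def train_bigrams(text):
    """Count, for each letter, which letters follow it."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def successor_entropy(counts, ch):
    """Entropy (in bits) of the distribution of letters following ch."""
    total = sum(counts[ch].values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in counts[ch].values())

def segment(text, counts, threshold):
    """Insert a boundary after every letter whose successors are unpredictable."""
    words, start = [], 0
    for i, ch in enumerate(text):
        if successor_entropy(counts, ch) > threshold:
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        words.append(text[start:])
    return words
```

In the toy string "abcabd", the letter "b" is followed by either "c" or "d"
(one bit of entropy), while "a" is always followed by "b" (zero bits), so
boundaries fall after each "b".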

I think that you can build lexical, semantic, and syntactic models by training
on adult level text.  Google has already demonstrated an effective semantic
model.  Grammar induction is harder, but I think all the information you need
is present in unlabeled text, namely:

- parts of speech (noun, verb, plus finer distinctions) can be learned by
clustering words in a vector space of immediate contexts.  For example, "new
plan" and "new methods" tell you that "plan" and "method" play the same
grammatical role, "new N".

- nonterminal symbols (NP, sentence, etc.) can be induced from common sequences
of other symbols.  For example, "of a new plan" and "on the road" share the
common structure "PREP ART NP", from which we can introduce a new symbol
representing prepositional phrases.
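One simple way to do this induction (my sketch, essentially pair-merging over
tag sequences; the symbol name "PP0" is invented) is to repeatedly find the
most frequent adjacent pair of symbols and replace it with a fresh nonterminal:

```python
from collections import Counter

def most_common_pair(sequences):
    """Find the most frequent adjacent pair of symbols across all sequences."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(sequences, pair, new_symbol):
    """Rewrite every occurrence of the pair as the new nonterminal symbol."""
    result = []
    for seq in sequences:
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                merged.append(new_symbol)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        result.append(merged)
    return result
```

Given the tag sequences for "of a new plan" and "on the road", the pair
(PREP, ART) occurs most often, so it becomes the first induced nonterminal;
iterating the process builds up larger phrase symbols.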

I think grammar is hard for at least three reasons.  First, there are tens of
thousands of rules, most of which apply only in narrow cases, such as the idiom
"on the road to growth".  (Why can you substitute "growth" with "ruin" or
"recovery", but not "swimming"?)  Second, even the few rules that apply
broadly, like NP -> ADJ N, are really just guidelines, so what you are really
doing is not parsing but finding a compromise among thousands of
probabilistic constraints.  Third, you have to integrate other constraints
from the lexical and semantic levels below and the more abstract levels above.
I gave the example where the parse of "I ate pizza with X" depends on what X
is (Bob, gusto, a fork, pepperoni).
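The "I ate pizza with X" ambiguity can be resolved by exactly the kind of
probabilistic compromise I mean: compare how strongly X associates with the
verb versus the noun, and attach the "with X" phrase to the stronger head.
The counts below are invented purely for illustration; in practice they would
come from co-occurrence statistics over a large corpus.

```python
# Hypothetical association counts, invented for illustration only: how often
# each head word was seen linked to X in training text.
ASSOC = {
    ("ate", "fork"): 5, ("pizza", "fork"): 1,
    ("ate", "gusto"): 4, ("pizza", "gusto"): 0,
    ("ate", "pepperoni"): 1, ("pizza", "pepperoni"): 6,
}

def attach(verb, noun, x):
    """Attach "with x" to whichever head associates with x more strongly."""
    return "verb" if ASSOC.get((verb, x), 0) > ASSOC.get((noun, x), 0) else "noun"
```

So "with a fork" attaches to "ate" (the instrument of eating), while "with
pepperoni" attaches to "pizza" (a property of the food), even though the two
sentences have identical surface structure.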

Eventually you need a world model, but the point where you add this nonverbal
information depends on what you are trying to do.  If you are building a
robot, then you need to ground the semantic terms to the other senses early,
word by word, before learning grammar, like when you show a ball to a baby and
say "ball".  If you are building a text-only application like a speech
recognizer or language translator, then you can do a lot without any
grounding.  The model only needs to associate "ball" with words like "play" or
"catch" or "round", which is easy to do using temporal association in a text
stream.
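Temporal association of this kind is about the simplest statistic there is: a
sliding-window co-occurrence count.  A minimal sketch (my own toy corpus and
window size):

```python
from collections import Counter

def window_cooccurrence(tokens, window=3):
    """Count unordered word pairs appearing within `window` tokens of each other."""
    cooc = Counter()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), i):
            cooc[tuple(sorted((w, tokens[j])))] += 1
    return cooc
```

Run over text mentioning balls being played with, thrown, and round, the
counts link "ball" to "play" and "round" without any grounding in vision or
touch; the association lives entirely in the text stream.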

Eventually the language model (which is just a probability distribution) will
stumble without a world model that includes arithmetic (p("3+1=4") >
p("3+2=4")), a spatial model that includes human bodies (p("the man climbed a
penny") < p("the man stepped on a penny")), and lots of other common sense
information.  The real question is how to acquire such knowledge.  Some
possible sources:

- Existing text on the Internet.
- Human tutoring through a text based interface.
- Existing video and audio on the Internet.
- A simulated 3-D model of the world.
- Sensory/motor data acquired by a humanoid robot.
- Sensory/motor data acquired by instrumented animals.

The questions are:

- How much data is required?
- How much of the required data is currently available?
- How much will it cost to generate the currently unavailable data?


-- Matt Mahoney, [EMAIL PROTECTED]

-----
This list is sponsored by AGIRI: http://www.agiri.org/email