--- "YKY (Yan King Yin)" <[EMAIL PROTECTED]> wrote:
> Headline: "Employees of a new plan to get Dell back on the road to growth,
> including streamlining management and looking at new methods of distribution
> beyond the computer company's direct-selling model."
>
> Can a baby really learn from THIS ^^^ ?
Yes, but not in the sense of what an adult would learn. Jusczyk showed that babies learn the rules for segmenting adult-level continuous speech into words by age 7-10 months. Hutchens and Alder showed that the rules for segmenting text without spaces can be learned from simple n-gram statistics: essentially, the conditional entropy of a letter (or phoneme) is higher when the context straddles a word boundary. I did some experiments that improved on their work for my dissertation proposal in 1999:
http://cs.fit.edu/~mmahoney/dissertation/lex1.html

I think that you can build lexical, semantic, and syntactic models by training on adult-level text. Google has already demonstrated an effective semantic model. Grammar induction is harder, but I think all the information you need is present in unlabeled text, namely:

- Parts of speech (noun, verb, plus finer distinctions) can be learned by clustering words in a vector space of immediate context. For example, "new plan" and "new methods" tell you that "plan" and "method" play the same grammatical role, "new N".

- Nonterminal symbols (NP, sentence, etc.) can be induced from common sequences of other symbols. For example, "of a new plan" and "on the road" share the structure "PREP ART NP", from which we can introduce a new symbol representing prepositional phrases.

I think grammar is hard for at least three reasons. First, there are tens of thousands of rules, most of which apply only in narrow cases, such as the idiom "on the road to growth". (Why can you substitute "growth" with "ruin" or "recovery", but not "swimming"?) Second, even the few rules that apply broadly, like NP -> ADJ N, are really just guidelines, so what you are really doing is not parsing but finding a compromise among thousands of probabilistic constraints. Third, you have to integrate other constraints from the lexical and semantic levels below and the more abstract levels above.
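The boundary-entropy idea can be sketched in a few lines of Python (a toy illustration, not the code from the dissertation; the training text and threshold below are invented for the example): count character bigrams, then propose a word boundary wherever the conditional entropy of the next character, given the current one, exceeds a threshold.

```python
from collections import defaultdict
import math

def train_bigrams(text):
    """Count, for each character, how often each character follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def cond_entropy(counts, c):
    """Entropy in bits of the next-character distribution after character c."""
    total = sum(counts[c].values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts[c].values())

def segment(text, counts, threshold):
    """Insert a boundary after each character whose successor entropy is high."""
    out = [text[0]]
    for prev, ch in zip(text, text[1:]):
        if cond_entropy(counts, prev) > threshold:
            out.append(' ')
        out.append(ch)
    return ''.join(out)
```

The intuition: within a word the next character is highly predictable (low entropy), while at a word boundary many different characters can follow (high entropy), so entropy peaks mark likely boundaries.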
I gave the example where the parse of "I ate pizza with X" depends on what X is (Bob, gusto, a fork, pepperoni). Eventually you need a world model, but the point at which you add this nonverbal information depends on what you are trying to do. If you are building a robot, then you need to ground the semantic terms in the other senses early, word by word, before learning grammar, as when you show a ball to a baby and say "ball". If you are building a text-only application like a speech recognizer or language translator, then you can do a lot without any grounding. The model only needs to associate "ball" with words like "play", "catch", or "round", which is easy to do using temporal association in a text stream.

Eventually the language model (which is just a probability distribution) will stumble without a world model that includes arithmetic (p("3+1=4") > p("3+2=4")), a spatial model that includes human bodies (p("the man climbed a penny") < p("the man stepped on a penny")), and lots of other common sense information. The real question is how to acquire such knowledge. Some possible sources:

- Existing text on the Internet.
- Human tutoring through a text-based interface.
- Existing video and audio on the Internet.
- A simulated 3-D model of the world.
- Sensory/motor data acquired by a humanoid robot.
- Sensory/motor data acquired by instrumented animals.

The questions are:

- How much data is required?
- How much of the required data is currently available?
- How much will it cost to generate the currently unavailable data?

-- Matt Mahoney, [EMAIL PROTECTED]

-----
This list is sponsored by AGIRI: http://www.agiri.org/email
To unsubscribe or change your options, please go to:
http://v2.listbox.com/member/?member_id=231415&user_secret=fabd7936
