On Thu, Oct 23, 2008 at 12:55 AM, Matt Mahoney wrote:
>
> I suppose you are right. Instead of encoding mathematical rules as a grammar,
> with enough training data you can just code all possible instances that are
> likely to be encountered. For example, instead of a grammar rule to encode
> the commutative law of addition,
>
>   5 + 3 = a + b = b + a = 3 + 5
>
> a model with a much larger training data set could just encode instances with
> no generalization:
>
>   12 + 7 = 7 + 12
>   92 + 0.5 = 0.5 + 92
>   etc.
>
> I believe this is how Google gets away with brute-force n-gram statistics
> instead of more sophisticated grammars. Its language model is probably
> 10^5 times larger than a human model (10^14 bits vs. 10^9 bits). Shannon
> observed in 1949 that random strings generated by n-gram models of English
> (where n is the number of either letters or words) look like natural language
> up to length 2n. For a typical human-sized model (1 GB of text), n is about
> 3 words. To model strings longer than 6 words we would need more
> sophisticated grammar rules. Google can model 5-grams (see
> http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
> ), so it is able to generate and recognize (and thus appear to understand)
> sentences up to about 10 words.
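To make Shannon's point concrete, here is a minimal word-level n-gram sketch of my own (nothing below is from Matt's post or from Google's actual system; the toy corpus and parameters are made up). Text sampled from such a model tends to read fluently over windows of roughly 2n words and then drifts:

```python
import random
from collections import defaultdict

def train_ngram(words, n):
    """Map each (n-1)-word context to the words observed after it."""
    model = defaultdict(list)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        model[context].append(words[i + n - 1])
    return model

def generate(model, n, length, seed=0):
    """Repeatedly sample a random continuation of the last n-1 words;
    restart from a random context at a dead end."""
    rng = random.Random(seed)
    out = list(rng.choice(list(model.keys())))
    while len(out) < length:
        followers = model.get(tuple(out[-(n - 1):]))
        if not followers:                      # unseen context: restart
            out.extend(rng.choice(list(model.keys())))
            continue
        out.append(rng.choice(followers))
    return " ".join(out[:length])

# Toy corpus; any overlapping trigrams (2-word contexts) are memorized,
# with no grammar rules at all -- just stored instances, as Matt describes.
corpus = ("the cat sat on the mat and the dog sat on the rug "
          "and the cat saw the dog and the dog saw the cat").split()
model = train_ngram(corpus, n=3)
print(generate(model, n=3, length=20))
```

Any 5-6 word window of the output looks like English; longer stretches wander, which is exactly the 2n horizon Matt is pointing at.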
Gigantic databases are indeed Google's secret sauce. See:
<http://googleresearch.blogspot.com/2008/09/doubling-up.html>

Quote:
Monday, September 29, 2008
Posted by Franz Josef Och

Machine translation is hard. Natural languages are so complex and have so many ambiguities and exceptions that teaching a computer to translate between them turned out to be a much harder problem than people thought when the field of machine translation was born over 50 years ago.

At Google Research, our approach is to have the machines learn to translate by using learning algorithms on gigantic amounts of monolingual and translated data. Another knowledge source is user suggestions. This approach allows us to constantly improve the quality of machine translations as we mine more data and get more and more feedback from users.

A nice property of the learning algorithms that we use is that they are largely language independent -- we use the same set of core algorithms for all languages. So this means if we find a lot of translated data for a new language, we can just run our algorithms and build a new translation system for that language.

As a result, we were recently able to significantly increase the number of languages on translate.google.com. Last week, we launched eleven new languages: Catalan, Filipino, Hebrew, Indonesian, Latvian, Lithuanian, Serbian, Slovak, Slovenian, Ukrainian, Vietnamese. This increases the total number of languages from 23 to 34. Since we offer translation between any of those languages, this increases the number of language pairs from 506 to 1122 (well, depending on how you count simplified and traditional Chinese you might get even larger numbers).

---------
BillK
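P.S. The pair arithmetic in the post checks out: with n languages there are n * (n - 1) ordered source/target pairs. A quick check (my own sketch, not anything from Google):

```python
# Ordered language pairs for n languages: n * (n - 1)
for n in (23, 34):
    print(n, "languages ->", n * (n - 1), "pairs")
# 23 languages -> 506 pairs
# 34 languages -> 1122 pairs
```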
