On Thu, Oct 23, 2008 at 12:55 AM, Matt Mahoney wrote:
>
>
> I suppose you are right. Instead of encoding mathematical rules as a grammar, 
> with enough training
> data you can just code all possible instances that are likely to be 
> encountered. For example, instead
> of a grammar rule to encode the commutative law of addition,
>
>  5 + 3 = a + b = b + a = 3 + 5
> a model with a much larger training data set could just encode instances with 
> no generalization:
>
>  12 + 7 = 7 + 12
>  92 + 0.5 = 0.5 + 92
>  etc.
>
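The contrast here can be made concrete with a toy sketch (my illustration, not anything from Matt's models): a single rule covers every instance, while a memorized table only covers the pairs it has already seen.

```python
# Toy contrast: one general rule vs. memorized instances (illustrative only).

def commutes_by_rule(a, b):
    # A single "grammar rule": addition is commutative for all numbers.
    return a + b == b + a

# Brute-force alternative: memorize specific instances from training data.
memorized = {
    (12, 7): True,
    (92, 0.5): True,
}

def commutes_by_lookup(a, b):
    # Only "knows" pairs it has stored; no generalization.
    return memorized.get((a, b))  # None = never seen, no opinion

print(commutes_by_rule(5, 3))      # True for any pair
print(commutes_by_lookup(12, 7))   # True (memorized)
print(commutes_by_lookup(5, 3))    # None (unseen instance)
```

With enough training data the lookup table approximates the rule in practice, which is the point being made.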
> I believe this is how Google gets away with brute force n-gram statistics 
> instead of more sophisticated grammars. Its language model is probably 
> 10^5 times larger than a human model (10^14 bits vs
> 10^9 bits). Shannon observed in 1949 that random strings generated by n-gram 
> models of English
> (where n is the number of either letters or words) look like natural language 
> up to length 2n. For a
> typical human-sized model (1 GB of text), n is about 3 words. To model strings 
> longer than 6 words we
> would need more sophisticated grammar rules. Google can model 5-grams (see
> http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
>  ), so it is able to
> generate and recognize (thus appear to understand) sentences up to about 10 
> words.
>


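Matt's point about n-gram generation is easy to demonstrate in miniature (a toy word-level model over a few sentences, not anything like Google's pipeline -- real systems add smoothing and train on billions of words):

```python
import random
from collections import defaultdict

# Toy word-level n-gram text generator with context length n=2 (a trigram
# model). Generated strings look locally fluent over windows of about 2n
# words, as in Shannon's observation, but have no longer-range structure.

corpus = ("the cat sat on the mat and the cat saw the dog "
          "and the dog sat on the mat").split()

n = 2  # context length
model = defaultdict(list)
for i in range(len(corpus) - n):
    context = tuple(corpus[i:i + n])
    model[context].append(corpus[i + n])

random.seed(0)
output = ["the", "cat"]
for _ in range(8):
    choices = model.get(tuple(output[-n:]))
    if not choices:
        break  # context never seen in training data
    output.append(random.choice(choices))
print(" ".join(output))
```

Scaling the same scheme to 5-grams over a web-sized corpus is what lets the model appear fluent over roughly 10-word spans.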
Gigantic databases are indeed Google's secret sauce.
See:
<http://googleresearch.blogspot.com/2008/09/doubling-up.html>

Quote:
Monday, September 29, 2008   Posted by Franz Josef Och

Machine translation is hard. Natural languages are so complex and have
so many ambiguities and exceptions that teaching a computer to
translate between them turned out to be a much harder problem than
people thought when the field of machine translation was born over 50
years ago. At Google Research, our approach is to have the machines
learn to translate by using learning algorithms on gigantic amounts of
monolingual and translated data. Another knowledge source is user
suggestions. This approach allows us to constantly improve the
quality of machine translations as we mine more data and
get more and more feedback from users.

A nice property of the learning algorithms that we use is that they
are largely language independent -- we use the same set of core
algorithms for all languages. So this means if we find a lot of
translated data for a new language, we can just run our algorithms and
build a new translation system for that language.

As a result, we were recently able to significantly increase the number of
languages on translate.google.com. Last week, we launched eleven new
languages: Catalan, Filipino, Hebrew, Indonesian, Latvian, Lithuanian, Serbian,
Slovak, Slovenian, Ukrainian, Vietnamese. This increases the
total number of languages from 23 to 34.  Since we offer translation
between any of those languages this increases the number of language
pairs from 506 to 1122 (well, depending on how you count simplified
and traditional Chinese you might get even larger numbers).
---------
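The pair arithmetic in that quote checks out: with n languages, each can be translated to the other n - 1, so the number of ordered pairs is n * (n - 1).

```python
# Ordered language pairs: each of n languages translates to n - 1 others.
def language_pairs(n):
    return n * (n - 1)

print(language_pairs(23))  # 506
print(language_pairs(34))  # 1122
```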


BillK


-------------------------------------------
agi
Archives: https://www.listbox.com/member/archive/303/=now
RSS Feed: https://www.listbox.com/member/archive/rss/303/