On 08/06/2010 07:47 PM, Michael Galvez wrote:

> 3. We acquire dictionaries on limited licenses from other parties. In
> general, while we can surface this content on our own sites (e.g., Google
> Translate, Google Dictionary, Google Translator Toolkit), we don't have
> permission to donate that data to other sites.
Google, like any large company, uses many sources. For example, Google Maps used to buy all its maps, but later started to drive around to build its own maps (and street images). With time, I'm certain you will use Google Books as a parallel corpus and derive translations of words and phrases from translated books, and some day you might be able to build Google Translate without relying on external dictionary sources. I don't know if this is one month or one year away, but it should take less than one decade. In expectation of this development, you could keep collaboration with open-content movements, such as Wikipedia/Wiktionary, in mind.

> For HTML files, both Translate and Translator Toolkit support the tag
>
> class="notranslate"
>
> to exclude text from translation.
> (http://translate.google.com/support/toolkit/bin/answer.py?hl=en&answer=147838)
>
> If you tell us what MediaWiki tags you'd like for us to treat the same way,
> we can do the same for Wikipedia.

There is no such tag, unfortunately. But in the GTTK user interface, it would be useful to have a way to mark where in the original text (left-hand side) those tags should have been. If it is any help to the pretranslator, other kinds of marks could also be added manually, such as whether a phrase is a figure of speech or should be read literally. If the text says "kill two birds with one stone", that should be translated into Swedish as "hit two flies with one swat". But if David slays Goliath with a stone, that should remain a stone.

> a. If we find a translation for that segment in the TM, we will
> "pre-translate" the segment with the highest-rated translation.

But when you have two or more candidates, each with a reasonable probability, the choice could be presented to the human translator.

> 1. When a translator uploads a Wikipedia article into Translator Toolkit,
> we divide the article into segments (sentences, section headings, etc.).
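As a rough illustration of the segmentation and pre-translation steps quoted above (the TM entries, scores, and the `margin` threshold here are all invented for the sketch, not Google's actual implementation), something like this would let the tool hand ambiguous segments back to the human translator instead of silently picking one:

```python
import re

# Hypothetical translation memory: segment -> list of (translation, score).
# All entries and scores are invented for illustration.
TM = {
    "== History ==": [("== Historia ==", 0.95)],
    "He was a British colonel.": [
        ("Han var en brittisk överste.", 0.62),
        ("Han var brittisk överste.", 0.58),
    ],
}

def segments(article_text):
    """Split a wiki article into rough segments: headings and sentences."""
    for line in article_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("=="):
            yield line                       # a section heading is one segment
        else:
            for sent in re.split(r"(?<=[.!?])\s+", line):  # naive sentence split
                if sent:
                    yield sent

def pretranslate(segment, margin=0.1):
    """Return the single best TM match, or the whole list of close
    candidates so a human translator can choose between them."""
    candidates = sorted(TM.get(segment, []), key=lambda c: -c[1])
    if not candidates:
        return None                          # leave the segment untranslated
    best_score = candidates[0][1]
    close = [t for t, s in candidates if best_score - s < margin]
    return close[0] if len(close) == 1 else close
```

With these toy scores, `pretranslate("He was a British colonel.")` returns both candidates, since they lie within the margin of each other, while the heading gets its single high-confidence translation directly.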
This means you do recognize some wiki markup, such as [[links]] and ==headings==. But recognition of that markup is apparently hard-wired and takes place before any learning.

Now, consider the case when

'''John Doe''' (May 1, 1733 - April 5, 1799) was a British colonel

is translated, according to our manual of style, as:

'''John Doe,''' född 1 maj 1733, död 5 april 1799, var en brittisk överste

where the parentheses are replaced with commas and the words "född" (born) and "död" (died) have been added. It would be nice if the translation memory could learn not only the words (colonel = överste) but also this transformation of style. It is very context sensitive (this example applies only to the opening paragraph of biographic articles) and would need lots of translations to give good results. And including dashes, commas, and parentheses along with words as the elements of translated phrases is perhaps a major shift in what machine translation is supposed to do. (But it could open the door to translating template calls.)

> Following interwiki links and suggesting parent categories is a bit of work
> and unlikely to be implemented soon. We can disable category translation if
> that helps - can you confirm if that's OK?

I think you should keep it as it is, until you get around to doing that "bit of work".

-- 
Lars Aronsson ([email protected])
Aronsson Datateknik - http://aronsson.se

_______________________________________________
foundation-l mailing list
[email protected]
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
