[ https://issues.apache.org/jira/browse/JOSHUA-341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794381#comment-16794381 ]
Thamme Gowda commented on JOSHUA-341:
-------------------------------------

Here is another handy tool to consider: [https://github.com/isi-nlp/uroman]

It uses Unicode tables and rules to transliterate words in non-Roman scripts into Roman script (no training needed).

(Sorry, yet another Perl script, but *sometimes/most-times* this is all we need.)

> Integrated Transliteration
> --------------------------
>
>                 Key: JOSHUA-341
>                 URL: https://issues.apache.org/jira/browse/JOSHUA-341
>             Project: Joshua
>          Issue Type: Task
>          Components: core, language packs
>            Reporter: Tommaso Teofili
>            Priority: Major
>              Labels: gsoc2019
>
> Many of the language packs released translate from languages with non-Latin
> scripts. Words that cannot be translated are therefore passed through
> unchanged to the translation and cannot even be read by someone who doesn't
> know that script.
> At the same time, many untranslatable words are simply transliterated words.
> For example, an Arabic word might be an English word (like a name or
> technical term) that has simply been written in Arabic script. These words
> can be transliterated back. It would be good to add built-in transliteration
> models that can be applied to all out-of-vocabulary words and enabled for
> certain languages. Transliteration models can be built over the same bitext
> using techniques like those of Sajjad, Fraser, and Schmid (2012) [1].
> [1]: http://www.anthology.aclweb.org/P/P12/P12-1049.pdf

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)