Hello everyone, I'm currently working on experiments for my thesis on adding lexical selection to Apertium.
As part of that work I've taken four language pairs (br-fr, eu-es, mk-en, and en-es) and am trying to automatically generate lexical selection rules for them from parallel corpora.

One problem I often come across is that words on the source-language side of the corpus are aligned to target-language words which are not in our bilingual dictionaries, so translations to these words cannot be made with the Apertium MT systems. In some cases the problem is due to inadequate morphological disambiguation, or to translation divergence between source and target. But in other cases it points to a translation that is really missing from the bilingual dictionary. E.g.

  1642 !!!: Missing: siguiente<adj> not found for hurrengo<adj>

The proportion of sentences which have at least one ambiguous word in the Apertium bilingual dictionaries that is found aligned in the parallel corpus is as follows:

  en-es: 22.18%
  eu-es: 11.48%
  mk-en:  7.14%
  br-fr:  5.74%

I've made lists of the most frequent "missing" translations, and put them online here:

  http://xixona.dlsi.ua.es/~fran/missing/

If anyone is working on one of these language pairs and feels like improving the chances of lexical selection for it, then please feel free to add the translations which you think are missing :)

All the best,

Fran
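
PS: for anyone who wants to build a similar list for another pair, the counting step could be sketched roughly as below. This is only a minimal Python sketch, not the scripts used to produce the lists above. The function name `missing_translations`, the dictionary layout (source word mapped to a set of target words) and the toy entries are all assumptions made up for illustration; in practice the dictionary would be read from the pair's bilingual dictionary and the pairs from word-aligned corpus output.

```python
from collections import Counter


def missing_translations(aligned_pairs, bidix):
    """Count aligned (source, target) pairs that the dictionary does not license.

    aligned_pairs: iterable of (source_word, target_word) tuples (hypothetical format).
    bidix: dict mapping a source word to the set of its known target words.
    Source words absent from bidix are treated as having no known translations.
    """
    counts = Counter()
    for src, trg in aligned_pairs:
        if trg not in bidix.get(src, set()):
            counts[(src, trg)] += 1
    return counts


# Toy data, invented for illustration only (not real dictionary contents).
bidix = {
    "hurrengo<adj>": {"próximo<adj>"},
    "etxe<n>": {"casa<n>"},
}
aligned_pairs = [
    ("hurrengo<adj>", "siguiente<adj>"),
    ("hurrengo<adj>", "siguiente<adj>"),
    ("hurrengo<adj>", "próximo<adj>"),
    ("etxe<n>", "casa<n>"),
]

# Print the missing pairs, most frequent first: count, source, target.
for (src, trg), n in missing_translations(aligned_pairs, bidix).most_common():
    print(f"{n}\t{src}\t{trg}")
```

Sorting by frequency puts the pairs most worth adding to the dictionary at the top. A high count can still come from a bad alignment or a disambiguation error, so each candidate needs a human check before it goes in.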
