Hello everyone,

I'm currently working on experiments for my thesis on adding lexical
selection to Apertium.

As part of that work I've taken four language pairs (br-fr, eu-es,
mk-en, and en-es) and am trying to automatically generate lexical
selection rules for them from parallel corpora.

One problem I often come across is that words on the source language
side of the corpus are aligned to target language words which are not
in our bilingual dictionaries, so Apertium MT systems cannot produce
those translations.

In some cases the problem is due to inadequate morphological
disambiguation, or translation divergence between source and target. 

But in other cases, it points to a translation that is genuinely
missing from the bilingual dictionary. For example:

   1642 !!!: Missing: siguiente<adj> not found for hurrengo<adj>
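
For anyone who wants to process lines like the one above, here is a
rough Python sketch. It assumes every line has exactly the shape of
that single example (a leading number, then "!!!: Missing: X not found
for Y"). The function names are made up for illustration, the meaning
of the leading number is not interpreted, and lemmas containing spaces
would not match this pattern.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the single example in this mail:
#   "<number> !!!: Missing: <target> not found for <source>"
MISSING_RE = re.compile(
    r"^\s*(?P<number>\d+)\s+!!!: Missing: "
    r"(?P<target>\S+) not found for (?P<source>\S+)\s*$"
)


def parse_missing(line):
    """Return (number, source, target), or None if the line does not match."""
    m = MISSING_RE.match(line)
    if m is None:
        return None
    return int(m.group("number")), m.group("source"), m.group("target")


def group_by_source(lines):
    """Collect the missing target-side entries for each source-side entry."""
    grouped = defaultdict(list)
    for line in lines:
        parsed = parse_missing(line)
        if parsed is not None:
            _, source, target = parsed
            grouped[source].append(target)
    return dict(grouped)


if __name__ == "__main__":
    sample = ["   1642 !!!: Missing: siguiente<adj> not found for hurrengo<adj>"]
    print(group_by_source(sample))
```

Grouping by the source-side entry makes it easy to see, per word, which
target translations would need adding to the bilingual dictionary.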

The proportion of sentences which have at least one word that is
ambiguous in the Apertium bilingual dictionaries and is found aligned
in the parallel corpus is as follows:

  en-es: 22.18%
  eu-es: 11.48%
  mk-en: 7.14%
  br-fr: 5.74% 

I've made lists of the most frequent "missing" translations, and put
them online here:

   http://xixona.dlsi.ua.es/~fran/missing/

If anyone is working on one of these language pairs and feels like
improving the chances of lexical selection for it, then please feel free
to add the translations which you think are missing :)

All the best,

Fran

