Dear Ksenia,

Thanks for your message. The ideas page is not too useful, I know, 
because it is not completely written, but I will try to describe here 
the main goal of the work. The task is quite open as you can see.

There are weekly dumps. Each dump contains data for a language pair in 
various formats. If you take a look at the corpora for a particular 
language pair you will see triplets containing the original text 
segment, the machine-translated segment (marked "mt") and the corrected 
segment (marked "user") in TMX files.
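To give an idea of what reading those triplets could look like, here is a minimal Python sketch. Note that the use of a `creationid` attribute on `<tuv>` to distinguish the "mt" and "user" variants is an assumption on my part; check the actual dumps for the exact markup and adjust the lookup accordingly.

```python
# Sketch: extract (source, mt, user) triplets from a TMX dump.
# Assumption: the MT and postedited variants are distinguished by a
# "creationid" attribute on <tuv>; the real dumps may mark this
# differently.
import xml.etree.ElementTree as ET

def extract_triplets(tmx_text, src_lang):
    triplets = []
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        source = mt = user = None
        for tuv in tu.iter("tuv"):
            seg = tuv.findtext("seg", default="")
            lang = tuv.get("{http://www.w3.org/XML/1998/namespace}lang")
            if lang == src_lang:
                source = seg
            elif tuv.get("creationid") == "mt":
                mt = seg
            elif tuv.get("creationid") == "user":
                user = seg
        # Keep only complete triplets.
        if source and mt and user:
            triplets.append((source, mt, user))
    return triplets
```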

As far as I know (but we can always ask, or someone from Wikimedia 
content translation (CX) subscribed to this list can tell us), not all 
of these are generated from Apertium, but some are. Also, we know they 
are not using the latest available versions of Apertium, but some 
earlier stable version.

Also, edits can be incomplete (the text segment may not have been 
completely postedited) or may go beyond translation postediting (to add 
information, etc.). This should be checked.

The idea is to process the data to obtain information from these 
triplets that may be used to improve the Apertium language pair.

This involves:

(0) Checking whether the machine-translated segment can be reproduced 
from the source using the current version of that language pair. If 
they match, this could make things easier. If not, the data can still 
be used.
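As a rough illustration of step (0): something like the sketch below could re-translate the source and compare. The `apertium en-es` command-line invocation and the pair name are assumptions (adapt them to your installation and language pair); the comparison normalises whitespace and case so trivial differences do not count as mismatches.

```python
# Sketch for step (0): check whether re-translating the source with the
# current Apertium pair reproduces the "mt" segment found in the dump.
import subprocess

def retranslate(segment, pair="en-es"):
    # Placeholder: calls the apertium command-line tool on one segment.
    # The pair name "en-es" is just an example.
    out = subprocess.run(["apertium", pair], input=segment,
                         capture_output=True, text=True)
    return out.stdout.strip()

def matches(dump_mt, fresh_mt):
    # Normalise whitespace and case before comparing, so that trivial
    # tokenisation differences are not reported as mismatches.
    norm = lambda s: " ".join(s.lower().split())
    return norm(dump_mt) == norm(fresh_mt)
```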

(1) Aligning the source and the machine-translated segment. One 
possible way is to translate all possible segments of one word, two 
words, three words, etc., and then look for them in the 
machine-translated output. There is (Python, I believe) code from 
previous GSoC or Google Code-in editions that I could dig out to do 
this. I even have some code that may be useful.
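The n-gram idea in step (1) can be sketched as follows. The `translate()` function here is a toy dictionary lookup standing in for a real call to Apertium, purely for illustration:

```python
# Sketch for step (1): align source and MT segments by translating every
# short source n-gram and searching for the result in the MT output.
# TOY_DICT and translate() are illustrative stand-ins for Apertium.
TOY_DICT = {"the": "el", "white": "blanco", "house": "casa",
            "white house": "casa blanca"}

def translate(phrase):
    return TOY_DICT.get(phrase)

def align_ngrams(source, mt, max_len=3):
    words = source.lower().split()
    mt_lower = mt.lower()
    pairs = []
    # Try all n-grams of 1..max_len source words.
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            trans = translate(phrase)
            # Keep the pair if its translation appears in the MT output.
            if trans and trans in mt_lower:
                pairs.append((phrase, trans))
    return pairs
```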

(2) Aligning the machine-translated and the postedited segment. One 
possibility is to use the Levenshtein edit distance, and then "phrase 
pair extraction" as it is done in phrase-based machine translation. I 
can also dig out some Python code that may be around to do that.
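A minimal sketch of the Levenshtein part of step (2), at the word level: the backtrace recovers which MT words were kept, substituted, inserted or deleted, which is the raw material for phrase pair extraction.

```python
# Sketch for step (2): word-level Levenshtein alignment between the MT
# segment and the postedited segment.
def word_alignment(mt, post):
    a, b = mt.split(), post.split()
    # DP table of edit distances between prefixes of a and b.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i-1] == b[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + cost)
    # Backtrace to recover the edit operations.
    ops, i, j = [], len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (a[i-1] != b[j-1]):
            ops.append(("keep" if a[i-1] == b[j-1] else "sub",
                        a[i-1], b[j-1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            ops.append(("del", a[i-1], None))
            i -= 1
        else:
            ops.append(("ins", None, b[j-1]))
            j -= 1
    return list(reversed(ops))
```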

(3) Obtaining triplets (s, MT(s), t), where s is a source segment of 
one or more words, MT(s) its machine translation using the current 
version of Apertium, and t the corresponding text found in the 
postedited segment. Ideally, these triplets should contain an unknown 
word on the MT side and its correction.
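For step (3), the interesting triplets can be filtered by looking for unknown-word marks: Apertium prefixes words missing from its dictionaries with "*" in its default output, so a simple filter along these lines could be used.

```python
# Sketch for step (3): keep only triplets whose MT side contains an
# unknown word, i.e. a token starting with "*" (Apertium's default
# mark for words missing from its dictionaries).
def unknown_word_triplets(triplets):
    result = []
    for s, mt, t in triplets:
        unknowns = [w for w in mt.split() if w.startswith("*")]
        if unknowns:
            result.append((s, mt, t, unknowns))
    return result
```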

(4) Once you have such triplets for a whole dump, doing some statistics 
(you are going to get quite a few triplets) to mine for the kinds of 
changes that may be useful to produce entries to improve Apertium.
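The statistics in step (4) could start as simply as counting how often each change recurs across the dump, so that frequent, consistent corrections bubble up as candidates; a sketch with `collections.Counter`:

```python
# Sketch for step (4): count how often each (mt_phrase, postedit_phrase)
# change occurs across a whole dump. Frequent, consistent corrections
# are the most promising candidates for new Apertium entries.
from collections import Counter

def change_statistics(phrase_pairs, min_count=2):
    counts = Counter(phrase_pairs)
    # Keep only changes seen at least min_count times, most frequent
    # first.
    return [(pair, n) for pair, n in counts.most_common()
            if n >= min_count]
```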

(5) Researching ways to actually produce possible entries for Apertium 
from these triplets.

I hope you have followed me up to here. If anything is unclear and you 
are still interested in the task, feel free to ask.

Cheers

Mikel

On 22/03/17 at 14:57, Ксения Сухова wrote:
> Hi,
>
> I am interested in «Improving language pairs mining MediaWiki Content 
> Translations postedits» project. Is it possible to get more information about 
> it?
> Thank you.
>
> Sincerely yours,
> Ksenia
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Apertium-stuff mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff

-- 
Mikel L. Forcada  http://www.dlsi.ua.es/~mlf/
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03690 Sant Vicent del Raspeig
Spain
Office: +34 96 590 9776


