Dear Ksenia,

Thanks for your message. The ideas page is not too useful, I know, because it is not completely written, but I will try to describe here the main goal of the work. The task is quite open, as you can see.
There are weekly dumps. Each dump contains data for a language pair in various formats. If you take a look at the corpora for a particular language pair, you will see triplets containing the original text segment, the machine-translated segment (marked "mt"), and the corrected segment (marked "user" in TMX files). As far as I know (but we can always ask, or someone from Wikimedia Content Translation (CX) subscribed to this list can tell us), not all of these are generated with Apertium, but some are. Also, we know they are not using the latest available versions of Apertium, but some earlier stable version. Also, edits can be incomplete (the text segment has not been completely postedited) or may go beyond translation postediting (to add information, etc.). This should be checked.

The idea is to process the data to obtain information from these triplets that may be used to improve the Apertium language pair. This involves:

(0) Checking whether the machine-translated segment can be obtained from the source using the current version of that language pair. If they match, this could make things easier; if not, the data can still be used.

(1) Aligning the source and the machine-translated segment. One possible way is to translate all possible sub-segments of one word, two words, three words, etc., and then look for them in the machine-translated output. There is (Python, I believe) code from previous GSoC or Google Code-in editions that I could dig out to do this, and I even have some code of my own that may be useful.

(2) Aligning the machine-translated and the postedited segment. One possibility is to use the Levenshtein edit distance, and then "phrase pair extraction" as it is done in phrase-based machine translation. I can also dig out some Python code that may be around to do that.

(3) Obtaining triplets (s, MT(s), t), where s is a source segment of one or more words, MT(s) its machine translation using the current version of Apertium, and t the actual text found in the postedited segment.
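To give you a feel for step (2), here is a minimal sketch in Python using only the standard library. It uses difflib.SequenceMatcher as a stand-in for a real Levenshtein alignment (the two are close cousins but not identical), and the example segments and corrections are invented; a serious version would do proper phrase pair extraction as in phrase-based MT.

```python
# Sketch of step (2): align an MT segment with its postedited version at
# the token level and collect the (mt_phrase, user_phrase) pairs where
# the posteditor changed something.  SequenceMatcher stands in for an
# explicit Levenshtein alignment here.
from collections import Counter
from difflib import SequenceMatcher

def edit_pairs(mt, postedit):
    """Return (mt_phrase, user_phrase) pairs that differ between the two segments."""
    mt_toks, pe_toks = mt.split(), postedit.split()
    sm = SequenceMatcher(a=mt_toks, b=pe_toks, autojunk=False)
    pairs = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":  # 'replace', 'delete' or 'insert'
            pairs.append((" ".join(mt_toks[i1:i2]), " ".join(pe_toks[j1:j2])))
    return pairs

# Aggregating over a whole dump (invented toy data here) gives the kind
# of change statistics mentioned in the later steps.
stats = Counter()
for mt, user in [("this is a example sentence", "this is an example sentence"),
                 ("he go to school", "he goes to school")]:
    stats.update(edit_pairs(mt, user))
print(stats.most_common(5))
```

The output of the Counter is essentially a frequency table of corrections, which is the raw material for mining systematic errors of the language pair.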
Ideally, the triplets in (3) should contain an unknown word on the MT side and its correction.

(4) Once you have such triplets for a whole dump, doing some statistics (you are going to get quite a few) to mine for the kinds of changes that may be useful to produce entries to improve Apertium.

(5) Researching ways to actually produce possible entries for Apertium from these triplets.

I hope you have followed me up to here. If not, feel free to ask if you are still interested in the task.

Cheers

Mikel

On 22/03/17 at 14:57, Ксения Сухова wrote:
> Hi,
>
> I am interested in the «Improving language pairs mining MediaWiki Content
> Translations postedits» project. Is it possible to get more information about
> it?
> Thank you.
>
> Sincerely yours,
> Ksenia

-- 
Mikel L. Forcada          http://www.dlsi.ua.es/~mlf/
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03690 Sant Vicent del Raspeig
Spain
Office: +34 96 590 9776

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
