Re: [Moses-support] Help on pipeline ....

Vincent Nguyen Thu, 17 Sep 2015 00:40:31 -0700

Well, in fact, it comes down to how to handle XML markup in the text to translate. Obviously this is not trivial, and there is no good solution.


If I am not mistaken, this issue has been tackled in various circumstances:
- casmacat project / matecat_util project
- web translation in the contrib / web translate.cgi script (which handles html tags)

I found old posts from Philipp saying the exact location of markup could be determined with phrase and word alignment .....


well if I am not mistaken :

let's take this example :
<g id="1">Les banques de la zone <g id="2">euro</g> sont soumises :</g>

only the word "euro" is tagged g2

if "zone euro" is translated as a phrase to "euro zone", then Moses will only send back the phrase alignment. So we lose the word-level information and we cannot tag the word "euro" with g2.
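
To make the difference concrete, here is a minimal sketch of projecting the g2 tag onto the target side, once with word alignment and once with phrase alignment only. The alignment points and phrase spans below are made up by hand for this example (they are not real Moses output), and all function names are hypothetical.

```python
def parse_alignment(text):
    """Parse 'src-tgt' index pairs, e.g. '4-4 5-3', into (src, tgt) tuples."""
    return [tuple(int(i) for i in pair.split("-")) for pair in text.split()]


def project_words(src_index, pairs):
    """Target indices aligned to one source token (word-level alignment)."""
    return sorted({t for s, t in pairs if s == src_index})


def project_phrases(src_index, phrases):
    """Target indices of every phrase whose source span covers the token."""
    hits = set()
    for (s_lo, s_hi), (t_lo, t_hi) in phrases:
        if s_lo <= src_index <= s_hi:
            hits.update(range(t_lo, t_hi + 1))
    return sorted(hits)


def wrap(tokens, indices, open_tag, close_tag):
    """Attach the tag pair around the span from the first to the last index."""
    out = list(tokens)
    if indices:
        out[indices[0]] = open_tag + out[indices[0]]
        out[indices[-1]] = out[indices[-1]] + close_tag
    return out


source = "Les banques de la zone euro sont soumises :".split()
target = "banks in the euro zone are subject :".split()
euro = source.index("euro")

# Hand-written word alignment (assumption): "euro" (5) -> "euro" (3).
word_pairs = parse_alignment("1-0 2-1 3-2 4-4 5-3 6-5 7-6 8-7")
# Hand-written phrase alignment (assumption): "zone euro" -> "euro zone".
phrases = [((4, 5), (3, 4))]

word_result = " ".join(
    wrap(target, project_words(euro, word_pairs), '<g id="2">', "</g>"))
phrase_result = " ".join(
    wrap(target, project_phrases(euro, phrases), '<g id="2">', "</g>"))

print(word_result)
print(phrase_result)
```

With the word alignment the tag lands on "euro" alone; with the phrase alignment the best we can do is wrap the whole phrase "euro zone".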

I guess that, to prevent this, someone tried to get the word alignment using ibm1align.py in this: https://github.com/casmacat/moses-mt-server/tree/master/code/tags4moses

BUT it requires the lex vocabulary files, which to me is not a good solution since it must hamper performance a lot.

In the end, I don't really see any good solution without making Moses XML-aware .....


big debate ?




On 16/09/2015 17:30, Vincent Nguyen wrote:

I am struggling with a pipeline .....

Here is the text1.txt file I would like to translate from FR to EN
<g id="1">Les banques de la zone euro sont soumises :</g>
<g id="1">au ratio de capital lié à la détention d’actifs risqués (nous nous intéressons ici au crédit) ;</g>
<g id="1">au ratio de levier, qui détermine le capital règlementaire à partir de la taille du bilan de la banque ;</g>
<g id="1">au ratio de liquidité, qui impose aux banques de détenir en particulier des portefeuilles importants de titres publics.</g>
I am running the following properly :
/home/moses/mosesdecoder/scripts/tokenizer/normalize-punctuation.perl fr < text1.txt > text2.txt
/home/moses/matecat/matecat_util/code/tokenizer/deescape-special-chars.perl < text2.txt > text3.txt
/home/moses/matecat/matecat_util/code/tokenizer/tokenizer.perl -X -a -l fr < text3.txt > text4.txt
/home/moses/mosesdecoder/scripts/recaser/truecase.perl --model /home/moses/working/truecaser/truecase-model.1.fr < text4.txt > text5.txt
/home/moses/mosesdecoder/bin/moses -f /home/moses/working/tuning/moses.tuned.ini.1 < text5.txt > text6.txt
then in my text6.txt I have

<g id="1"> banks in the euro zone are subject :</g>
<g id="1"> ratio of capital linked to the detention of risky assets (we are here to credit ;</g> )
<g id="1"> the leverage ratio , which determines the regulatory capital from the size of the balance sheet of the bank ;</g>
<g id="1"> ratio of liquidity , which requires banks to hold especially important portfolios of securities .</g> public
but then neither the detokenizer nor the detruecaser will give me the correct output.
"banks" will not get the uppercase B
I also tried to look at this https://github.com/christianbuck/matecat_util/tree/master/python_server or this https://github.com/christianbuck/matecat_util/tree/master/code/tags4moses
but no luck.
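
One possible stopgap for the missing capital on "banks" is a small post-processing step. This is only a sketch, and it rests on an assumption I have not verified: that the capital is lost because the line starts with the <g> tag instead of a word. The function name and the regular expression are hypothetical and not part of Moses or matecat_util.

```python
import re

# Any run of leading tags and whitespace, followed by the first word character.
LEADING = re.compile(r"((?:\s*<[^>]+>)*\s*)(\w)")


def capitalize_after_tags(line):
    """Uppercase the first word character that follows any leading XML tags."""
    match = LEADING.match(line)
    if not match:
        return line  # nothing to capitalize, e.g. a line holding only a tag
    pos = match.end(1)
    return line[:pos] + line[pos].upper() + line[pos + 1:]


fixed = capitalize_after_tags(
    '<g id="1"> banks in the euro zone are subject :</g>')
print(fixed)
```

It does not address the tag placement problem itself, only the casing of the first word.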




_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
