In fact, this comes down to how to handle XML markup in the text to be translated. Obviously this is not trivial, and I see no good solution so far.

If I am not mistaken, this issue has been tackled in various places:
- the casmacat project / matecat_util project
- web translation, in the contrib/web translate.cgi script (which handles HTML tags)

I found old posts from Philipp saying that the exact location of the markup could be determined from the phrase and word alignments.

Well, if I am not mistaken, let's take this example:
<g id="1">Les banques de la zone <g id="2">euro</g> sont soumises :</g>

Only the word "euro" is tagged g2.

If "zone euro" is translated as a phrase to "euro zone", then Moses will only send back the phrase alignment. So we lose the information about which target word corresponds to "euro", and we cannot put the g2 tag around it.
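
To make the difference concrete, here is a minimal Python sketch of projecting the g2 span onto the target side. It is only an illustration under assumptions: `project_span` is a hypothetical helper, the alignments are written by hand as (source index, target index) pairs, and nothing here is actual Moses output or the tags4moses code.

```python
def project_span(src_span, alignment):
    """Project an inclusive source token span onto the target side.

    alignment is a list of (source_index, target_index) pairs.
    Returns the smallest target span covering every aligned token,
    or None if no token in the span is aligned.
    """
    start, end = src_span
    targets = [t for s, t in alignment if start <= s <= end]
    if not targets:
        return None
    return min(targets), max(targets)


source = ["zone", "euro"]
target = ["euro", "zone"]
tag_span = (1, 1)  # the g2 tag covers only the source token "euro"

# Word alignment (written by hand here): zone -> zone, euro -> euro.
word_alignment = [(0, 1), (1, 0)]

# Phrase alignment only says "source 0-1 maps to target 0-1", so every
# source token may correspond to every target token in the phrase.
phrase_alignment = [(s, t)
                    for s in range(len(source))
                    for t in range(len(target))]

word_result = project_span(tag_span, word_alignment)
phrase_result = project_span(tag_span, phrase_alignment)
print(word_result)    # (0, 0): the tag lands on "euro" only
print(phrase_result)  # (0, 1): the tag can only wrap the whole phrase
```

With the phrase alignment alone, the best we can do is wrap the whole of "euro zone" in g2, which is not what the source says.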

I guess that, to prevent this, someone tried to get the word alignment using ibm1align.py in https://github.com/casmacat/moses-mt-server/tree/master/code/tags4moses

BUT it requires the lex vocabulary files, which to me is not a good solution since it must hamper performance a lot.

In the end, I don't really see any good solution short of making Moses XML-aware.

A big debate?




On 16/09/2015 17:30, Vincent Nguyen wrote:

I am struggling with a pipeline.

Here is the text1.txt file I would like to translate from FR to EN:
<g id="1">Les banques de la zone euro sont soumises :</g>
<g id="1">au ratio de capital lié à la détention d’actifs risqués (nous nous intéressons ici au crédit) ;</g> <g id="1">au ratio de levier, qui détermine le capital règlementaire à partir de la taille du bilan de la banque ;</g> <g id="1">au ratio de liquidité, qui impose aux banques de détenir en particulier des portefeuilles importants de titres publics.</g>

I am running the following commands, which complete properly:

/home/moses/mosesdecoder/scripts/tokenizer/normalize-punctuation.perl fr < text1.txt > text2.txt
/home/moses/matecat/matecat_util/code/tokenizer/deescape-special-chars.perl < text2.txt > text3.txt
/home/moses/matecat/matecat_util/code/tokenizer/tokenizer.perl -X -a -l fr < text3.txt > text4.txt
/home/moses/mosesdecoder/scripts/recaser/truecase.perl --model /home/moses/working/truecaser/truecase-model.1.fr < text4.txt > text5.txt
/home/moses/mosesdecoder/bin/moses -f /home/moses/working/tuning/moses.tuned.ini.1 < text5.txt > text6.txt

Then in my text6.txt I have:

<g id="1"> banks in the euro zone are subject :</g>
<g id="1"> ratio of capital linked to the detention of risky assets ( we are here to credit ;</g> ) <g id="1"> the leverage ratio , which determines the regulatory capital from the size of the balance sheet of the bank ;</g> <g id="1"> ratio of liquidity , which requires banks to hold especially important portfolios of securities .</g> public

But then neither the detokenizer nor the detruecaser gives me the correct output: "banks" does not get the uppercase B.
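
One possible workaround for the casing part, sketched in Python. It assumes the cause is that the line begins with a tag, so the first real word is no longer the first token the detruecaser sees; I have not verified that. `capitalize_after_tags` is a hypothetical helper, not part of Moses or matecat_util, and it would run as an extra step on the decoder output.

```python
import re

# Any run of tags (and whitespace) at the start of the line, followed by
# the first word character of the sentence.
LEADING_TAGS = re.compile(r'^((?:\s*<[^>]+>)*\s*)(\w)')


def capitalize_after_tags(line):
    """Uppercase the first word character that follows any leading tags."""
    return LEADING_TAGS.sub(
        lambda m: m.group(1) + m.group(2).upper(), line, count=1)


fixed = capitalize_after_tags(
    '<g id="1"> banks in the euro zone are subject :</g>')
print(fixed)
```

This only repairs the sentence-initial capital. It does nothing about tags that end up in the wrong place, such as the stray ")" and "public" after the closing tags in the second line of the output.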


I also tried to look at https://github.com/christianbuck/matecat_util/tree/master/python_server and https://github.com/christianbuck/matecat_util/tree/master/code/tags4moses

but no luck.




_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
