In fact, it comes down to how to handle XML markup in the text to translate.
Obviously this is not trivial, and there is no good solution so far.
If I am not mistaken, this issue has been tackled in various places:
- casmacat project / matecat_util project
- web translation, in the translate.cgi script in contrib/web (which
handles HTML tags)
I found old posts from Philipp saying that the exact location of the markup
could be determined from the phrase and word alignment.
Well, if I am not mistaken, the phrase alignment alone is not enough.
Let's take this example:
<g id="1">Les banques de la zone <g id="2">euro</g> sont soumises :</g>
Only the word "euro" is tagged with g2.
If "zone euro" is translated as a single phrase to "euro zone", then Moses
will only send back the phrase alignment. So we lose the word-level
information and we cannot tell which word of "euro zone" should carry the
g2 tag.
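
To make the loss concrete, here is a minimal Python sketch. The token indices
and both alignments are written by hand for this sentence (they are my
assumption, not actual Moses output), and project_span / phrase_links are
hypothetical helpers, not part of Moses.

```python
# Source tokens: 0 Les 1 banques 2 de 3 la 4 zone 5 euro 6 sont 7 soumises 8 :
# Target tokens: 0 banks 1 in 2 the 3 euro 4 zone 5 are 6 subject 7 :

def project_span(src_start, src_end, links):
    """Map an inclusive source token span to the smallest target span
    covering every target token linked to it. Returns None if unaligned."""
    targets = [t for s, t in links if src_start <= s <= src_end]
    if not targets:
        return None
    return min(targets), max(targets)

def phrase_links(phrases):
    """With phrase alignment only, every source word of a phrase may
    correspond to every target word of that phrase."""
    links = []
    for (s0, s1), (t0, t1) in phrases:
        for s in range(s0, s1 + 1):
            for t in range(t0, t1 + 1):
                links.append((s, t))
    return links

# Hand-written word alignment (source index, target index).
word_links = [(1, 0), (2, 1), (3, 2), (4, 4), (5, 3), (6, 5), (7, 6), (8, 7)]

# Hand-written phrase segmentation: (source span, target span).
phrases = [((0, 1), (0, 0)), ((2, 3), (1, 2)), ((4, 5), (3, 4)),
           ((6, 7), (5, 6)), ((8, 8), (7, 7))]

G2 = (5, 5)  # the g2 tag covers only the source word "euro"

word_result = project_span(G2[0], G2[1], word_links)
phrase_result = project_span(G2[0], G2[1], phrase_links(phrases))

print(word_result)    # (3, 3): exactly the target word "euro"
print(phrase_result)  # (3, 4): somewhere inside "euro zone", ambiguous
```

With word links the tag lands on "euro" alone; with phrase links the best we
can say is that it belongs somewhere inside "euro zone".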
I guess that, to work around this, someone tried to recover the word
alignment using ibm1align.py in
https://github.com/casmacat/moses-mt-server/tree/master/code/tags4moses
BUT it requires the lexical vocabulary files, which to me is not a good
solution since it must hamper performance a lot.
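
For reference, this is roughly what I understand by recovering word links
inside a phrase pair from a lexical table. It is only a sketch of the general
IBM Model 1 style idea: I have not checked that ibm1align.py works exactly
this way, and the probabilities in lex are made up for the example.

```python
def align_within_phrase(src_words, tgt_words, lex, floor=1e-9):
    """For each target word, link it to the source word of the same phrase
    with the highest lexical probability p(target | source).
    lex maps (source_word, target_word) to a probability; unseen pairs
    get a small floor value. Returns (source_pos, target_pos) pairs."""
    links = []
    for j, tgt in enumerate(tgt_words):
        best_i = max(range(len(src_words)),
                     key=lambda i: lex.get((src_words[i], tgt), floor))
        links.append((best_i, j))
    return links

# Made-up lexical probabilities, for illustration only.
lex = {
    ("zone", "zone"): 0.8,
    ("euro", "euro"): 0.9,
    ("zone", "euro"): 0.05,
    ("euro", "zone"): 0.05,
}

links = align_within_phrase(["zone", "euro"], ["euro", "zone"], lex)
print(links)  # [(1, 0), (0, 1)]: source "euro" links to target "euro"
```

The lookup itself is cheap, but the whole lexical table has to be loaded and
kept next to the decoder, which is the cost I am worried about.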
In the end, I don't really see any good solution short of making Moses
XML-aware.
Big debate?
On 16/09/2015 17:30, Vincent Nguyen wrote:
I am struggling with a pipeline .....
Here is the text1.txt file I would like to translate from FR to EN
<g id="1">Les banques de la zone euro sont soumises :</g>
<g id="1">au ratio de capital lié à la détention d’actifs risqués
(nous nous intéressons ici au crédit) ;</g>
<g id="1">au ratio de levier, qui détermine le capital règlementaire à
partir de la taille du bilan de la banque ;</g>
<g id="1">au ratio de liquidité, qui impose aux banques de détenir en
particulier des portefeuilles importants de titres publics.</g>
The following commands run without errors:
/home/moses/mosesdecoder/scripts/tokenizer/normalize-punctuation.perl \
    fr < text1.txt > text2.txt
/home/moses/matecat/matecat_util/code/tokenizer/deescape-special-chars.perl \
    < text2.txt > text3.txt
/home/moses/matecat/matecat_util/code/tokenizer/tokenizer.perl -X -a \
    -l fr < text3.txt > text4.txt
/home/moses/mosesdecoder/scripts/recaser/truecase.perl --model \
    /home/moses/working/truecaser/truecase-model.1.fr < text4.txt > text5.txt
/home/moses/mosesdecoder/bin/moses -f \
    /home/moses/working/tuning/moses.tuned.ini.1 < text5.txt > text6.txt
Then in text6.txt I get:
<g id="1"> banks in the euro zone are subject :</g>
<g id="1"> ratio of capital linked to the detention of risky assets (
we are here to credit ;</g> )
<g id="1"> the leverage ratio , which determines the regulatory
capital from the size of the balance sheet of the bank ;</g>
<g id="1"> ratio of liquidity , which requires banks to hold
especially important portfolios of securities .</g> public
But then neither the detokenizer nor the detruecaser gives me the correct
output. For example, "banks" does not get its uppercase B.
I also tried to look at this
https://github.com/christianbuck/matecat_util/tree/master/python_server
or this
https://github.com/christianbuck/matecat_util/tree/master/code/tags4moses
but no luck.
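
For the casing problem alone, a small post-processing step could capitalize
the first real word after any leading tags. The sketch below is a workaround
I am assuming would be acceptable, not a replacement for the Moses
detruecaser; capitalize_after_tags is a hypothetical helper and it does
nothing about the misplaced closing tags.

```python
import re

# Any number of leading tags (with optional whitespace), then the first
# non-space character of the actual text.
LEADING = re.compile(r"^((?:\s*<[^>]+>)*\s*)(\S)")

def capitalize_after_tags(line):
    """Uppercase the first character that follows the leading tags."""
    return LEADING.sub(lambda m: m.group(1) + m.group(2).upper(),
                       line, count=1)

fixed = capitalize_after_tags(
    '<g id="1"> banks in the euro zone are subject :</g>')
print(fixed)  # <g id="1"> Banks in the euro zone are subject :</g>
```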
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support