Dear support list

I am a PhD student in machine translation, and as part of my research I would 
like to compare some alignment tools such as MGIZA++, fast_align, and efmaral. 
The easiest format for me to use would be the Pharaoh format, i.e. the format 
where the source and target token indices are separated by a dash. For instance:

0-0 1-1 2-2 2-3 3-4 4-5

However, after running (M)GIZA++, I can only find the extended format in the 
A3 tables, which look like this:

# Sentence pair (1236078) source length 12 target length 17 alignment score : 1.07364e-29
zorg ervoor dat het voer of het extraatje dat het geneesmiddel bevat , volledig wordt opgenomen .
NULL ({ 3 7 12 13 }) ensure ({ 1 2 }) the ({ 4 }) food ({ 5 }) or ({ 6 }) treat ({ 8 }) containing ({ 9 }) the ({ 10 }) medication ({ 11 }) is ({ 15 }) completely ({ 14 }) consumed ({ 16 }) . ({ 17 })

Here the first line is a comment with some info, the second is the target 
sentence, and the last one lists the source tokens, each followed by the 
1-based positions of the target tokens it is aligned to.

If I'm not mistaken, a full Moses training does produce the desired format so I 
suspect that there is a script somewhere in Moses that converts the format - 
but I cannot seem to find it. Could you perhaps nudge me in the right direction?
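
Failing that, the conversion looks simple enough to script by hand. Below is a 
minimal Python sketch of what I have in mind, not an existing Moses script. It 
assumes that every record is exactly three lines as in the example above, that 
the third line starts with the NULL entry, and that the Pharaoh pairs should be 
written as (token on the third line)-(token on the second line) with 0-based 
indices. The names a3_line_to_pharaoh and convert_a3 are my own, and the pair 
order may need to be swapped depending on the direction in which the aligner 
was run.

```python
import re

# One entry on the alignment line looks like: token ({ 1 2 })
ENTRY = re.compile(r"(\S+)\s+\(\{([^}]*)\}\)")


def a3_line_to_pharaoh(alignment_line):
    """Convert the third line of an A3 record to a Pharaoh-style string.

    Assumption: positions inside ({ ... }) are 1-based indices into the
    sentence on the second line of the record.
    """
    entries = ENTRY.findall(alignment_line)
    # Drop the leading NULL entry: it lists unaligned words, not links.
    if entries and entries[0][0] == "NULL":
        entries = entries[1:]
    pairs = []
    for src_idx, (_token, positions) in enumerate(entries):
        for pos in positions.split():
            # Shift the 1-based A3 position to a 0-based Pharaoh index.
            pairs.append(f"{src_idx}-{int(pos) - 1}")
    return " ".join(pairs)


def convert_a3(lines):
    """Yield one Pharaoh line per three-line A3 record.

    Assumption: the input has no blank lines between records.
    """
    lines = [line.rstrip("\n") for line in lines]
    for start in range(0, len(lines) - 2, 3):
        yield a3_line_to_pharaoh(lines[start + 2])
```

For example, the alignment line "NULL ({ 2 }) a ({ 1 }) b ({ 3 4 })" would 
come out as "0-0 1-2 1-3". Words listed under NULL are simply left out, which 
is how I understand the Pharaoh format treats unaligned tokens.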


Kind regards

Bram Vanroy

Doctoral Researcher at Ghent University
Language and Translation Technology Team (LT³)
[email protected]

https://research.flw.ugent.be/en/bram.vanroy
https://www.linkedin.com/in/bramvanroy/
https://github.com/BramVanroy/

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
