Dear support list,
I am a PhD student in machine translation and, as part of my research, I would
like to compare several word alignment tools such as MGIZA++, fast_align, and
efmaral. The easiest format for me to work with would be the Pharaoh format,
i.e. the format in which each alignment link is a source and a target token
index separated by a dash. For instance:
0-0 1-1 2-2 2-3 3-4 4-5
However, after running (M)GIZA++, I can only find the extended format in the
A3 files, which look like this:
# Sentence pair (1236078) source length 12 target length 17 alignment score : 1.07364e-29
zorg ervoor dat het voer of het extraatje dat het geneesmiddel bevat , volledig wordt opgenomen .
NULL ({ 3 7 12 13 }) ensure ({ 1 2 }) the ({ 4 }) food ({ 5 }) or ({ 6 }) treat ({ 8 }) containing ({ 9 }) the ({ 10 }) medication ({ 11 }) is ({ 15 }) completely ({ 14 }) consumed ({ 16 }) . ({ 17 })
Here, the first line is a comment with some metadata, the second is the target
sentence, and the third lists the source tokens, each followed by the positions
of the target tokens it is aligned to.
If I'm not mistaken, a full Moses training run does produce the desired format,
so I suspect that there is a script somewhere in Moses that converts the
format, but I cannot seem to find it. Could you perhaps nudge me in the right
direction?
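To make the conversion concrete, here is a minimal Python sketch of what I have
in mind. It is not the Moses script I am looking for, and the function names are
my own. It assumes that every A3 entry spans exactly three physical lines, that
the first group on the third line is always NULL, that the positions inside
({ }) are 1-based indices into the sentence on the second line, and that the
tokens on the third line should end up on the left-hand side of each Pharaoh
pair. Depending on the direction in which GIZA++ was run, the two sides may
need to be swapped.

```python
import re

# One "token ({ 3 7 12 })" group from the third line of an A3 entry.
GROUP_RE = re.compile(r"(\S+)\s+\(\{([\d\s]*)\}\)")


def alignment_line_to_pharaoh(line):
    """Turn the third line of one A3 entry into a Pharaoh-style string."""
    pairs = []
    for position, match in enumerate(GROUP_RE.finditer(line)):
        if position == 0:
            # Assumption: the first group is always the NULL token; skip it.
            continue
        src_idx = position - 1  # 0-based index among the real tokens
        for tgt in match.group(2).split():
            # Assumption: A3 positions are 1-based, Pharaoh indices 0-based.
            pairs.append(f"{src_idx}-{int(tgt) - 1}")
    return " ".join(pairs)


def a3_to_pharaoh(lines):
    """Yield one Pharaoh string per three-line A3 entry."""
    it = iter(lines)
    # zip over the same iterator consumes the input three lines at a time.
    for _comment, _sentence, aligned in zip(it, it, it):
        yield alignment_line_to_pharaoh(aligned)
```

Applied to the third line of the example above, alignment_line_to_pharaoh
should give:
0-0 0-1 1-3 2-4 3-5 4-7 5-8 6-9 7-10 8-14 9-13 10-15 11-16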
Kind regards
Bram Vanroy
Doctoral Researcher at Ghent University
Language and Translation Technology Team (LT³)
[email protected]
https://research.flw.ugent.be/en/bram.vanroy
https://www.linkedin.com/in/bramvanroy/
https://github.com/BramVanroy/
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support