Hi all,

At the moment I'm experimenting with corpus files that contain placeholders.
Since I'm not a very experienced user, I'd like to ask for some advice. Has
anyone already experimented with this?
My first thought was to remove all placeholders, but they make up around
10% of the corpus files. I'd therefore like to keep them for training,
since in many cases they stand in for words, e.g.:

Original text strings:
See <ph x="1">{1}</ph> and <ph x="2">{2}</ph>.

Removed markup:
See {1} and {2}.

If I removed the placeholders, the sentence structure would obviously be
broken. Broken sentences should be quite problematic, shouldn't they?
Other placeholders appear to be meant as inline elements, e.g.:

Select an <ph x="1">{1}</ph>option<ph x="2">{2}</ph> from the context menu.

Select an {1}option{2} from the context menu.
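
The markup removal shown in the two examples above could be sketched roughly as follows. This is only a minimal illustration, assuming every placeholder is wrapped in a <ph ...>...</ph> element exactly as in the sample strings; the names PH_TAG and strip_ph_markup are hypothetical and not part of Moses.

```python
import re

# Assumption: placeholders always look like <ph x="N">{N}</ph>, as in the
# examples above. The group captures the inner text, e.g. {1}.
PH_TAG = re.compile(r'<ph\b[^>]*>(.*?)</ph>')

def strip_ph_markup(line: str) -> str:
    """Replace each <ph ...>inner</ph> element with its inner text."""
    return PH_TAG.sub(r'\1', line)

if __name__ == "__main__":
    print(strip_ph_markup('See <ph x="1">{1}</ph> and <ph x="2">{2}</ph>.'))
    print(strip_ph_markup(
        'Select an <ph x="1">{1}</ph>option<ph x="2">{2}</ph> '
        'from the context menu.'))
```

Note that in the second string the placeholders stay glued to "option", so a further step would be needed before they can act as separate tokens.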

My strategy would be to add these placeholders to the list of non-breaking
prefixes so that they are treated like words. Setting the right distortion
value should then keep them in place. Is this a good idea?
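
To make "treated like words" concrete, here is a rough sketch of one possible preprocessing step: padding each {N} placeholder with spaces so that it becomes a separate token. It assumes the placeholders always have the form {N} with N an integer, and that the tokenizer run afterwards leaves {N} intact, which would have to be verified. The names PLACEHOLDER and isolate_placeholders are hypothetical.

```python
import re

# Assumption: after markup removal, placeholders always look like {1}, {2}, ...
PLACEHOLDER = re.compile(r'\{\d+\}')

def isolate_placeholders(line: str) -> str:
    """Surround every {N} with spaces, then collapse repeated whitespace."""
    spaced = PLACEHOLDER.sub(lambda m: f' {m.group(0)} ', line)
    return ' '.join(spaced.split())

if __name__ == "__main__":
    print(isolate_placeholders('Select an {1}option{2} from the context menu.'))
```

This only separates the placeholders from neighbouring words; it does not by itself guarantee that they keep their position during decoding.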

Best regards,
Daniel

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
