I am working on Arabic MT with several different tokenization schemes.
The schemes produce different line lengths for the same sentence, so when
I clean the corpus to remove lines longer than 85 words, a different set
of lines may be dropped for each scheme.
To avoid this imbalance, suppose I have 4 schemes (A, B, C, D). I need to
remove, from all of the files, every line whose scheme-B version exceeds
85 words.
How can I do that using clean-corpus-n.perl?
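
To make the goal concrete, here is a rough Python sketch of the behaviour
I am after. It is not part of Moses and does not use clean-corpus-n.perl;
the function names, the ".clean" suffix and any file names passed in are
my own assumptions, and "words" here simply means whitespace-separated
tokens.

```python
def filter_by_reference(ref_lines, other_files, max_words=85):
    """Keep line i in every file only if ref_lines[i] has <= max_words tokens.

    ref_lines   -- lines of the reference scheme (scheme B in my case)
    other_files -- list of line lists for the other schemes (A, C, D, ...)
    """
    for lines in other_files:
        if len(lines) != len(ref_lines):
            raise ValueError("all files must have the same number of lines")

    # One keep/drop decision per line, based on the reference scheme only.
    keep = [len(line.split()) <= max_words for line in ref_lines]

    filtered_ref = [l for l, k in zip(ref_lines, keep) if k]
    filtered_others = [
        [l for l, k in zip(lines, keep) if k] for lines in other_files
    ]
    return filtered_ref, filtered_others


def filter_files(ref_path, other_paths, max_words=85, suffix=".clean"):
    """Hypothetical file wrapper: writes <path><suffix> for every input file."""
    def read_lines(path):
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]

    ref_lines = read_lines(ref_path)
    others = [read_lines(p) for p in other_paths]
    new_ref, new_others = filter_by_reference(ref_lines, others, max_words)

    all_paths = [ref_path] + list(other_paths)
    all_lines = [new_ref] + new_others
    for path, lines in zip(all_paths, all_lines):
        with open(path + suffix, "w", encoding="utf-8") as f:
            f.writelines(line + "\n" for line in lines)
```

The point is that the keep/drop decision is made once, from scheme B, and
then applied to the same line numbers in every other file, so all files
stay aligned.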
Thanks for any help you may offer.
Moses-support mailing list