Hi, I am working on Arabic MT using several tokenization schemes. Each scheme produces different line lengths, so when I clean the corpus to remove lines longer than 85 words, a different set of lines is dropped for each scheme, and the resulting corpora are no longer comparable. To avoid this imbalance, suppose I have 4 schemes (A, B, C, D): I want to remove, from all of the files, every line whose B-scheme version exceeds 85 words. How can I do that using clean-corpus-n.perl?
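To make the goal concrete, here is a minimal Python sketch of the filtering I have in mind. It is not clean-corpus-n.perl and does not use any of its options. The function name `filter_by_reference` and the dictionary layout are only illustrative assumptions. I assume all files are line-aligned and that "words" means whitespace-separated tokens.

```python
from typing import Dict, List


def filter_by_reference(
    schemes: Dict[str, List[str]], ref: str = "B", max_len: int = 85
) -> Dict[str, List[str]]:
    """Drop, from every file, the lines whose `ref` version has more than max_len tokens.

    `schemes` maps a (hypothetical) scheme name to that file's lines; all
    files are assumed to be parallel, i.e. line i corresponds across files.
    """
    n = len(schemes[ref])
    if any(len(lines) != n for lines in schemes.values()):
        raise ValueError("all files must have the same number of lines")

    # Decide which line numbers survive by looking only at the reference scheme.
    keep = [i for i, line in enumerate(schemes[ref]) if len(line.split()) <= max_len]

    # Apply the same selection to every file so they stay aligned.
    return {name: [lines[i] for i in keep] for name, lines in schemes.items()}
```

The English side of the corpus could be passed in as one more entry in the dictionary, so that it stays aligned with the four Arabic versions. Reading and writing the actual files is left out here.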
Thanks for any help you may offer. Best Regards.
