Hi all, I have a question regarding LMs.
Let's take news.2014.shuffle.en as an example. When we run it through punctuation normalization for English, it will, for instance, put a space before an apostrophe: "it isn't" => "it isn 't".

But the corpus contains some noise. For instance, there are some French sentences in it, for which the apostrophe processing is not suited: "j'aime" => "j 'aime", which creates the token "'aime".

My question is the following: at the LM building stage, how can we prune to eliminate tokens like "'aime", so that they do not create wrong unigrams, bigrams, and so on? The ngram -minprune option only takes 2 as a minimum, so wrong unigrams will still end up in the LM.

I hope I'm clear enough.

Vincent
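
P.S. To make the problem concrete, here is a minimal Python sketch of what I mean. The regex is only a simplified stand-in for the English apostrophe rule (my assumption, not the actual normalization script), and split_apostrophe and unigram_counts are hypothetical helper names used for illustration only.

```python
import re
from collections import Counter

# Assumption: simplified stand-in for the English apostrophe rule,
# not the real script. It inserts a space before any apostrophe
# that sits between two word characters.
APOSTROPHE = re.compile(r"(?<=\w)'(?=\w)")


def split_apostrophe(sentence):
    """Put a space before each word-internal apostrophe."""
    return APOSTROPHE.sub(" '", sentence)


def unigram_counts(sentences):
    """Count whitespace-separated tokens after apostrophe splitting."""
    counts = Counter()
    for sentence in sentences:
        counts.update(split_apostrophe(sentence).split())
    return counts


# Toy corpus: one English sentence and one stray French sentence.
corpus = ["it isn't clear", "j'aime le cinéma"]
counts = unigram_counts(corpus)

# The French sentence contributes the unwanted unigram "'aime".
print(split_apostrophe("j'aime"))
print(counts["'aime"])
```

On the real corpus such tokens show up with small but nonzero counts, and these are the unigrams I would like to keep out of the LM.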