Hi all,

I have a question regarding LMs.

Let's take the example of news.2014.shuffle.en

When we run it through punctuation normalization for English,
a space is inserted before each apostrophe, for instance:
"it isn't" => "it isn 't"

BUT the corpus contains some noise. For instance, it includes some French
sentences, for which this apostrophe handling is not suited:
"j'aime" => "j 'aime", which creates the token "'aime"
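
To make the problem concrete, here is a rough sketch in Python of the kind of apostrophe splitting I mean. It is only an approximation for illustration, not the actual Moses script, and the function name split_apostrophes is made up for this example:

```python
import re

def split_apostrophes(text: str) -> str:
    """Insert a space before any apostrophe that sits between two letters.

    This only approximates English-style apostrophe handling; the real
    preprocessing scripts have more rules than this single regex.
    """
    return re.sub(r"(?<=[^\W\d_])'(?=[^\W\d_])", " '", text)

# English input: the clitic is split off as expected.
print(split_apostrophes("it isn't"))

# French input: the same rule splits off a bogus token "'aime".
print(split_apostrophes("j'aime").split())
```

Every French elision in the corpus goes through the same rule, so each one can contribute a spurious token that starts with an apostrophe.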

My question is the following:

At the LM building stage, how can we prune to eliminate tokens like
"'aime", so that they do not create wrong unigrams, bigrams, etc.?

The ngram -minprune option only accepts 2 as a minimum, so wrong unigrams
will still be kept in the LM.


Hope I'm clear enough.

Vincent
