Hi, I tend to fix this in the tokenization script, or solve it in a pre-processing script if there are obvious patterns in the noise.
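For example, here is a minimal pre-processing sketch of that idea (the function name, the clitic whitelist, and the character range are my assumptions, not anything from Moses; adjust them to your corpus). It drops tokens of the form apostrophe-plus-letters, such as the "'aime" left over from a mis-split French "j'aime", while keeping the clitics that legitimate English apostrophe splitting produces:

```python
import re

# English clitics produced by legitimate apostrophe splitting; keep these.
# (Assumed list -- extend it if your tokenizer emits other forms.)
ENGLISH_CLITICS = {"'s", "'t", "'nt", "'ll", "'re", "'ve", "'d", "'m"}

# Anything else of the form apostrophe + letters (e.g. "'aime" from a
# mis-split French "j'aime") is treated as noise. The character class
# is an assumption; widen it for your corpus.
NOISE_TOKEN = re.compile(r"^'[a-zA-Z\u00C0-\u00FF]+$")

def clean_line(line: str) -> str:
    """Drop apostrophe-noise tokens from a whitespace-tokenized line."""
    return " ".join(
        tok for tok in line.split()
        if tok in ENGLISH_CLITICS or not NOISE_TOKEN.match(tok)
    )
```

Running this over the corpus before LM training keeps tokens like "'aime" from ever becoming unigrams, so no pruning trick is needed afterwards.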
-- Dingyuan

On Nov 26, 2015, at 21:09, "Vincent Nguyen" <[email protected]> wrote:
> Hi all,
>
> I have a question regarding LMs.
>
> Let's take the example of news.2014.shuffle.en.
>
> When we process it through punctuation normalization for the English
> language, it will, for instance, put a space before an apostrophe:
> "it isn't" => "it isn 't"
>
> BUT the corpus contains some noise. For instance, there are some French
> sentences in it, for which the apostrophe handling is not suited:
> "j'aime" => "j 'aime" => it will create the token 'aime
>
> My point is the following:
>
> At the LM-building stage, how can we prune tokens like "'aime" so that
> they do not create wrong uni-grams, nor bi-grams, ...?
>
> ngram -minprune only takes 2 as a minimum, so wrong unigrams will
> still end up in the LM.
>
> Hope I'm clear enough ....
>
> Vincent
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
