Hi,

I tend to fix this in the tokenization script, or handle it in a
pre-processing script if there are obvious patterns in the noise.
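
A minimal sketch of such a pre-processing filter, assuming the noise is mostly French elision (j', l', d', ...); the regex, file handling, and function names here are hypothetical, not part of any Moses script:

```python
import re

# Heuristic pattern for French elision: a single consonant (or "qu")
# followed by an apostrophe and a vowel, e.g. "j'aime", "l'eau".
# English contractions like "isn't" or "it's" do not match, because
# the character after the apostrophe is not a vowel.
FRENCH_ELISION = re.compile(
    r"\b(?:[jldcmnst]|qu)'(?=[aeiouyhàâéèêëîïôûù])", re.IGNORECASE
)

def looks_french(line):
    """Return True if the line contains a French elision pattern."""
    return bool(FRENCH_ELISION.search(line))

def filter_corpus(lines):
    """Keep only lines that do not match the noise pattern."""
    return [line for line in lines if not looks_french(line)]
```

Running this before tokenization drops the offending sentences, so tokens like "'aime" never reach the LM training data, e.g. `filter_corpus(["I like it .", "j'aime bien ."])` keeps only the English line.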

--
Dingyuan
On Nov 26, 2015 at 21:09, "Vincent Nguyen" <[email protected]> wrote:

> Hi all,
>
> I have a question regarding LMs.
>
> Let's take the example of news.2014.shuffle.en
>
> When we process it through punctuation normalization for the English
> language, it will, for instance, put a space before an apostrophe:
> "it is'nt" => "it is 'nt"
>
> BUT it contains some noise; for instance, there are some French sentences
> in the corpus, for which the apostrophe processing is not suited:
> "j'aime" => "j 'aime", which creates the token 'aime
>
> My point is the following:
>
> At the LM-building stage, how can we prune away tokens like "'aime" so
> that they do not create wrong unigrams, bigrams, ...?
>
> The ngram -minprune option only takes 2 as a minimum, so wrong unigrams
> will still end up in the LM.
>
>
> Hope I'm clear enough...
>
> Vincent
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
