The Google n-grams are tiny: 1.1 billion 5-grams, while I have 263 billion.

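They thresholded at 40 (and 200 for vocabulary words), so anything seen
less often was dropped.  Thresholding basically means the counts are only
useful for stupid backoff, because Kneser-Ney smoothing estimates its
discounts from the rare n-grams that thresholding removes.  Also, they
didn't deduplicate the data before training.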
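
To illustrate why thresholded counts do not work for Kneser-Ney, here is
a minimal sketch of the modified Kneser-Ney discount estimates (the
closed-form ones from Chen and Goodman) for a single n-gram order.  The
function name and the toy counts are made up for illustration, and the
sketch ignores the adjusted counts a real implementation uses for lower
orders.

```python
from collections import Counter

def mkn_discounts(ngram_counts):
    """Estimate discounts (D1, D2, D3+) from counts of counts.

    ngram_counts maps each distinct n-gram of one order to its count.
    """
    # n[k] = number of distinct n-grams that occur exactly k times.
    n = Counter(ngram_counts.values())
    if any(n[k] == 0 for k in (1, 2, 3, 4)):
        # Sketch-level choice: refuse to estimate without n1..n4.
        raise ValueError("need n-grams with counts 1, 2, 3 and 4")
    y = n[1] / (n[1] + 2 * n[2])
    d1 = 1 - 2 * y * n[2] / n[1]
    d2 = 2 - 3 * y * n[3] / n[2]
    d3_plus = 3 - 4 * y * n[4] / n[3]
    return d1, d2, d3_plus

# Hypothetical toy bigram counts.
toy = {"a b": 1, "b c": 1, "c d": 1, "d e": 1,
       "e f": 2, "f g": 2, "g h": 3, "h i": 4, "i j": 50}
print(mkn_discounts(toy))

# After thresholding at 40, no n-gram with count 1..4 survives,
# so the discounts can no longer be estimated.
thresholded = {k: v for k, v in toy.items() if v >= 40}
try:
    mkn_discounts(thresholded)
except ValueError as err:
    print("thresholded:", err)
```

With the full toy counts the estimates are well defined; once everything
below 40 is dropped, the counts of counts they depend on are all zero.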

Would you like an unpruned interpolated modified Kneser-Ney language
model, trained on more data than Google used, with these n-gram counts
(order, then number of n-grams)?

1 2640258088
2 15297753348
3 61858786129
4 156775272110
5 263690452834
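
For a sense of scale, the arithmetic on the counts listed here can be
checked with a few lines.  The Google figure is the rounded 1.1 billion
mentioned in this message, so the ratio is only approximate, and the
variable names are made up for illustration.

```python
# n-gram counts per order, copied from the list in this message.
my_counts = {
    1: 2640258088,
    2: 15297753348,
    3: 61858786129,
    4: 156775272110,
    5: 263690452834,
}

# Rounded figure for the Google 5-grams (assumption: exactly 1.1e9).
google_5grams = 1.1e9

ratio = my_counts[5] / google_5grams
total = sum(my_counts.values())

print(f"5-gram ratio: roughly {ratio:.0f}x")
print(f"n-grams over all orders: {total:,}")
```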

RandLM implements stupid backoff.  KenLM does not; my plan is to remove
the use case for stupid backoff.
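
For readers who have not seen it, stupid backoff scores a word by its
relative frequency after the longest matching context and multiplies by a
fixed factor each time it has to shorten the context.  A minimal sketch
follows; the function names and toy sentence are made up for
illustration, the factor 0.4 is the value suggested by Brants et al.
(2007), and this is not how RandLM implements it.

```python
from collections import Counter

ALPHA = 0.4  # fixed backoff factor (value suggested by Brants et al.)

def count_ngrams(tokens, max_order):
    """Count all n-grams of order 1..max_order as tuples."""
    counts = Counter()
    for order in range(1, max_order + 1):
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts

def stupid_backoff(word, context, counts, total_tokens):
    """Return the unnormalized score S(word | context), not a probability."""
    penalty = 1.0
    context = tuple(context)
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            # Relative frequency of the word after this context.
            return penalty * counts[ngram] / counts[context]
        context = context[1:]  # drop the oldest context word
        penalty *= ALPHA
    # Base case: unigram relative frequency.
    return penalty * counts[(word,)] / total_tokens

tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens, 3)
print(stupid_backoff("cat", ["the"], counts, len(tokens)))
print(stupid_backoff("mat", ["sat", "the"], counts, len(tokens)))
```

The scores only need raw counts, which is why thresholded data is enough
for it.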

Kenneth

On 03/18/14 20:39, Hieu Hoang wrote:
> Moses supports RandLM and neural network LM which can handle very large
> amounts of data, I think.
> 
> I'm not sure if IRSTLM or KenLM can handle Google n-gram data, but I know
> they can handle large amounts of data.
> 
> 
> On 17 March 2014 14:56, Zheng Yuan wrote:
> 
>     Hi,
> 
>     I am wondering whether it is possible for Moses to use other kinds
>     of LMs, such as an existing Web interface or the Google n-grams?
> 
>     Regards,
>     Zheng
> 
>     _______________________________________________
>     Moses-support mailing list
>     [email protected] <mailto:[email protected]>
>     http://mailman.mit.edu/mailman/listinfo/moses-support
> 
> 
> 
> 
> -- 
> Hieu Hoang
> Research Associate
> University of Edinburgh
> http://www.hoang.co.uk/hieu
> 
> 
> 