There are many factors here. Firstly, the randomised LM makes errors as a function of the false positive rate and the values (quantisation) level: roughly, the smaller these parameters are, the smaller your LM will be, but the more errors it makes, so there may be a performance drop.
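To make the tradeoff concrete, here is a back-of-the-envelope calculation using the textbook Bloom filter formulas (error rate 2^-k at roughly k/ln 2 bits per entry); RandLM's structures also store quantised counts, so treat this as intuition rather than an exact size model:

  # error rate vs. space for an optimally-sized Bloom filter:
  awk 'BEGIN { for (k = 4; k <= 12; k += 4)
      printf "k=%2d bits: error rate=%.5f, bits per n-gram=%4.1f\n",
             k, 2^(-k), k / log(2) }'

Going from 8 bits down to 4 roughly halves the space but raises the error rate sixteen-fold, which is where the performance drop can come from.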
Secondly, the default count-based smoothing methods are only good when you use enormous quantities of data -- see the Google LM paper, where they show that Stupid Backoff approaches Kneser-Ney smoothing only at very large data sizes. If you really want the best performance from moderate amounts of data (50 million lines is small: I have used 1 billion sentences), then you can get SRILM to produce an ARPA file as normal (this is the output of ngram-count). RandLM can convert an ARPA file into its randomised format. This means that RandLM will use Kneser-Ney smoothing, and assuming reasonable error rates, your translation performance should be near identical to when using SRILM directly. (The commands are sketched at the end of this message.)

Miles

2009/4/16 Michael Zuckerman <[email protected]>:
> Hi,
>
> We used Moses with RandLM - we took a very big corpus of ~50 million lines
> for the language model and processed it with RandLM. Then we compared the
> results with Moses run with SRILM on a much smaller corpus. Surprisingly,
> SRILM gave much better results (better translation quality), although it
> used a much smaller corpus. Both LMs used 5-grams.
> These results were repeated for different language pairs (German-English,
> Russian-English, Spanish-English, etc.).
> Could you please explain these results?
>
> Thanks,
> Michael.
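For reference, the pipeline looks something like this (file names are placeholders; the ngram-count options are standard SRILM, but the buildlm options are from memory, so check the RandLM documentation for the exact ARPA-input flags):

  # 1) train a 5-gram model with interpolated Kneser-Ney smoothing,
  #    writing a standard ARPA file:
  ngram-count -order 5 -interpolate -kndiscount \
      -text corpus.txt -lm model.arpa

  # 2) convert the ARPA file into RandLM's randomised format; falsepos
  #    and values are the error-rate and quantisation parameters above
  #    (flag names from memory -- see the RandLM docs):
  buildlm -falsepos 8 -values 8 -order 5 \
      -input-path model.arpa -input-type arpa -output-prefix model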
