There are many factors here.  Firstly, a randomised LM makes errors
as a function of the false positive rate and the value (quantisation)
level.  Roughly, the smaller these parameters are, the smaller your LM
will be, but there may be a drop in translation performance.
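
To put rough numbers on that (if I am remembering the RandLM
parameters correctly): a false positive setting of v gives an error
rate of about 1/2^v, so falsepos 8 means roughly 1/256 (0.4%) of
lookups return a corrupted value, while falsepos 4 pushes that up to
1/16 (about 6%) in exchange for a smaller filter.  The values
parameter trades quantisation precision for space in the same way.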

Secondly, the default count-based smoothing methods are only good when
you use enormous quantities of data -- see the Google LM paper
(Brants et al., EMNLP 2007), where they show that Stupid Backoff
approaches Kneser-Ney smoothing as the amount of data grows.

If you really want the best performance from moderate amounts of data
(50 million lines is small:  I have used 1 billion sentences), then you
can get SRILM to produce an ARPA file as normal (this is the output
of ngram-count).  RandLM can then convert that ARPA file into a
randomised format.  What this means is that RandLM will use Kneser-Ney
smoothing, and, assuming reasonable error rates, your translation
performance should be near identical to when using SRILM directly.
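
Concretely, the pipeline looks something like the following.  The
ngram-count flags are standard SRILM; the buildlm line is from memory,
so treat it as a sketch and check the RandLM documentation for the
exact flags (the falsepos/values settings of 8 are just illustrative):

  # build a 5-gram modified Kneser-Ney LM with SRILM, written as ARPA
  ngram-count -order 5 -kndiscount -interpolate -text corpus.txt -lm model.arpa

  # convert the ARPA file into RandLM's randomised representation
  buildlm -falsepos 8 -values 8 -order 5 -output-prefix model.randlm < model.arpa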

Miles

2009/4/16 Michael Zuckerman <[email protected]>:
> Hi,
>
> We used Moses with RandLM - we took a very big corpus of ~50 million lines
> for the language model and processed it with RandLM. Then we compared the
> results with Moses run with SRILM on a much smaller corpus. Surprisingly,
> SRILM gave much better results (better translation quality), although it
> used a much smaller corpus. Both LMs used 5-grams.
> These results were repeated across different language pairs (German-English,
> Russian-English, Spanish-English, etc.).
> Could you please explain these results?
>
> Thanks,
>     Michael.
>


