Hi,
Your command looks reasonable. The toolkits differ in what they
calculate, which would explain why lmplz fails where the others succeed.
MITLM says "Instead of estimating discount factors f [i] = D(i) from
count statistics, we can also tune them to minimize the development set
perplexity, which has been observed to improve performance [1]." This
is in fact better than what lmplz does.
IRSTLM isn't computing canonical modified Kneser-Ney. Your data seems
close to passing the discount check (it passed for orders 1-4), so my
guess is that IRSTLM either uses a different formula (likely) or does
not check for this condition.
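
To make the condition concrete, here is a rough Python sketch of the
closed-form modified Kneser-Ney discounts (Chen & Goodman) together with
the range test quoted in the error message. It assumes lmplz uses this
standard closed form; the function names and the count-of-counts values
are made up for illustration and are not taken from your data.

```python
def modified_kn_discounts(n1, n2, n3, n4):
    """Closed-form modified Kneser-Ney discounts for one n-gram order.

    n_k is the number of distinct n-grams whose adjusted count is exactly k.
    Returns [D1, D2, D3plus], where D_j = j - (j + 1) * Y * n_{j+1} / n_j
    and Y = n1 / (n1 + 2 * n2).
    """
    y = n1 / (n1 + 2.0 * n2)
    counts = [n1, n2, n3, n4]
    return [j - (j + 1) * y * counts[j] / counts[j - 1] for j in (1, 2, 3)]


def out_of_range(discounts):
    """Return the adjusted counts j whose discount falls outside [0, j]."""
    return [j for j, d in enumerate(discounts, start=1) if d < 0.0 or d > j]


# Hypothetical count-of-counts that fall off smoothly: every discount is valid.
healthy = modified_kn_discounts(1000, 400, 200, 120)

# Hypothetical skewed counts: far more n-grams seen 3 times than 2 times,
# which drives the discount for adjusted count 2 below zero.
skewed = modified_kn_discounts(1000, 100, 500, 400)
```

With these invented numbers, out_of_range(healthy) is empty, while
out_of_range(skewed) flags adjusted count 2, the same kind of failure
reported for your 5-grams.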
SRILM should encounter an error on this data. If it doesn't, then I'll
dig into the code.
If you pipe the data through sort -u, how much smaller does it get and
does that fix the issue? Does it contain e.g. long product names that
might mess up 5-gram statistics?
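
If it is easier to measure this in a script, a minimal sketch along the
following lines reports what fraction of the lines are repeats. The
function name is hypothetical, and unlike sort -u it keeps every unique
line in memory, so it may not suit a very large corpus.

```python
def duplicate_fraction(lines):
    """Fraction of lines that are exact repeats of an earlier line."""
    total = 0
    unique = set()
    for line in lines:
        total += 1
        unique.add(line)
    if total == 0:
        return 0.0
    return 1.0 - len(unique) / total
```

Passing an open file handle for the training text (for example
/data/lm.fr) as the lines argument gives the fraction for the whole
corpus. A high value would suggest that repeated lines are distorting
the count-of-counts.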
I guess the reasonable thing to do in this situation would be to turn
off the "modified" part of modified Kneser-Ney, at least for 5-grams in
your case. Instead of separate discounts for counts of 1, 2, and 3+,
there would be one discount for all counts. I could implement this if
you want.
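
As a sketch of what a single discount could look like, the usual
estimate for non-modified Kneser-Ney is D = n1 / (n1 + 2 * n2), where
n1 and n2 are the numbers of n-grams with adjusted count 1 and 2. This
is only an illustration under that assumption, not how lmplz is or
would be implemented, and the function name is hypothetical.

```python
def single_discount(n1, n2):
    """One absolute discount shared by all counts: D = n1 / (n1 + 2 * n2).

    Whenever n1 > 0 and n2 >= 0 the result lies in (0, 1], so it cannot be
    negative the way the per-count discounts can.
    """
    return n1 / (n1 + 2.0 * n2)


# Hypothetical counts: 1000 n-grams seen once, 100 seen twice.
d = single_discount(1000, 100)
```

Because this estimate depends only on n1 and n2, an unusual ratio
between the higher count-of-counts cannot push it out of range.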
Also, you're running a 64-bit machine, yes?
Kenneth
On 02/21/13 07:05, Darragh Whelan wrote:
> Hi everyone,
>
> I am trying to build a language model using lmplz but I am running into
> some problems.
>
> The error is:
>
> adjust_counts.cc:50 in void
> lm::builder::<unnamed>::StatCollector::CalculateDiscounts() threw
> BadDiscountException because `discounts_[i].amount[j] < 0.0 ||
> discounts_[i].amount[j] > j'.
>
> ERROR: 5-gram discount out of range for adjusted count 2: -5.03664
>
> Aborted
>
> I have followed the Moses manual closely and the command I used to run was:
>
> /bin/lmplz -o 5 -S 80% -T /tmp/ < /data/lm.fr > /engines/FR_FR/lm/lmplz.arpa
>
> I don’t think it is a problem with our data as I have been able to build
> a language model using IRSTLM and MITLM successfully.
>
> Could anyone please help us with getting lmplz to work with this data?
>
> Thanks,
>
> Darragh
>
> --
> Oracle <http://www.oracle.com/>
> Darragh Whelan | Software Engineer | +353 180 31922
>
> Oracle- WPTG
>
> East Point Business Park
>
> Clontarf
>
> Dublin
>
> Ireland
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>