What would constitute duplication in this context? The number of duplicated
lines in the document is relatively small, but it's possible some of the
lines contain similar text.

    $ wc lm_data.en
     1876364 21359196 96962517 lm_data.en
    $ sort lm_data.en | uniq > lm_data_uniq.en
    $ wc lm_data_uniq.en
     1487703 15801025 71344598 lm_data_uniq.en
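For reference, one quick way to see which exact lines repeat most often is to count occurrences per distinct line. A sketch using a small hypothetical sample.en; in practice the same pipeline would be run against lm_data.en:

```shell
# Hypothetical sample file standing in for lm_data.en.
printf 'hello world\nhello world\nhello world\ngoodbye\n' > sample.en

# Count occurrences of each distinct line, most frequent first.
sort sample.en | uniq -c | sort -rn | head
```

The leading number on each output line is how many times that line appears, which makes heavy duplicates easy to spot.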

I'd have thought there would be enough unique data in there though, as the
file combines the following datasets from OPUS:

* GNOME
* OpenSubtitles 2018
* Tanzil
* Tatoeba
* Ubuntu

Thanks,
James

On Mon, 3 Dec 2018 at 11:58, Kenneth Heafield <[email protected]> wrote:

> Hi,
>
>     If I had to guess, you have a lot of duplicated text?
>
> Kenneth
> On 12/3/18 11:23 AM, James Baker wrote:
>
> Morning,
>
> I've been trying to train a language model using the following command:
>
>     /opt/model-builder/mosesdecoder/bin/lmplz -o 5 -S 80% -T /tmp <
> lm_data.en > model.lm
>
> But I'm getting the following error:
>
>     === 1/5 Counting and sorting n-grams ===
>     Reading /opt/model-builder/training/lm_data.en
>     ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
>     ****************************************************************************************************
>     Unigram tokens 21187448 types 117756
>     === 2/5 Calculating and sorting adjusted counts ===
>     Chain sizes: 1:1413072 2:5151762432 3:9659554816 4:15455287296
> 5:22538960896
>     terminate called after throwing an instance of
> 'lm::builder::BadDiscountException'
>     what(): /opt/model-builder/mosesdecoder/lm/builder/adjust_counts.cc:61
> in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const
> lm::builder::DiscountConfig&) threw BadDiscountException because
> `discounts_[i].amount[j] < 0.0 || discounts_[i].amount[j] > j'.
>     ERROR: 5-gram discount out of range for adjusted count 2: -6.80247
>
> The data I'm training on has come from the OPUS project. I found some
> references online to issues when there isn't enough training data, but I
> think I have sufficient data and have previously trained on a lot less (and
> even on a subset of my current data):
>
>     $ wc lm_data.en
>     1874495 21187448 96148754 lm_data.en
>
> Any ideas what might be causing the problem?
>
> James
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>