Hiya

On 29/03/2016 14:13, Tom Hoar wrote:
I think (M)GIZA++ expertise has faded over the years, but I'm hoping someone has some ideas about this.

A user started training with an extremely small TMX file containing a few dozen parallel segments. Our preparation tools reduced the parallel corpus to only 2 pairs (attached). The user was running the Windows version (mkcls.exe), but we verified the same error on Linux. The train-model.perl script failed in step 1 (log also attached). Specifically, the mkcls binary failed with this error message:

    Assertion failed!
    File: src/mkcls/StatVar.cpp, Line 110
    Expression: index>=0&&index<n

I've never seen this error with a respectable corpus size. So, I did some tests.

  * test 1, copy same sentences 14,500 times, got assertion (failed).
  * test 2, copy each of the pairs only 7,250 times, got assertion
    (failed).
  * test 3, added 4,000 unique pairs, no assertion (success).
  * test 4, reduced to 2 original + 3 new pairs, no assertion (success).
  * test 5, reduced to 2 original + 2 new pairs, assertion returned
    (failed).

It seems this assertion is linked to lack of variety in the training corpus. Can anyone confirm this observation? Has anyone ever experienced this error?

If no one has seen this with a larger corpus, a terminal failure due to lack of variety is probably acceptable. Would it be accurate to add an error message to the effect of "terminal error due to insufficient variety in the training corpus"?

Be my guest. You're probably not the first to encounter these edge cases, just the first to have to deal with them.
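
If it helps, here is a minimal sketch of a pre-flight check that could emit such a message before mkcls is invoked. It only counts distinct whitespace-separated tokens on one side of the corpus. The function names and the `min_types` threshold are hypothetical; nothing in this thread establishes what mkcls actually requires, so the threshold would have to be tuned against tests like the ones above.

```python
from typing import Iterable


def count_word_types(lines: Iterable[str]) -> int:
    """Count distinct whitespace-separated tokens across all lines."""
    vocab = set()
    for line in lines:
        vocab.update(line.split())
    return len(vocab)


def check_variety(lines: Iterable[str], min_types: int, side: str) -> int:
    """Raise ValueError if one side of the corpus has too few word types.

    min_types is a hypothetical threshold, not a value taken from mkcls.
    """
    n_types = count_word_types(lines)
    if n_types < min_types:
        raise ValueError(
            "terminal error due to insufficient variety in the training "
            f"corpus ({side} side: {n_types} distinct word types, "
            f"need at least {min_types})"
        )
    return n_types


if __name__ == "__main__":
    # Duplicating sentences adds lines but no new word types, which is
    # consistent with tests 1 and 2 still failing.
    tiny = ["the cat sat", "the dog sat"]
    print(count_word_types(tiny))          # 4 distinct types
    print(count_word_types(tiny * 14500))  # still 4 distinct types
```

The check would run once per language side, just before step 1 of training, so the user sees a readable message instead of the assertion.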

Thanks for any ideas/suggestions.
Tom




_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support

