Hiya
On 29/03/2016 14:13, Tom Hoar wrote:
I think (M)GIZA++ expertise has faded over the years, but I'm hoping
someone has some ideas about this.
A user started training with an extremely small TMX file containing a
few dozen parallel segments. Our preparation tools reduced the parallel
corpus to only 2 pairs (attached). The user was running the Windows
version (mkcls.exe) but we verified the same error on Linux. The
train-model.perl script failed in step 1 (log also attached).
Specifically, the mkcls binary failed with this error message:
Assertion failed!
File: src/mkcls/StatVar.cpp, Line 110
Expression: index>=0&&index<n
I've never seen this error with a respectable corpus size. So, I did
some tests.
* test 1, copied the same sentences 14,500 times, got assertion (failed).
* test 2, copied each of the pairs only 7,250 times, got assertion
(failed).
* test 3, added 4,000 unique pairs, no assertion (success).
* test 4, reduced to 2 original + 3 new pairs, no assertion (success).
* test 5, reduced to 2 original + 2 new pairs, assertion returned
(failed).
It seems this assertion is linked to lack of variety in the training
corpus. Can anyone confirm this observation? Has anyone ever
experienced this error?
If no one has seen this with a larger corpus, a terminal failure due to
lack of variety is probably acceptable. Would it be accurate to add an
error message to the effect of "terminal error due to insufficient
variety in the training corpus"?
Be my guest. You're probably not the first to encounter these edge cases,
but you are the first to have to deal with them.
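One way to surface that message before mkcls is even started would be a
small pre-check on the prepared corpus. The following is only a sketch,
not part of Moses or (M)GIZA++: the function names are hypothetical, and
the threshold of 5 unique pairs is a placeholder taken from tests 4 and 5
above, not from anything in the mkcls source. The real condition behind
the assertion in StatVar.cpp may well be different.

```python
def corpus_variety(source_lines, target_lines):
    """Return simple variety statistics for a sentence-aligned corpus."""
    pairs = list(zip(source_lines, target_lines))
    # Distinct whitespace-separated tokens on each side.
    source_types = {tok for line in source_lines for tok in line.split()}
    target_types = {tok for line in target_lines for tok in line.split()}
    return {
        "pairs": len(pairs),
        "unique_pairs": len(set(pairs)),
        "source_types": len(source_types),
        "target_types": len(target_types),
    }


def check_variety(source_lines, target_lines, min_unique_pairs=5):
    """Raise ValueError if the corpus has too few distinct sentence pairs.

    min_unique_pairs is an assumed placeholder (tests 4 and 5 above),
    not a limit documented by mkcls.
    """
    stats = corpus_variety(source_lines, target_lines)
    if stats["unique_pairs"] < min_unique_pairs:
        raise ValueError(
            "terminal error due to insufficient variety in the training "
            "corpus: %d unique pairs, need at least %d"
            % (stats["unique_pairs"], min_unique_pairs)
        )
    return stats
```

Copying the same 2 pairs thousands of times (tests 1 and 2) still leaves
only 2 unique pairs, so a check like this would reject that corpus no
matter how many lines it has.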
Thanks for any ideas/suggestions.
Tom
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support