I think (M)GIZA++ expertise has faded over the years, but I'm hoping someone has some ideas about this.

A user started training with an extremely small TMX file containing a few dozen parallel segments. Our preparation tools reduced the parallel corpus to only 2 pairs (attached). The user was running the Windows version (mkcls.exe), but we reproduced the same error on Linux. The train-model.perl script failed in step 1 (log also attached). Specifically, the mkcls binary failed with this error message:

   Assertion failed!
   File: src/mkcls/StatVar.cpp, Line 110
   Expression: index>=0&&index<n

I've never seen this error with a corpus of respectable size, so I ran some tests.

 * test 1: copied the same sentences 14,500 times; assertion fired (failed).
 * test 2: copied each of the pairs only 7,250 times; assertion fired (failed).
 * test 3: added 4,000 unique pairs; no assertion (success).
 * test 4: reduced to 2 original + 3 new pairs; no assertion (success).
 * test 5: reduced to 2 original + 2 new pairs; assertion returned (failed).
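
To make "variety" concrete, here is a minimal sketch (Python, not part of the Moses scripts) that counts distinct whitespace-separated tokens in the English side of the attached corpus. The function name is hypothetical, and it is only an assumption that the number of word types is what matters to mkcls. It does show why tests 1 and 2 could not help: duplicating segments adds lines but no new word types.

```python
def count_word_types(lines):
    """Return the number of distinct whitespace-separated tokens."""
    types = set()
    for line in lines:
        types.update(line.split())
    return len(types)


# The two English segments from the attached corpus.
original = [
    "for example : sunday .",
    "to register , contact your local branch of the road safety association .",
]

# Tests 1 and 2 repeated the same segments thousands of times.
duplicated = original * 7250

print(len(original), count_word_types(original))
print(len(duplicated), count_word_types(duplicated))
```

Both calls report the same number of word types, even though the duplicated corpus has 14,500 lines.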

It seems this assertion is linked to a lack of variety in the training corpus. Can anyone confirm this observation? Has anyone else run into this error?

If no one has seen this with a larger corpus, failing hard on a lack of variety is probably the right behavior. Would it be accurate to add an error message to the effect of "terminal error due to insufficient variety in the training corpus"?
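
A rough sketch of what such a pre-flight check might look like, written in Python for illustration only. The names `InsufficientVarietyError` and `check_variety` and the `min_types` threshold are hypothetical; nothing like this exists in train-model.perl or mkcls as far as I know. The value 50 in the usage example is borrowed from the `-c50` argument in the log, and the idea that the word-type count should be compared against it is an unconfirmed assumption.

```python
class InsufficientVarietyError(Exception):
    """Raised when a corpus has too few distinct word types (hypothetical)."""


def check_variety(lines, min_types):
    """Count distinct tokens and fail early if there are fewer than min_types."""
    types = set()
    for line in lines:
        types.update(line.split())
    if len(types) < min_types:
        raise InsufficientVarietyError(
            "terminal error due to insufficient variety in the training corpus "
            f"({len(types)} word types found, at least {min_types} expected)"
        )
    return len(types)


if __name__ == "__main__":
    corpus = [
        "for example : sunday .",
        "to register , contact your local branch of the road safety association .",
    ]
    try:
        # 50 mirrors the -c50 argument in the log; this link is an assumption.
        check_variety(corpus, min_types=50)
    except InsufficientVarietyError as err:
        print(err)
```

Running the check before mkcls would let the training script stop with a readable message instead of a bare assertion, provided the threshold really does reflect what mkcls needs.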

Thanks for any ideas/suggestions.
Tom


***** STEP 1 **************************************************************
COMMAND LINE:
C:\Strawberry\perl\bin\perl.exe -w C:\Users\tahoar\workbench\slate-toolkit\scripts\training\train-model.perl ^
 --root-dir C:\Users\tahoar\workbench\slate-desktop\var\TRAININGS\smt-tm-de_de-en_us-DANIEL ^
 --e de_de ^
 --do-steps 1 ^
 --f en_us ^
 --mgiza ^
 --config C:\Users\tahoar\workbench\slate-desktop\var\TRAININGS\smt-tm-de_de-en_us-DANIEL\model.giza.grow-diag-final-and\moses.7.en_us-de_de.ini ^
 --lm 0:11111:C:\placeholder.lm:9 ^
 --external-bin-dir C:\Users\tahoar\workbench\slate-toolkit\bin ^
 --max-phrase-length 7 ^
 --temp-dir c:\users\tahoar\appdata\local\temp\Slate_Desktop-7288\train-tm-0\trainer,train,210,1,train-tm\1 ^
 --continue ^
 --reordering msd-bidirectional-fe ^
 --cores 7 ^
 --corpus C:\Users\tahoar\workbench\slate-desktop\var\TRAININGS\smt-tm-de_de-en_us-DANIEL\bitext ^
 --mgiza-cpus 7 ^
 --alignment grow-diag-final-and
Using SCRIPTS_ROOTDIR: C:\Users\tahoar\workbench\slate-toolkit\scripts
Using: $SPLIT_EXEC ="split.exe"
Using: $SORT_EXEC="sort.exe"
Using: $GZIP_EXEC="gzip.exe -q"
Using: $GUNZIP_EXEC="gzip.exe -q -d"
Using: $BZCAT_EXEC="bzcat.exe -q -d -c"
Using: $ZCAT_EXEC="gzip.exe -q -d -c"
Using: $CAT_EXEC="type"
Using DEBUG: 0
(1) preparing corpus @ Fri Mar 25 21:20:09 +0700 2016
(1.0) selecting factors @ Fri Mar 25 21:20:09 +0700 2016
(1.1) running mkcls @ Fri Mar 25 21:20:09 +0700 2016
Executing: C:\Users\tahoar\workbench\slate-toolkit\bin\mkcls.exe -c50 -n2 -p"C:\Users\tahoar\workbench\slate-desktop\var\TRAININGS\smt-tm-de_de-en_us-DANIEL\bitext.en_us" -V"C:\Users\tahoar\workbench\slate-desktop\var\TRAININGS\smt-tm-de_de-en_us-DANIEL\giza.de_de-en_us\en_us.vcb.classes" opt 1>&2

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
Assertion failed!

Program: C:\Users\tahoar\workbench\slate-toolkit\bin\mkcls.exe
File: src/mkcls/StatVar.cpp, Line 110

Expression: index>=0&&index<n
ERROR: Failed to execute C:\Users\tahoar\workbench\slate-toolkit\bin\mkcls.exe
failed  perl.exe
fail    train-model.perl step 1
beispiel : sonntag .
um sich anzumelden , wenden sie sich bitte an die örtliche vertretung der gesellschaft für verkehrssicherheit .
for example : sunday .
to register , contact your local branch of the road safety association .
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
