On 15/07/2013 09:10, Tom Hoar wrote:
Thanks, Hieu.
Re the = character. Can I assume this is new in the trunk and does not
affect RELEASE-1.0? That one change would create huge problems for us
when we move away from release 1.0.
Correct, it doesn't affect RELEASE-1.0, just the code in the current
GitHub repository.
Re the mismatch error. This corpus is training a recaser model. We
copied the (e) corpus and applied tokenization, lowercasing, etc. to
the (f) copy. So, it doesn't make any sense. The exact same (e) corpus
was part of a language pair and the standard SMT model trained without
the error.
This train-model.perl run finished and tuning is running without error.
However, the current 8th run has a BLEU score of "only" 0.9846. Based
on previous models, I expected greater than 0.99. This might not sound
important, but in this case it is a huge difference: at 0.984, 70%
of the test segments match the reference; at 0.99, 95% or more do.
Thanks again. I'll look deeper.
On 07/15/2013 02:08 PM, Hieu Hoang wrote:
Sentence mismatch error is definitely an important error. Is there a
problem with your corpus? Dodgy encoding, Windows carriage returns,
running out of disk space, etc.?
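A quick way to look for the corpus problems listed above is to scan both sides of the bitext in parallel. The following is only a minimal sketch: the function names `check_side` and `check_bitext` are made up for illustration and are not part of Moses, and it only covers carriage returns, invalid UTF-8, empty lines, and differing line counts.

```python
from itertools import zip_longest


def check_side(raw):
    """Return a list of issues found in one raw (bytes) corpus line."""
    issues = []
    if b"\r" in raw:
        issues.append("carriage return")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        issues.append("invalid UTF-8")
        return issues
    if not text.strip():
        issues.append("empty line")
    return issues


def check_bitext(src_path, tgt_path):
    """Return (line_number, problem) tuples for a parallel corpus.

    Files are read in binary mode so that lines are split on "\n" only
    and any stray "\r" stays visible.
    """
    problems = []
    with open(src_path, "rb") as src, open(tgt_path, "rb") as tgt:
        for n, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                # One file ended before the other.
                problems.append((n, "line count mismatch"))
                break
            for side, raw in (("source", s), ("target", t)):
                for issue in check_side(raw):
                    problems.append((n, f"{side}: {issue}"))
    return problems
```

Calling `check_bitext` on the two corpus files (for example the `.es` and `.en_us` files behind the `--corpus` stem) returns an empty list when none of these particular problems are present. It does not rule out other causes such as a full disk during an earlier training step.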
Also, don't use the = character in directory names any more. It's
being used to separate key=value pairs. E.g., in the
refactored ini file, a phrase-table entry
0 0 0 5 file
becomes
PhraseDictionaryMemory path=file input-factor=0 output-factor=0
It's not the cause of your error, but it will affect you further down
the line. Sorry, I should have highlighted this potential problem a little more.
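To illustrate why a = inside a path can clash with key=value arguments, here is a deliberately naive sketch. The function `parse_feature_line` is hypothetical and is not how Moses itself parses the ini file; it only shows that once = is the separator, a value containing = becomes ambiguous for a simple splitter.

```python
def parse_feature_line(line):
    """Split 'Name key=value key=value ...' into (name, params).

    Hypothetical parser for illustration only: it splits every
    argument on '=' and rejects anything that does not yield exactly
    one key and one value.
    """
    name, *args = line.split()
    params = {}
    for arg in args:
        parts = arg.split("=")
        if len(parts) != 2:
            raise ValueError(f"ambiguous argument: {arg!r}")
        key, value = parts
        params[key] = value
    return name, params
```

With the line from the message above, this returns the feature name `PhraseDictionaryMemory` and a dictionary with `path`, `input-factor` and `output-factor`. A path such as `path=/tmp/mert-t=es/phrase-table` makes it raise `ValueError`, because the argument splits into three pieces.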
On 15 July 2013 02:07, Tom Hoar wrote:
Here is the command line when I ran train-model.perl.
/usr/bin/perl -w /usr/local/bin/train-model.perl \
--do-steps 3 \
--cores 6 \
--corpus /opt/domy/BUILDS/lm/es-test-retokr/bitext \
--e en_us \
--external-bin-dir /usr/local/bin \
--f es \
--lm 0:0:/tmp/placeholder.lm:0 \
--max-phrase-length 10 \
--mgiza \
--mgiza-cpus 6 \
--model-dir /opt/domy/TRAININGS/merts/mert-t=es-l=es-test-retokr-T=irstlmken-n=12-a=giza-g=10 \
--root-dir /opt/domy/TRAININGS/merts/mert-t=es-l=es-test-retokr-T=irstlmken-n=12-a=giza-g=10
The log output has a non-fatal error "Sentence mismatch error!" Any
ideas about the cause or importance?
(3) generate word alignment @ Mon Jul 15 07:44:56 ICT 2013
Combining forward and inverted alignment from files:
/opt/domy/TRAININGS/merts/mert-t=es-l=es-test-retokr-T=irstlmken-n=12-a=giza-g=10/giza.es-en_us/es-en_us.A3.final.{bz2,gz}
/opt/domy/TRAININGS/merts/mert-t=es-l=es-test-retokr-T=irstlmken-n=12-a=giza-g=10/giza.en_us-es/en_us-es.A3.final.{bz2,gz}
Executing: mkdir -p
/opt/domy/TRAININGS/merts/mert-t=es-l=es-test-retokr-T=irstlmken-n=12-a=giza-g=10
Executing:
/usr/local/lib/mosesdecoder/scripts/training/giza2bal.pl -d
"gzip -cd
/opt/domy/TRAININGS/merts/mert-t=es-l=es-test-retokr-T=irstlmken-n=12-a=giza-g=10/giza.en_us-es/en_us-es.A3.final.gz"
-i "gzip -cd
/opt/domy/TRAININGS/merts/mert-t=es-l=es-test-retokr-T=irstlmken-n=12-a=giza-g=10/giza.es-en_us/es-en_us.A3.final.gz"
|/usr/local/lib/mosesdecoder/scripts/../bin/symal -alignment="grow"
-diagonal="yes" -final="yes" -both="no" >
/opt/domy/TRAININGS/merts/mert-t=es-l=es-test-retokr-T=irstlmken-n=12-a=giza-g=10/aligned.grow-diag-final
symal: computing grow alignment: diagonal (1) final (1)both-uncovered (0)
Sentence mismatch error! Line #1179689
skip=<0> counts=<1227038>
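One way to follow up on the reported line number is to pull that line out of both sides of the bitext and compare them by eye. This is a minimal sketch: `nth_line` is a made-up helper, and it assumes the line number printed by symal corresponds to a sentence pair index in the training corpus, which may not hold exactly.

```python
from itertools import islice


def nth_line(path, n):
    """Return line n (1-indexed) of a file as bytes, or None if the file is shorter.

    Binary mode keeps carriage returns and undecodable bytes visible
    instead of silently normalising them.
    """
    with open(path, "rb") as f:
        return next(islice(f, n - 1, n), None)
```

For the run above, this could be called as `nth_line("/opt/domy/BUILDS/lm/es-test-retokr/bitext.es", 1179689)` and again with the `.en_us` file, assuming the corpus files follow the usual stem-plus-language-extension naming implied by `--corpus`, `--f` and `--e`. Printing the returned bytes with `repr()` makes odd characters easy to spot.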
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu