Hi,

there was indeed a vertical tab in the corpus. Thanks to both of you!

Patricia
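For anyone hitting the same error: the sketch below is one possible way to locate stray control characters such as a vertical tab before building the LM. It is only an illustration, not part of SRILM, IRSTLM or Moses; the names `find_control_chars` and `strip_control_chars` are made up for this example, and replacing the character with a space (rather than deleting it) is an assumption about what suits a tokenized corpus.

```python
import re

# Control characters other than tab (\x09), newline (\x0a) and carriage
# return (\x0d). Vertical tab (\x0b) and form feed (\x0c) are included.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def find_control_chars(lines):
    """Yield (line_number, column, codepoint) for each stray control character."""
    for line_number, line in enumerate(lines, start=1):
        for match in CONTROL_CHARS.finditer(line):
            yield line_number, match.start() + 1, ord(match.group())


def strip_control_chars(line):
    """Replace each stray control character with a single space (an assumption)."""
    return CONTROL_CHARS.sub(" ", line)


# Small in-memory demo; the second line contains a vertical tab.
sample = ["le chat\n", "un\x0bmot\n"]
for line_number, column, codepoint in find_control_chars(sample):
    print(f"line {line_number}, column {column}: U+{codepoint:04X}")
# prints: line 2, column 3: U+000B
```

To check a real file, pass an open file object instead of the list, for example `open(path, encoding="utf-8", newline="\n")`, where `path` is a placeholder for the corpus file. With `newline="\n"` Python splits lines only on `\n`, so the reported line numbers should match what `cat -n` shows.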
> From: [email protected]
> To: [email protected]; [email protected]
> Subject: Re: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong entry was found (0) in position 1
> Date: Tue, 3 Jul 2012 16:57:39 +0100
>
> Hi Patricia
>
> It looks like you have some odd characters in your corpus - perhaps vertical
> tabs. You could use xxd on the lm file to try to figure out what it is,
>
> cheers - Barry
>
> On Tuesday 03 July 2012 16:46:35 Nicholas Ruiz wrote:
> > Hi Patricia,
> >
> > Unfortunately, I'm not so well versed in SRILM, so I'm not sure I can
> > answer the question about the blank line appearing in your ARPA file. You
> > can also try training your model directly with IRSTLM (in text format) and
> > you can see if the blank line also appears.
> >
> > tlm -tr=<corpus> -lm=[wb|msb] -n=3 -o=complete_fr.truecased_unique_tok_irst.lm
> >
> > (I'm not sure what you original params were for the SRI model)
> > wb=Witten-Bell Smoothing
> > msb=Modified Shift-Beta Smoothing
> >
> > Best,
> > Nick
> >
> > ________________________________
> > From: Patricia Helmich [[email protected]]
> > Sent: Tuesday, July 03, 2012 5:38 PM
> > To: Nicholas Ruiz
> > Subject: RE: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong entry was found (0) in position 1
> >
> > Hi Nick,
> >
> > ok, here are the first 10 lines of the BLM:
> >
> > lingua@StatMT24:~/Patricia/Corpora/Corpora_Monoling_Complete/fr$ cat -n complete_fr.truecased_unique_tok_clean.blm | head
> >      1  blmt 3 1091677 13524189 23061450
> >      2  1091677
> >      3
> > 0
> >      4  ! 0
> >      5  " 0
> >      6  # 0
> >      7  $ 0
> >      8  % 0
> >      9  & 0
> >     10  ' 0
> >
> >
> >
> > It seems that the third line causes the problems because I deleted it in a
> > copy of the BLM
> >
> > lingua@StatMT24:~/Patricia/Corpora/Corpora_Monoling_Complete/fr$ cat -n complete_fr.truecased_unique_tok_clean_copy.blm | head
> >      1  blmt 3 1091677 13524189 23061450
> >      2  1091677
> >      3  ! 0
> >      4  " 0
> >      5  # 0
> >      6  $ 0
> >      7  % 0
> >      8  & 0
> >      9  ' 0
> >     10  '00 0
> >
> > and then I tried to compute the perplexity with the copy of the BLM and it
> > worked well:
> >
> > lingua@StatMT24:~/Patricia/Corpora/Corpora_Monoling_Complete/fr$
> > /home/lingua/smt/irstlm/bin/compile-lm
> > complete_fr.truecased_unique_tok_clean_copy.blm --eval
> > /home/lingua/Patricia/Corpora/Corpora_Eval/devtest/nc-test2007.truecased.tok.fr
> > inpfile: complete_fr.truecased_unique_tok_clean_copy.blm
> > loading up to the LM level 1000 (if any)
> > dub: 10000000
> > Language Model Type of complete_fr.truecased_unique_tok_clean_copy.blm is 1
> > blmt
> > loadbin()
> > lmtable::loadbin_dict()
> > dict->size(): 1091677
> > loadbin_level (level 1)
> > loading 1091677 1-grams
> > done (level1)
> > loadbin_level (level 2)
> > loading 13524189 2-grams
> > done (level2)
> > loadbin_level (level 3)
> > loading 23061450 3-grams
> > done (level3)
> > done
> > OOV code is 218080
> > Start Eval
> > OOV code: 218080
> > %% Nw=58714 PP=1.03 PPwp=0.03 Nbo=58713 Noov=105 OOV=0.18%
> > lmtable class statistics
> > levels 3
> > lev 1 entries 1091677 used mem 15.62Mb
> > lev 2 entries 13524189 used mem 193.47Mb
> > lev 3 entries 23061450 used mem 153.95Mb
> > total allocated mem 363.03Mb
> > total number of get and binary search calls
> > level 1 get: 58714 bsearch: 0
> > level 2 get: 58713 bsearch: 117425
> > level 3 get: 58712 bsearch: 0
> >
> >
> > In the LM, I have also this empty line
> >
> > lingua@StatMT24:~/Patricia/Corpora/Corpora_Monoling_Complete/fr$ cat -n complete_fr.truecased_unique_tok_clean.lm | head
> >      1
> >      2  \data\
> >      3  ngram 1=1091677
> >      4  ngram 2=13524189
> >      5  ngram 3=23061450
> >      6
> >      7  \1-grams:
> >      8  -7.154682
> > -0.1456359
> >      9  -3.339167 ! -1.472732
> >     10  -2.43139 " -0.733331
> >
> > but in the phrase training or the perplexity computation with the LM, this
> > does not cause any problems.
> >
> > Also, I'm wondering why there is an entry for an empty line in the LM
> > because I checked my french corpus and it does not contain any empty
> > lines.
> >
> >
> > Best, Patricia
> >
> > > From: [email protected]
> > > To: [email protected]
> > > Subject: RE: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong entry was found (0) in position 1
> > > Date: Tue, 3 Jul 2012 14:59:57 +0000
> > >
> > > Hi Patricia,
> > >
> > > Could you also send me the top 10 lines of your binarized LM?
> > >
> > > head complete_fr.truecased_unique_tok_clean.blm
> > >
> > > Thanks,
> > > Nick
> > >
> > > ________________________________
> > > From: Patricia Helmich [[email protected]]
> > > Sent: Tuesday, July 03, 2012 4:40 PM
> > > To: Nicholas Ruiz; [email protected]
> > > Subject: RE: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong entry was found (0) in position 1
> > >
> > > Hi Nick,
> > >
> > > for
> > >
> > > /home/lingua/smt/irstlm/bin/compile-lm
> > > complete_fr.truecased_unique_tok_clean.lm --eval
> > > /home/lingua/Patricia/Corpora/Corpora_Eval/devtest/nc-test2007.truecased.tok.fr
> > >
> > > I get the following output:
> > >
> > > inpfile: complete_fr.truecased_unique_tok_clean.lm
> > > loading up to the LM level 1000 (if any)
> > > dub: 10000000
> > > Language Model Type of complete_fr.truecased_unique_tok_clean.lm is 1
> > > \data\
> > > loadtxt_ram()
> > > 1-grams: reading 1091677 entries
> > > done level1
> > > 2-grams: reading 13524189 entries
> > > ..done level2
> > > 3-grams: reading 23061450 entries
> > > ....done level3
> > > done
> > > OOV code is 218081
> > > OOV code is 218081
> > > Start Eval
> > > OOV code: 218081
> > > %% Nw=58714 PP=201.88 PPwp=5.70 Nbo=19233 Noov=105 OOV=0.18%
> > > lmtable class statistics
> > > levels 3
> > > lev 1 entries 1091677 used mem 15.62Mb
> > > lev 2 entries 13524189 used mem 193.47Mb
> > > lev 3 entries 23061450 used mem 153.95Mb
> > > total allocated mem 363.03Mb
> > > total number of get and binary search calls
> > > level 1 get: 3042 bsearch: 0
> > > level 2 get: 58713 bsearch: 23178875
> > > level 3 get: 58712 bsearch: 55672
> > >
> > >
> > >
> > > For
> > >
> > > /home/lingua/smt/irstlm/bin/compile-lm
> > > complete_fr.truecased_unique_tok_clean.blm --eval
> > > /home/lingua/Patricia/Corpora/Corpora_Eval/devtest/nc-test2007.truecased.tok.fr
> > >
> > > I get the same error as in the phrase training:
> > >
> > > inpfile: complete_fr.truecased_unique_tok_clean.blm
> > > loading up to the LM level 1000 (if any)
> > > dub: 10000000
> > > Language Model Type of complete_fr.truecased_unique_tok_clean.blm is 1
> > > blmt
> > > loadbin()
> > > lmtable::loadbin_dict()
> > > dictionary::loadtxt wrong entry was found (0) in position 1
> > >
> > >
> > >
> > > Best,
> > > Patricia
> > >
> > > > From: [email protected]
> > > > To: [email protected]; [email protected]
> > > > Subject: RE: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong entry was found (0) in position 1
> > > > Date: Tue, 3 Jul 2012 13:29:26 +0000
> > > >
> > > > Hi Patricia,
> > > >
> > > > Could you try computing the perplexity of your binarized LM with
> > > > compile-lm?
> > > >
> > > > First on the ARPA format (SRILM):
> > > > /home/lingua/smt/irstlm/bin/compile-lm complete_fr.truecased_unique_tok_clean.lm --eval <text-to-eval>
> > > >
> > > > and then on the binarized version (before your symbolic link):
> > > > /home/lingua/smt/irstlm/bin/compile-lm complete_fr.truecased_unique_tok_clean.blm --eval <text-to-eval>
> > > >
> > > > It might be easier to debug by first looking at the direct output from
> > > > IRSTLM.
> > > >
> > > > Thanks,
> > > > Nick
> > > >
> > > >
> > > > ________________________________
> > > > From: [email protected] [[email protected]] on behalf of Patricia Helmich [[email protected]]
> > > > Sent: Tuesday, July 03, 2012 3:07 PM
> > > > To: [email protected]
> > > > Subject: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong entry was found (0) in position 1
> > > >
> > > > Hi,
> > > > I am using Moses in combination with SRILM and IRSTLM for several
> > > > language pairs.
> > > > After building LMs with SRILM and training the phrase
> > > > model, I try to translate a sentence, for example:
> > > >
> > > > echo "this is a small house" | /home/lingua/smt/moses/bin/moses -f model/moses.ini
> > > >
> > > > This works well for each language pair.
> > > >
> > > > Then I produce an IRSTLM binary LM for each language pair, for example:
> > > >
> > > > /home/lingua/smt/irstlm/bin/compile-lm complete_fr.truecased_unique_tok_clean.lm complete_fr.truecased_unique_tok_clean.blm
> > > > ln -s complete_fr.truecased_unique_tok_clean.blm complete_fr.truecased_unique_tok_clean.blm.mm
> > > >
> > > > and I produce binary phrase tables and binary reordering tables:
> > > >
> > > > gzip -cd fr-en/f_en.e_fr/model/phrase-table.gz | LC_ALL=C sort |
> > > >   /home/lingua/smt/moses/bin/processPhraseTable -ttable 0 0 - -nscores 5 -out fr-en/f_en.e_fr/model/phrase-table
> > > > gzip -cd fr-en/f_en.e_fr/model/reordering-table.wbe-msd-bidirectional-fe.gz | LC_ALL=C sort |
> > > >   /home/lingua/smt/moses/bin/processLexicalTable -out fr-en/f_en.e_fr/model/reordering-table
> > > >
> > > > Then I create a copy of moses.ini (->moses-bin.ini) and set
> > > > moses-bin.ini to use the binary files.
> > > >
> > > >
> > > > Now I try to translate a sentence with:
> > > >
> > > > echo "this is a small house" | TMP=/tmp /home/lingua/smt/moses/bin/moses -v 2 -f model/moses-bin.ini
> > > >
> > > >
> > > > This works well for each language pair, except for the language pair
> > > > f: en, e: fr.
> > > >
> > > > The output is:
> > > >
> > > > Defined parameters (per moses.ini or switch):
> > > > config: model/moses-bin.ini
> > > > distortion-file: 0-0 wbe-msd-bidirectional-fe-allff 6 /home/lingua/Patricia/Corpora/Corpora_Biling/fr-en/f_en.e_fr/model/reordering-table
> > > > distortion-limit: 6
> > > > input-factors: 0
> > > > lmodel-file: 1 0 3 /home/lingua/Patricia/Corpora/Corpora_Monoling_Complete/fr/complete_fr.truecased_unique_tok_clean.blm.mm
> > > > mapping: 0 T 0
> > > > ttable-file: 1 0 0 5 /home/lingua/Patricia/Corpora/Corpora_Biling/fr-en/f_en.e_fr/model/phrase-table
> > > > ttable-limit: 20
> > > > verbose: 2
> > > > weight-d: 0.3 0.3 0.3 0.3 0.3 0.3 0.3
> > > > weight-l: 0.5000
> > > > weight-t: 0.20 0.20 0.20 0.20 0.20
> > > > weight-w: -1
> > > > input type is: text input
> > > > Loading lexical distortion models...have 1 models
> > > > Creating lexical reordering...
> > > > weights: 0.300 0.300 0.300 0.300 0.300 0.300
> > > > binary file loaded, default OFF_T: -1
> > > > Start loading LanguageModel /home/lingua/Patricia/Corpora/Corpora_Monoling_Complete/fr/complete_fr.truecased_unique_tok_clean.blm.mm : [0.000] seconds
> > > > In LanguageModelIRST::Load: nGramOrder = 3
> > > > Language Model Type of /home/lingua/Patricia/Corpora/Corpora_Monoling_Complete/fr/complete_fr.truecased_unique_tok_clean.blm.mm is 1
> > > > blmt
> > > > loadbin()
> > > > lmtable::loadbin_dict()
> > > > dictionary::loadtxt wrong entry was found (0) in position 1
> > > >
> > > > I don't understand the reason for this error. Could you help me with
> > > > this problem?
> > > >
> > > > Thank you,
> > > > Patricia
> >
> > _______________________________________________
> > Moses-support mailing list
> > [email protected]
> > http://mailman.mit.edu/mailman/listinfo/moses-support
> >
>
> --
> Barry Haddow
> University of Edinburgh
> +44 (0) 131 651 3173
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
