Hi, there was indeed a vertical tab in the corpus.
Thanks to both of you! Patricia
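
A minimal sketch of how such a stray character could be located in a corpus file. The function names and the set of bytes treated as suspicious are assumptions made for illustration; nothing here is part of IRSTLM, SRILM or Moses. The file is read in binary mode on purpose, because Python's str.splitlines() treats a vertical tab as a line break and would hide the problem.

```python
# Control bytes assumed to be unwanted in a tokenized corpus: every C0
# control character (including vertical tab, 0x0b) except tab, LF and CR.
SUSPICIOUS = frozenset(range(0x00, 0x20)) - {0x09, 0x0A, 0x0D}


def find_control_bytes(lines):
    r"""Return (line_number, column, byte_value) for each suspicious byte.

    `lines` is any iterable of bytes objects, one per line. A file opened
    in binary mode qualifies, since iterating over it splits on b"\n" only.
    Line numbers and columns are 1-based; columns count bytes, not characters.
    """
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for col, value in enumerate(line, start=1):
            if value in SUSPICIOUS:
                hits.append((line_no, col, value))
    return hits


def scan_file(path):
    """Scan the file at `path` (hypothetical helper) and return all hits."""
    with open(path, "rb") as handle:
        return find_control_bytes(handle)
```

Calling scan_file() on the corpus and printing each hit with f"0x{value:02x}" would report a vertical tab as 0x0b, the same byte value that appears as 0b in an xxd dump.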



> From: [email protected]
> To: [email protected]; [email protected]
> Subject: Re: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong entry 
> was found (0) in position 1
> Date: Tue, 3 Jul 2012 16:57:39 +0100
> 
> Hi Patricia
> 
> It looks like you have some odd characters in your corpus - perhaps vertical 
> tabs. You could use xxd on the lm file to try to figure out what they are.
> 
> cheers - Barry
> 
> On Tuesday 03 July 2012 16:46:35 Nicholas Ruiz wrote:
> > Hi Patricia,
> > 
> > Unfortunately, I'm not so well versed in SRILM, so I'm not sure I can
> >  answer the question about the blank line appearing in your ARPA file. You
> >  can also try training your model directly with IRSTLM (in text format) and
> >  you can see if the blank line also appears.
> > 
> > tlm -tr=<corpus> -lm=[wb|msb] -n=3 -o=complete_fr.truecased_unique_tok_irst.lm
> > 
> > (I'm not sure what your original params were for the SRI model)
> > wb=Witten-Bell Smoothing
> > msb=Modified Shift-Beta Smoothing
> > 
> > Best,
> > Nick
> > 
> > ________________________________
> > From: Patricia Helmich [[email protected]]
> > Sent: Tuesday, July 03, 2012 5:38 PM
> > To: Nicholas Ruiz
> > Subject: RE: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong
> >  entry was found (0) in position 1
> > 
> > Hi Nick,
> > 
> > ok, here are the first 10 lines of the BLM:
> > 
> > lingua@StatMT24:~/Patricia/Corpora/Corpora_Monoling_Complete/fr$ cat -n
> >  complete_fr.truecased_unique_tok_clean.blm | head
> >      1  blmt 3 1091677 13524189 23061450
> >      2  1091677
> >      3
> >          0
> >      4  ! 0
> >      5  " 0
> >      6  # 0
> >      7  $ 0
> >      8  % 0
> >      9  & 0
> >     10  ' 0
> > 
> > 
> > 
> > It seems that the third line causes the problem, because I deleted it in a
> >  copy of the BLM:
> > 
> > lingua@StatMT24:~/Patricia/Corpora/Corpora_Monoling_Complete/fr$ cat -n
> >  complete_fr.truecased_unique_tok_clean_copy.blm | head
> >      1  blmt 3 1091677 13524189 23061450
> >      2  1091677
> >      3  ! 0
> >      4  " 0
> >      5  # 0
> >      6  $ 0
> >      7  % 0
> >      8  & 0
> >      9  ' 0
> >     10  '00 0
> > 
> > and then I tried to compute the perplexity with the copy of the BLM and it
> >  worked well:
> > 
> > lingua@StatMT24:~/Patricia/Corpora/Corpora_Monoling_Complete/fr$
> >  /home/lingua/smt/irstlm/bin/compile-lm
> >  complete_fr.truecased_unique_tok_clean_copy.blm --eval
> >  /home/lingua/Patricia/Corpora/Corpora_Eval/devtest/nc-test2007.truecased.tok.fr
> > inpfile: complete_fr.truecased_unique_tok_clean_copy.blm
> > loading up to the LM level 1000 (if any)
> > dub: 10000000
> > Language Model Type of complete_fr.truecased_unique_tok_clean_copy.blm is 1
> > blmt
> > loadbin()
> > lmtable::loadbin_dict()
> > dict->size(): 1091677
> > loadbin_level (level 1)
> > loading 1091677 1-grams
> > done (level1)
> > loadbin_level (level 2)
> > loading 13524189 2-grams
> > done (level2)
> > loadbin_level (level 3)
> > loading 23061450 3-grams
> > done (level3)
> > done
> > OOV code is 218080
> > Start Eval
> > OOV code: 218080
> > %% Nw=58714 PP=1.03 PPwp=0.03 Nbo=58713 Noov=105 OOV=0.18%
> > lmtable class statistics
> > levels 3
> > lev 1 entries 1091677 used mem 15.62Mb
> > lev 2 entries 13524189 used mem 193.47Mb
> > lev 3 entries 23061450 used mem 153.95Mb
> > total allocated mem 363.03Mb
> > total number of get and binary search calls
> > level 1 get: 58714 bsearch: 0
> > level 2 get: 58713 bsearch: 117425
> > level 3 get: 58712 bsearch: 0
> > 
> > 
> > In the LM, I also have this empty line:
> > 
> > lingua@StatMT24:~/Patricia/Corpora/Corpora_Monoling_Complete/fr$ cat -n
> >  complete_fr.truecased_unique_tok_clean.lm | head
> >      1
> >      2  \data\
> >      3  ngram 1=1091677
> >      4  ngram 2=13524189
> >      5  ngram 3=23061450
> >      6
> >      7  \1-grams:
> >      8  -7.154682
> >                                 -0.1456359
> >      9  -3.339167       !       -1.472732
> >     10  -2.43139        "       -0.733331
> > 
> > but in the phrase training or the perplexity computation with the LM, this
> >  does not cause any problems.
> > 
> > Also, I'm wondering why there is an entry for an empty line in the LM
> >  because I checked my French corpus and it does not contain any empty
> >  lines.
> > 
> > 
> > Best, Patricia
> > 
> > > From: [email protected]
> > > To: [email protected]
> > > Subject: RE: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong
> > > entry was found (0) in position 1
> > > Date: Tue, 3 Jul 2012 14:59:57 +0000
> > >
> > > Hi Patricia,
> > >
> > > Could you also send me the top 10 lines of your binarized LM?
> > >
> > > head complete_fr.truecased_unique_tok_clean.blm
> > >
> > > Thanks,
> > > Nick
> > >
> > > ________________________________
> > > From: Patricia Helmich [[email protected]]
> > > Sent: Tuesday, July 03, 2012 4:40 PM
> > > To: Nicholas Ruiz; [email protected]
> > > Subject: RE: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong
> > > entry was found (0) in position 1
> > >
> > > Hi Nick,
> > >
> > > for
> > >
> > > /home/lingua/smt/irstlm/bin/compile-lm
> > > complete_fr.truecased_unique_tok_clean.lm --eval
> > > /home/lingua/Patricia/Corpora/Corpora_Eval/devtest/nc-test2007.truecased.tok.fr
> > >
> > > I get the following output:
> > >
> > > inpfile: complete_fr.truecased_unique_tok_clean.lm
> > > loading up to the LM level 1000 (if any)
> > > dub: 10000000
> > > Language Model Type of complete_fr.truecased_unique_tok_clean.lm is 1
> > > \data\
> > > loadtxt_ram()
> > > 1-grams: reading 1091677 entries
> > > done level1
> > > 2-grams: reading 13524189 entries
> > > ..done level2
> > > 3-grams: reading 23061450 entries
> > > ....done level3
> > > done
> > > OOV code is 218081
> > > OOV code is 218081
> > > Start Eval
> > > OOV code: 218081
> > > %% Nw=58714 PP=201.88 PPwp=5.70 Nbo=19233 Noov=105 OOV=0.18%
> > > lmtable class statistics
> > > levels 3
> > > lev 1 entries 1091677 used mem 15.62Mb
> > > lev 2 entries 13524189 used mem 193.47Mb
> > > lev 3 entries 23061450 used mem 153.95Mb
> > > total allocated mem 363.03Mb
> > > total number of get and binary search calls
> > > level 1 get: 3042 bsearch: 0
> > > level 2 get: 58713 bsearch: 23178875
> > > level 3 get: 58712 bsearch: 55672
> > >
> > >
> > >
> > > For
> > >
> > > /home/lingua/smt/irstlm/bin/compile-lm
> > > complete_fr.truecased_unique_tok_clean.blm --eval
> > > /home/lingua/Patricia/Corpora/Corpora_Eval/devtest/nc-test2007.truecased.tok.fr
> > >
> > > I get the same error as in the phrase training:
> > >
> > > inpfile: complete_fr.truecased_unique_tok_clean.blm
> > > loading up to the LM level 1000 (if any)
> > > dub: 10000000
> > > Language Model Type of complete_fr.truecased_unique_tok_clean.blm is 1
> > > blmt
> > > loadbin()
> > > lmtable::loadbin_dict()
> > > dictionary::loadtxt wrong entry was found (0) in position 1
> > >
> > >
> > >
> > > Best,
> > > Patricia
> > >
> > > > From: [email protected]
> > > > To: [email protected]; [email protected]
> > > > Subject: RE: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong
> > > > entry was found (0) in position 1
> > > > Date: Tue, 3 Jul 2012 13:29:26 +0000
> > > >
> > > > Hi Patricia,
> > > >
> > > > Could you try computing the perplexity of your binarized LM with
> > > > compile-lm?
> > > >
> > > > First on the ARPA format (SRILM):
> > > > /home/lingua/smt/irstlm/bin/compile-lm
> > > > complete_fr.truecased_unique_tok_clean.lm --eval <text-to-eval>
> > > >
> > > > and then on the binarized version (before your symbolic link):
> > > > /home/lingua/smt/irstlm/bin/compile-lm
> > > > complete_fr.truecased_unique_tok_clean.blm --eval <text-to-eval>
> > > >
> > > > It might be easier to debug by first looking at the direct output from
> > > > IRSTLM.
> > > >
> > > > Thanks,
> > > > Nick
> > > >
> > > >
> > > > ________________________________
> > > > From: [email protected] [[email protected]] on
> > > > behalf of Patricia Helmich [[email protected]]
> > > > Sent: Tuesday, July 03, 2012 3:07 PM
> > > > To: [email protected]
> > > > Subject: [Moses-support] IRSTLM - Error: dictionary::loadtxt wrong
> > > > entry was found (0) in position 1
> > > >
> > > > Hi,
> > > > I am using Moses in combination with SRILM and IRSTLM for several
> > > > language pairs. After building LMs with SRILM and training the phrase
> > > > model, I try to translate a sentence, for example:
> > > >
> > > > echo "this is a small house" | /home/lingua/smt/moses/bin/moses -f
> > > > model/moses.ini
> > > >
> > > > This works well for each language pair.
> > > >
> > > > Then I produce an IRSTLM binary LM for each language pair, for example:
> > > >
> > > > /home/lingua/smt/irstlm/bin/compile-lm complete_fr.truecased_unique_tok_clean.lm complete_fr.truecased_unique_tok_clean.blm
> > > > ln -s complete_fr.truecased_unique_tok_clean.blm complete_fr.truecased_unique_tok_clean.blm.mm
> > > >
> > > > and I produce binary phrase tables and binary reordering tables:
> > > >
> > > > gzip -cd fr-en/f_en.e_fr/model/phrase-table.gz | LC_ALL=C sort |
> > > > /home/lingua/smt/moses/bin/processPhraseTable -ttable 0 0 - -nscores 5
> > > > -out fr-en/f_en.e_fr/model/phrase-table
> > > >
> > > > gzip -cd
> > > > fr-en/f_en.e_fr/model/reordering-table.wbe-msd-bidirectional-fe.gz |
> > > > LC_ALL=C sort | /home/lingua/smt/moses/bin/processLexicalTable -out
> > > > fr-en/f_en.e_fr/model/reordering-table
> > > >
> > > > Then I create a copy of moses.ini (->moses-bin.ini) and set
> > > > moses-bin.ini to use the binary files.
> > > >
> > > >
> > > > Now I try to translate a sentence with:
> > > >
> > > > echo "this is a small house" | TMP=/tmp
> > > > /home/lingua/smt/moses/bin/moses -v 2 -f model/moses-bin.ini
> > > >
> > > >
> > > > This works well for each language pair, except for the language pair f:
> > > > en, e: fr.
> > > >
> > > > The output is:
> > > >
> > > > Defined parameters (per moses.ini or switch):
> > > > config: model/moses-bin.ini
> > > > distortion-file: 0-0 wbe-msd-bidirectional-fe-allff 6 /home/lingua/Patricia/Corpora/Corpora_Biling/fr-en/f_en.e_fr/model/reordering-table
> > > > distortion-limit: 6
> > > > input-factors: 0
> > > > lmodel-file: 1 0 3 /home/lingua/Patricia/Corpora/Corpora_Monoling_Complete/fr/complete_fr.truecased_unique_tok_clean.blm.mm
> > > > mapping: 0 T 0
> > > > ttable-file: 1 0 0 5 /home/lingua/Patricia/Corpora/Corpora_Biling/fr-en/f_en.e_fr/model/phrase-table
> > > > ttable-limit: 20
> > > > verbose: 2
> > > > weight-d: 0.3 0.3 0.3 0.3 0.3 0.3 0.3
> > > > weight-l: 0.5000
> > > > weight-t: 0.20 0.20 0.20 0.20 0.20
> > > > weight-w: -1
> > > > input type is: text input
> > > > Loading lexical distortion models...have 1 models
> > > > Creating lexical reordering...
> > > > weights: 0.300 0.300 0.300 0.300 0.300 0.300
> > > > binary file loaded, default OFF_T: -1
> > > > Start loading LanguageModel /home/lingua/Patricia/Corpora/Corpora_Monoling_Complete/fr/complete_fr.truecased_unique_tok_clean.blm.mm : [0.000] seconds
> > > > In LanguageModelIRST::Load: nGramOrder = 3
> > > > Language Model Type of /home/lingua/Patricia/Corpora/Corpora_Monoling_Complete/fr/complete_fr.truecased_unique_tok_clean.blm.mm is 1
> > > > blmt
> > > > loadbin()
> > > > lmtable::loadbin_dict()
> > > > dictionary::loadtxt wrong entry was found (0) in position 1
> > > >
> > > > I don't understand the reason for this error. Could you help me with
> > > > this problem?
> > > >
> > > > Thank you,
> > > > Patricia
> > 
> > _______________________________________________
> > Moses-support mailing list
> > [email protected]
> > http://mailman.mit.edu/mailman/listinfo/moses-support
> > 
>  
> --
> Barry Haddow
> University of Edinburgh
> +44 (0) 131 651 3173
> 
> -- 
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> 
                                          