Hi, I agree with you, it's kind of weird. As you said, I used "compile-lm" to get my SRI language model into binary format. My first attempt was to run the decoder compiled with IRSTLM, but I got the segmentation fault error.
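For reference, the IRST-route binarization described above can be sketched as follows; the file names come from this thread, while the IRSTLM install path is a placeholder:

```shell
# Convert an ARPA-format LM (here built with SRILM's ngram-count)
# into IRSTLM's binary format; /path/to/irstlm is a placeholder.
/path/to/irstlm/bin/compile-lm corpus.ca.lm ca.blm
```

The resulting ca.blm is then referenced with LM type 1 in moses.ini (as done later in this thread, e.g. "1 0 5 /home/esca/ESCA/lm/ca.blm") and requires a decoder built with IRSTLM support.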
Then I ran the decoder compiled with SRILM with the following setting: "0 0 5 /home/esca/ESCA/lm/ca.blm". The decoder ran, but the translation wasn't good at all. That was my mistake: it seems a language model in IRST binary format is not supposed to work with a decoder compiled against SRILM.

My last attempt was to use SRI's own binary format. Following SRI's FAQ (http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html), I rebuilt my language model with the command "ngram-count ... -lm /NEWLM/ -write-binary-lm". The decoder ran and the output was fine. The binary file is smaller than the non-binarized SRI language model and loads faster, but loading still takes an average of 7 seconds on a quad-core system.

My goal was to minimize the loading time of the translation and reordering tables and of the LM. The tables load almost instantly, but the LM does not. I would like to know what the differences between the IRST and SRI binary formats are, and whether one is better than the other. At first I thought the only way to get a binary LM was to use the IRST tools, since SRI's method isn't mentioned in the Moses documentation. Is there any reason not to use SRI's binary format?

Thanks for your help.

Regards,
Miguel

Philipp Koehn wrote:
> Hi,
>
> this is very weird. You are using the 'irstlm/src/compile-lm' command, are you?
> I was first a bit confused (actually still am), because there is also a SRILM
> binary format.
>
> -phi
>
> On Wed, Jul 23, 2008 at 10:50 AM, Miguel José Hernández Vidal
> <[EMAIL PROTECTED]> wrote:
>
>> Hi Philipp,
>>
>> Thanks for your advice. Maybe I've done something wrong, although I followed
>> Moses' documentation guidelines.
>>
>> First, I compiled a separate Moses build with '--with-irstlm'.
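[For anyone reproducing this, the '--with-irstlm' build mentioned here looked roughly like the following in Moses of that era; the install prefix is a placeholder, not taken from the thread:

```shell
# Rebuild Moses with IRSTLM support; the prefix is a placeholder.
cd ~/moses
./regenerate-makefiles.sh
./configure --with-irstlm=/path/to/irstlm
make
```

A decoder built this way loads type-1 (IRSTLM) language models; a build configured with --with-srilm instead loads type-0 (SRILM) models.]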
>> Next I ran the following to get a binarized version of my SRI
>> language model:
>> $ ./compile-lm corpus.ca.lm ca.blm
>>
>> Then I updated my moses.ini with the new settings:
>> 1 0 5 /home/esca/ESCA/lm/ca.blm
>>
>> Finally, I ran the Moses binary compiled with the IRSTLM version and got
>> the 'segmentation fault' error.
>>
>> I did manage to run the binarized SRI model in the following way:
>>
>> After 'compile-lm' I updated moses.ini:
>> 0 0 5 /home/esca/ESCA/lm/ca.blm
>>
>> And then I ran Moses (compiled with SRILM) without any errors.
>>
>> I thought binarized language models had to be decoded with the IRST-compiled
>> version of Moses. Am I wrong?
>>
>> Regards,
>> Miguel
>>
>> Philipp Koehn wrote:
>>
>>> Hi,
>>>
>>> To use the binarized IRST LM, you just need to compile the SRILM LM,
>>> no need to train the model with IRST tools. See the Moses documentation
>>> for details.
>>>
>>> -phi
>>>
>>> On Tue, Jul 22, 2008 at 12:31 PM, Miguel José Hernández Vidal
>>> <[EMAIL PROTECTED]> wrote:
>>>
>>>> I've also tried to run moses with a binarized (with compile-lm) SRI
>>>> language model.
>>>> When I run the decoder I see a segmentation fault error:
>>>>
>>>> ----------------------------------------------------------------------
>>>> [EMAIL PROTECTED]:~$ ~/moses/moses-cmd/src/moses -config ~/ESCA/model/moses.ini -input-file ~/ESCA/tuning/input > ~/ESCA/evaluation/output
>>>> Defined parameters (per moses.ini or switch):
>>>>         config: /home/esca/ESCA/model/moses.ini
>>>>         distortion-file: 0-0 msd-bidirectional-fe 6 /home/esca/ESCA/model/reordering
>>>>         distortion-limit: 6
>>>>         input-factors: 0
>>>>         input-file: /home/esca/ESCA/tuning/input
>>>>         lmodel-file: 1 0 5 /home/esca/ESCA/lm/ca.blm
>>>>         mapping: 0 T 0
>>>>         ttable-file: 0 0 5 /home/esca/ESCA/model/phrase-table
>>>>         ttable-limit: 20
>>>>         weight-d: 0.3 0.3 0.3 0.3 0.3 0.3 0.3
>>>>         weight-l: 0.5000
>>>>         weight-t: 0.2 0.2 0.2 0.2 0.2
>>>>         weight-w: -1
>>>> Loading lexical distortion models...
>>>> have 1 models
>>>> Creating lexical reordering...
>>>> weights: 0.300 0.300 0.300 0.300 0.300 0.300
>>>> binary file loaded, default OFF_T: -1
>>>> Created lexical orientation reordering
>>>> Start loading LanguageModel /home/esca/ESCA/lm/ca.blm : [1.000] seconds
>>>> In LanguageModelIRST::Load: nGramOrder = 5
>>>> Loading LM file (no MAP)
>>>> blmt
>>>> loadbin()
>>>> loading 321187 1-grams
>>>> loading 4548952 2-grams
>>>> loading 2785668 3-grams
>>>> loading 2501764 4-grams
>>>> loading 1741048 5-grams
>>>> done
>>>> OOV code is 37189
>>>> IRST: m_unknownId=37189
>>>> Fallo de segmentación (core dumped)  # i.e. "Segmentation fault (core dumped)"
>>>> ----------------------------------------------------------------------
>>>>
>>>> I am using binarized phrase and reordering tables, but they worked fine
>>>> when I built them with my old SRILM system.
>>>>
>>>> Thanks for your help.
>>>>
>>>> Regards,
>>>>
>>>> Miguel
>>>>
>>>> Miguel José Hernández Vidal wrote:
>>>>
>>>>> Hi list,
>>>>>
>>>>> I am trying to build my LM with the IRST toolkit. First, I added <s>
>>>>> tags with 'add-start-end.sh' and, of course, my data is tokenized and
>>>>> lowercased.
>>>>>
>>>>> When I run 'build-lm.sh' it looks like it works fine, but at the end
>>>>> of the process no output file is found.
>>>>> Here's the log:
>>>>>
>>>>> ----------------------------------------------------------------------
>>>>> [EMAIL PROTECTED]:~/irstlm/bin$ bash build-lm.sh -i ~/corpus/tag.es -o ~/corpus/ca.lm -n 3 -k 5 -s kneser-ney
>>>>> Cleaning temporary directory stat
>>>>> Extracting dictionary from training corpus
>>>>> Splitting dictionary into 5 lists
>>>>> Extracting n-gram statistics for each word list
>>>>> dict.000
>>>>> dict.001
>>>>> dict.002
>>>>> dict.003
>>>>> dict.004
>>>>> Estimating language models for each word list
>>>>> dict.000
>>>>> dict.001
>>>>> dict.002
>>>>> dict.003
>>>>> dict.004
>>>>> Merging language models into /home/esca/corpus/ca.lm
>>>>> Cleaning temporary directory stat
>>>>> ----------------------------------------------------------------------
>>>>>
>>>>> I've tried different corpus sizes, but that didn't work either.
>>>>> By the way, I am running the scripts under Ubuntu 7.04 32-bit.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Miguel
>>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> [email protected]
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>
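[As a closing note, the SRILM-native binarization described at the top of the thread — per the SRILM FAQ — can be sketched roughly as follows; the corpus name, smoothing options, and output file name are placeholders, since the original command was elided:

```shell
# Re-estimate the LM and write it directly in SRILM's binary format
# via -write-binary-lm; names and smoothing options are placeholders.
ngram-count -order 5 -text corpus.ca -kndiscount -interpolate \
    -lm ca.sblm -write-binary-lm
```

A type-0 lmodel-file line in moses.ini (e.g. "0 0 5 /path/to/ca.sblm") then loads this file with a SRILM-compiled decoder.]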
