If you have a very small corpus at hand, just use the Witten-Bell smoothing method. Also, do not go beyond order 3.

Best,
Marcello

Marcello Federico
FBK-irst
Trento, Italy
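A minimal sketch of this advice, reusing the file names from the question quoted below: the -wbdiscount option selects Witten-Bell discounting in SRILM, which does not need the count-of-counts statistics that modified Kneser-Ney requires and so avoids the error below on small, closed-vocabulary POS corpora.

///////////////////////////////////////////////////////////////////
/home/srilm/bin/i686/ngram-count -order 3 -interpolate -wbdiscount -text EN_pos.txt -lm pos.lm
///////////////////////////////////////////////////////////////////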
----- Original Message -----
From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
To: 'Philipp Koehn' <[EMAIL PROTECTED]>
Cc: [email protected] <[email protected]>
Sent: Thu Jun 19 03:28:24 2008
Subject: [Moses-support] encoding for parallel corpus

Hi,

I have a problem. I downloaded the corpus "factored-corpus.tgz" from the Moses page, which contains a file named "pos.lm", and I want to know how to train that file. I POS-tagged my English sentences, e.g. "the|DT light|NN was|VBD red|JJ .|.", and extracted the POS tags to get sentences such as "DT NN VBD JJ ." (a sketch of this extraction step is given after this message). Then I trained on these POS sentences with SRILM using the following command:

///////////////////////////////////////////////////////////////////
/home/srilm/bin/i686/ngram-count -order 3 -interpolate -kndiscount -text EN_pos.txt -lm pos.lm

one of required modified KneserNey count-of-counts is zero
error in discount estimator for order 1
///////////////////////////////////////////////////////////////////

In this case no LM file is generated. When I remove the parameters "-interpolate -kndiscount":

///////////////////////////////////////////////////////////////////
/home/srilm/bin/i686/ngram-count -order 3 -text EN_pos.txt -lm pos.lm

warning: no singleton counts
GT discounting disabled
warning: discount coeff 1 is out of range: 0.666667
warning: discount coeff 2 is out of range: 0.800271
warning: discount coeff 3 is out of range: 0.439665
warning: discount coeff 4 is out of range: 0.918576
warning: discount coeff 6 is out of range: 0.860417
warning: discount coeff 7 is out of range: 0.900741
warning: discount coeff 1 is out of range: 2.25939
warning: discount coeff 3 is out of range: -0.0390595
warning: discount coeff 4 is out of range: 1.6028
warning: discount coeff 5 is out of range: 1.62952
warning: discount coeff 6 is out of range: -0.17675
BOW denominator for context "NN" is zero; scaling probabilities to sum to 1
BOW denominator for context "VB" is zero; scaling probabilities to sum to 1
BOW denominator for context "IN" is zero; scaling probabilities to sum to 1
///////////////////////////////////////////////////////////////////

In this case an LM file is generated, but when I execute the command:

///////////////////////////////////////////////////////////////////
mert-moses.pl input ref moses/moses-cmd/src/moses model/moses.ini -nbest 200 --working-dir tuning --rootdir /home/moses_new/bin/moses-scripts/scripts-20080519-1755
///////////////////////////////////////////////////////////////////

I get this error:

///////////////////////////////////////////////////////////////////
Loading table into memory...done.
Created lexical orientation reordering
Start loading LanguageModel /home/yqhe/iwslt2007/moses_new/enfactordata/lm/en.lm : [0.000] seconds
Start loading LanguageModel /home/yqhe/iwslt2007/moses_new/enfactordata/lm/pos.lm : [1.000] seconds
Finished loading LanguageModels : [1.000] seconds
Start loading PhraseTable /home/yqhe/iwslt2007/moses_new/enfactordata/tuning/filtered/phrase-table.0-0,1.1 : [1.000] seconds
Finished loading phrase tables : [3.000] seconds
Created input-output object : [3.000] seconds
Translating: 哦 那个 航班 是 C 三 零 六 。
moses: LanguageModelSRI.cpp:154: virtual float LanguageModelSRI::GetValue(const std::vector<const Word*, std::allocator<const Word*> >&, const void**, unsigned int*) const: Assertion `(*contextFactor[count-1])[factorType] != __null' failed.
Aborted (core dumped)
Exit code: 134
The decoder died.
///////////////////////////////////////////////////////////////////
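The assertion above comes from Moses's SRILM wrapper: it fires when a word reaching the POS language model is missing the factor that model queries. In a factored setup of this kind, the two LMs are typically bound to factors in the old-style [lmodel-file] section of moses.ini (fields: implementation, factor, order, file; implementation 0 = SRILM), roughly as in the sketch below, with the paths taken from the log above. Every word produced by the phrase table and input must then actually carry factor 1 (the POS tag), or the assertion fails.

///////////////////////////////////////////////////////////////////
[lmodel-file]
0 0 3 /home/yqhe/iwslt2007/moses_new/enfactordata/lm/en.lm
0 1 3 /home/yqhe/iwslt2007/moses_new/enfactordata/lm/pos.lm
///////////////////////////////////////////////////////////////////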
The configuration reported by mert-moses.pl was:

///////////////////////////////////////////////////////////////////
CONFIG WAS -w 0.000000 -lm 0.100000 0.100000 -d 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 0.100000 -tm 0.030000 0.020000 0.030000 0.020000 0.000000
///////////////////////////////////////////////////////////////////

So I don't know how to train an LM file with SRILM. Can you tell me how you trained pos.lm, including the specific ngram-count options?

Best regards,
He Yanqing

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
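As referenced in the message above, a minimal sketch of the tag-extraction step, assuming one sentence of word|TAG tokens per line and a hypothetical input file name tagged_EN.txt:

///////////////////////////////////////////////////////////////////
# delete everything up to and including the '|' in each token,
# turning "the|DT light|NN was|VBD red|JJ .|." into "DT NN VBD JJ ."
sed 's/[^ ]*|//g' tagged_EN.txt > EN_pos.txt
///////////////////////////////////////////////////////////////////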
