Hi,

remove the IGNORE here:
[CORPUS:train1] IGNORE

and add an IGNORE here:
[LM:nc]

Also, your current configuration does not have a surface word language model.
You can do this, but I would expect better results with one.

-phi

On Sat, Jan 30, 2016 at 2:28 AM, Sunayana Gawde <[email protected]> wrote:

> Sir,
>
> Here is the corpus section of my config file:
>
> [CORPUS]
>
> ### long sentences are filtered out, since they slow down GIZA++
> # and are a less reliable source of data. set here the maximum
> # length of a sentence
> #
> max-sentence-length = 50
>
> [CORPUS:train1] IGNORE
>
> ### command to run to get raw corpus files
> #
> #get-corpus-script = /home/development/sunayana/POS-eng-kon/corpus
> ### raw corpus files (untokenized, but sentence aligned)
> #
> #raw-stem = $wmt12-data/training/europarl-v7.$pair-extension
>
> ### tokenized corpus files (may contain long sentences)
> #
> #tokenized-stem =
>
> ### if sentence filtering should be skipped,
> # point to the clean training data
> #
> clean-stem = $wmt12-data/train
>
> ### if corpus preparation should be skipped,
> # point to the prepared training data
> #
> #lowercased-stem =
>
> [CORPUS:nc] IGNORE
> #raw-stem = $wmt12-data/training/news-commentary-v7.$pair-extension
>
> [CORPUS:un] IGNORE
> #raw-stem = $wmt12-data/training/undoc.2000.$pair-extension
>
>
> ---------------------------------------------------------------------------------------------
> And here is my LM section:
>
> # srilm
> lm-training = $srilm-dir/ngram-count
> settings = "-interpolate -kndiscount -unk"
>
> # order of the language model
> order = 5
>
> ### tool to be used for training randomized language model from scratch
> # (more commonly, a SRILM is trained)
> #
> #rlm-training = "$randlm-dir/buildlm -falsepos 8 -values 8"
>
> ### script to use for binary table format for irstlm or kenlm
> # (default: no binarization)
>
> # irstlm
> #lm-binarizer = $irstlm-dir/compile-lm
>
> # kenlm, also set type to 8
> lm-binarizer = $moses-bin-dir/build_binary
> type = 8
>
> ### script to create quantized language model format (irstlm)
> # (default: no quantization)
> #
> #lm-quantizer = $irstlm-dir/quantize-lm
>
> ### script to use for converting into randomized table format
> # (default: no randomization)
> #
> #lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"
>
> ### each language model to be used has its own section here
>
> [LM:europarl] IGNORE
>
> ### command to run to get raw corpus files
> #
> #get-corpus-script = ""
>
> ### raw corpus (untokenized)
> #
> #raw-corpus = $wmt12-data/training/europarl-v7.$output-extension
>
> ### tokenized corpus files (may contain long sentences)
> #
> #tokenized-corpus =
>
> ### if corpus preparation should be skipped,
> # point to the prepared language model
> #
> #lm =
>
> [LM:nc]
> #raw-corpus = $wmt12-data/training/news-commentary-v7.$pair-extension.$output-extension
>
> [LM:un] IGNORE
> #raw-corpus = $wmt12-data/training/undoc.2000.$pair-extension.$output-extension
>
> [LM:news] IGNORE
> #raw-corpus = $wmt12-data/training/news.$output-extension.shuffled
>
> [LM:nc=pos]
> factors = "pos"
> order = 7
> settings = "-interpolate -unk"
> clean-corpus = $wmt12-data/kn.lm
>
>
> -------------------------------------------------------------------------------------------------------------
>
> Here kn.lm is my language model and training files are named as train.en
> and train.kn.
> In the beginning i have specified the path to my data files as:
> wmt12-data = /home/development/sunayana/POS-eng-kon/corpus
>
> where corpus folder contains all the training, tune,LM and test files.
>
> I dont understand how to define GENERAL:get-corpus-script.
>
> Please guide me with this. Thanks
>
> On Fri, Jan 29, 2016 at 10:28 PM, Philipp Koehn <[email protected]> wrote:
>
>> Hi,
>>
>> you are not properly specifying your training data in the config file.
>> Can you double check or post the [CORPUS] and [LM] sections of your
>> config file?
>>
>> -phi
>>
>> On Thu, Jan 28, 2016 at 6:04 AM, Sunayana Gawde <[email protected]> wrote:
>>
>>> Hello all,
>>>
>>> I am using EMS and the config.factored file from moses website.
>>>
>>> My train, tune and test data is a POS tagged data in the following
>>> format:
>>>
>>> In\IN Shimla\NNP Ice\NNP Skating\NNP Ring\NNP ,\, Roller\NNP Skatin\NNP
>>> Ring\NNP etc\NN .\: are\VBP major\JJ skating\NN ring\NN .\.
>>>
>>> when i run the command:
>>>
>>> nohup nice
>>> /usr/local/bin/smt/mosesdecoder-3.0/scripts/ems/experiment.perl -config
>>> config.POSen-kn &> log &
>>>
>>> i get the error in log file:
>>> ERROR: you need to define GENERAL:get-corpus-script
>>>
>>> Please help me.
>>>
>>> --
>>> *Regards*
>>>
>>> Ms. Sunayana R. Gawde.
>>>
>>> DCST, Goa University.
>>> *Please don't print this e-mail unless you really need to.*
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>>
>>
>
>
> --
> *Regards*
>
> Ms. Sunayana R. Gawde.
>
> DCST, Goa University.
> *Please don't print this e-mail unless you really need to.*
>
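
For reference, a minimal sketch of how the affected sections would look after the change suggested at the top of this message. It assumes everything else in the posted config stays as it is: only the two IGNORE markers move, and all values are copied from the quoted config.

```
# training corpus section: IGNORE removed, so this section is used
[CORPUS:train1]
clean-stem = $wmt12-data/train

# this LM section defines no corpus, so it is switched off
[LM:nc] IGNORE
#raw-corpus = $wmt12-data/training/news-commentary-v7.$pair-extension.$output-extension

# POS language model section, unchanged
[LM:nc=pos]
factors = "pos"
order = 7
settings = "-interpolate -unk"
clean-corpus = $wmt12-data/kn.lm
```

With [LM:nc] ignored there is no longer an active LM section without any corpus specified, which is presumably what led to the get-corpus-script error. Note that this sketch still has no surface word language model.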
