Re: [Moses-support] Error while using config.factored

Philipp Koehn Sat, 30 Jan 2016 09:08:29 -0800

Hi,

remove the IGNORE here:
[CORPUS:train1] IGNORE


and add an IGNORE here:
[LM:nc]
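
A minimal sketch of how the two sections would look after this change, using only settings from your quoted config. The likely reason for the error (my reading of the config, not something confirmed in this thread) is that EMS skips sections marked IGNORE, so the clean-stem in train1 is never read, while the active [LM:nc] section has no input defined:

```text
# IGNORE removed, so this clean-stem is now picked up
[CORPUS:train1]
clean-stem = $wmt12-data/train

# IGNORE added; the section defines no input (raw-corpus is commented out)
[LM:nc] IGNORE
#raw-corpus = $wmt12-data/training/news-commentary-v7.$pair-extension.$output-extension
```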

Also, your current configuration does not have a surface word language
model. You can train without one, but I would expect better results with
one.
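
A surface word language model would get its own [LM:...] section next to [LM:nc=pos]. The sketch below is only an illustration: the section name "surface" and the file name "lm-surface" are hypothetical, and it assumes a plain-text target-side corpus without POS tags at that path. The raw-corpus setting is the one from the template in your quoted config.

```text
# hypothetical section name and file name, adjust to your data
[LM:surface]
# assumption: plain target-language text, one sentence per line, no POS factors
raw-corpus = $wmt12-data/lm-surface.$output-extension
```

With no order or settings given in the section, it should fall back to the values in the general LM part of the config (order = 5 in yours).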

-phi

On Sat, Jan 30, 2016 at 2:28 AM, Sunayana Gawde <[email protected]>
wrote:

> Sir,
>
> Here is the corpus section of my config file:
>
> [CORPUS]
>
> ### long sentences are filtered out, since they slow down GIZA++
> # and are a less reliable source of data. set here the maximum
> # length of a sentence
> #
> max-sentence-length = 50
>
> [CORPUS:train1] IGNORE
>
> ### command to run to get raw corpus files
> #
> #get-corpus-script = /home/development/sunayana/POS-eng-kon/corpus
> ### raw corpus files (untokenized, but sentence aligned)
> #
> #raw-stem = $wmt12-data/training/europarl-v7.$pair-extension
>
> ### tokenized corpus files (may contain long sentences)
> #
> #tokenized-stem =
>
> ### if sentence filtering should be skipped,
> # point to the clean training data
> #
> clean-stem = $wmt12-data/train
>
> ### if corpus preparation should be skipped,
> # point to the prepared training data
> #
> #lowercased-stem =
>
> [CORPUS:nc] IGNORE
> #raw-stem = $wmt12-data/training/news-commentary-v7.$pair-extension
>
> [CORPUS:un] IGNORE
> #raw-stem = $wmt12-data/training/undoc.2000.$pair-extension
>
>
> ---------------------------------------------------------------------------------------------
> And here is my LM section:
>
> # srilm
> lm-training = $srilm-dir/ngram-count
> settings = "-interpolate -kndiscount -unk"
>
> # order of the language model
> order = 5
>
> ### tool to be used for training randomized language model from scratch
> # (more commonly, a SRILM is trained)
> #
> #rlm-training = "$randlm-dir/buildlm -falsepos 8 -values 8"
>
> ### script to use for binary table format for irstlm or kenlm
> # (default: no binarization)
>
> # irstlm
> #lm-binarizer = $irstlm-dir/compile-lm
>
> # kenlm, also set type to 8
> lm-binarizer = $moses-bin-dir/build_binary
> type = 8
>
> ### script to create quantized language model format (irstlm)
> # (default: no quantization)
> #
> #lm-quantizer = $irstlm-dir/quantize-lm
>
> ### script to use for converting into randomized table format
> # (default: no randomization)
> #
> #lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"
>
> ### each language model to be used has its own section here
>
> [LM:europarl] IGNORE
>
> ### command to run to get raw corpus files
> #
> #get-corpus-script = ""
>
> ### raw corpus (untokenized)
> #
> #raw-corpus = $wmt12-data/training/europarl-v7.$output-extension
>
> ### tokenized corpus files (may contain long sentences)
> #
> #tokenized-corpus =
>
> ### if corpus preparation should be skipped,
> # point to the prepared language model
> #
> #lm =
>
> [LM:nc]
> #raw-corpus = $wmt12-data/training/news-commentary-v7.$pair-extension.$output-extension
>
> [LM:un] IGNORE
> #raw-corpus = $wmt12-data/training/undoc.2000.$pair-extension.$output-extension
>
> [LM:news] IGNORE
> #raw-corpus = $wmt12-data/training/news.$output-extension.shuffled
>
> [LM:nc=pos]
> factors = "pos"
> order = 7
> settings = "-interpolate -unk"
> clean-corpus = $wmt12-data/kn.lm
>
>
> -------------------------------------------------------------------------------------------------------------
>
> Here kn.lm is my language model, and the training files are named
> train.en and train.kn.
> At the beginning I have specified the path to my data files as:
> wmt12-data = /home/development/sunayana/POS-eng-kon/corpus
>
> where the corpus folder contains all the training, tuning, LM and test
> files.
>
> I don't understand how to define GENERAL:get-corpus-script.
>
> Please guide me with this. Thanks.
>
> On Fri, Jan 29, 2016 at 10:28 PM, Philipp Koehn <[email protected]> wrote:
>
>> Hi,
>>
>> you are not properly specifying your training data in the config file.
>> Can you double check or post the [CORPUS] and [LM] sections of your
>> config file?
>>
>> -phi
>>
>> On Thu, Jan 28, 2016 at 6:04 AM, Sunayana Gawde <
>> [email protected]> wrote:
>>
>>> Hello all,
>>>
>>> I am using EMS and the config.factored file from the Moses website.
>>>
>>> My training, tuning and test data are POS-tagged, in the following
>>> format:
>>>
>>> In\IN Shimla\NNP Ice\NNP Skating\NNP Ring\NNP ,\, Roller\NNP Skatin\NNP
>>> Ring\NNP etc\NN .\: are\VBP major\JJ skating\NN ring\NN .\.
>>>
>>> When I run the command:
>>>
>>> nohup nice
>>> /usr/local/bin/smt/mosesdecoder-3.0/scripts/ems/experiment.perl -config
>>> config.POSen-kn  &> log &
>>>
>>> I get this error in the log file:
>>> ERROR: you need to define GENERAL:get-corpus-script
>>>
>>> Please help me.
>>>
>>> --
>>> *Regards*
>>>
>>> Ms. Sunayana R. Gawde.
>>>
>>> DCST, Goa University.
>>> *Please don't print this e-mail unless you really need to.*
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>>
>>
>
>
