Hi,

[CORPUS:train1]
comment out
get-corpus-script = /home/development/sunayana/POS-eng-kon/corpus
which should point to an actual script; you already specify
factorized-stem = $wmt12-data/train

[LM:nc=pos]
I had some problems with the "=" in corpus names, so it may be better
to go with
[LM:nc-pos]

What is the file "kn.lm"?
factorized-corpus = $wmt12-data/kn.lm
Did you already train a language model?
(1) if yes:
lm = $wmt12-data/kn.lm
(2) if no:
factorized-corpus = $wmt12-data/train.$output-extension

You should also have a surface word language model:
[LM:nc]
order = 5
settings = "-interpolate -unk"
factorized-corpus = $wmt12-data/train.$output-extension

[EVALUATION:test]
You should specify
factorized-input = $wmt12-data/test.en
tokenized-reference = $wmt12-data/test.kn.just-word
and not the sgm specifications.

The reference translations should not be factorized, but should have
only surface forms; the same goes for tuning:
[TUNING]
tokenized-reference = $wmt12-data/tune.kn.just-word

-phi
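Pulled together, these corrections might look like the following sketch.
It is not a complete config: the [TUNING] factorized-input line is an
assumption by symmetry with [EVALUATION:test], the nc-pos stanza reuses
the order and settings from the config quoted later in this thread, and
branch (2) above is taken, i.e. no language model has been trained yet:

[CORPUS:train1]
factorized-stem = $wmt12-data/train

[LM:nc]
order = 5
settings = "-interpolate -unk"
factorized-corpus = $wmt12-data/train.$output-extension

[LM:nc-pos]
factors = "pos"
order = 7
settings = "-interpolate -unk"
factorized-corpus = $wmt12-data/train.$output-extension

[TUNING]
factorized-input = $wmt12-data/tune.en
tokenized-reference = $wmt12-data/tune.kn.just-word

[EVALUATION:test]
factorized-input = $wmt12-data/test.en
tokenized-reference = $wmt12-data/test.kn.just-word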
On Mon, Feb 1, 2016 at 1:03 PM, Sunayana Gawde <[email protected]> wrote:

> Sir,
>
> Here is my config file:
>
> On Mon, Feb 1, 2016 at 11:29 PM, Philipp Koehn <[email protected]> wrote:
>
>> Hi,
>>
>> can you send me your full config file?
>>
>> The example factored model has a surface and a POS LM - so those are
>> the files.
>>
>> Using the same data for language modelling as for translation model
>> training is fine.
>>
>> -phi
>>
>> On Mon, Feb 1, 2016 at 12:07 PM, Sunayana Gawde <[email protected]> wrote:
>>
>>> Sir,
>>>
>>> I have already replaced "\" with "|", but it still gives me the same
>>> error.
>>>
>>> I downloaded the sample data (factored corpus) from the statmt.org
>>> website; it contains surface.lm and pos.lm.
>>>
>>> What are these files? Do I need to have them?
>>>
>>> I have my language model file, which contains the same text data as
>>> my target train file (48500 lines).
>>>
>>> On Mon, Feb 1, 2016 at 9:28 PM, Philipp Koehn <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> one thing that will likely come up: the Moses factored setup uses
>>>> the bar character "|" to separate factors, while you seem to be
>>>> using the backslash "\". So you will have to change that in your
>>>> data.
>>>>
>>>> Otherwise you seem to be on the right track - yes, you need to split
>>>> your data into train/tune/test, and your splits look reasonable (I'd
>>>> prefer a larger tune set for more stability, though).
>>>>
>>>> -phi
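A one-line way to make that separator change, as a sketch only: this
assumes GNU sed and that the backslash occurs in the data solely as the
factor separator, as in the samples in this thread, and uses the file
names mentioned below:

sed -i 's/\\/|/g' train.en train.kn tune.en tune.kn test.en test.kn

Run it on all six data files before EMS picks them up.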
>>>> On Mon, Feb 1, 2016 at 9:41 AM, Sunayana Gawde <[email protected]> wrote:
>>>>
>>>>> Sir,
>>>>>
>>>>> I figured out that I need some additional input files for factored
>>>>> models.
>>>>>
>>>>> What I had was text data of this type:
>>>>>
>>>>> In\IN Shimla\NNP Ice\NNP Skating\NNP Ring\NNP ,\, Roller\NNP
>>>>> Skatin\NNP Ring\NNP etc\NN .\: are\VBP major\JJ skating\NN ring\NN .\.
>>>>>
>>>>> and the same parallel data in Konkani, with POS tags as well.
>>>>>
>>>>> I split the whole data into train (48500), tune (500) and test
>>>>> (1000), so I have six files in total, with extensions .en and .kn.
>>>>>
>>>>> I have one more file, which is a language model in Konkani (kn.lm).
>>>>>
>>>>> So what more do I need to run a config.factored file?
>>>>>
>>>>> Your suggestions will be greatly appreciated.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Mon, Feb 1, 2016 at 3:13 PM, Sunayana Gawde <[email protected]> wrote:
>>>>>
>>>>>> Yeah, that error went away now. Thanks.
>>>>>>
>>>>>> But now I get this error:
>>>>>>
>>>>>> BUGGY CONFIG LINE (40): in : get-corpus-script
>>>>>> 1 ERROR IN CONFIG FILE at
>>>>>> /usr/local/bin/smt/mosesdecoder-3.0/scripts/ems/experiment.perl line 363,
>>>>>> <INI> line 698.
>>>>>>
>>>>>> On Mon, Feb 1, 2016 at 3:03 PM, Sunayana Gawde <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks. I made the changes, but I still get this error:
>>>>>>>
>>>>>>> ERROR: you need to define GENERAL:get-corpus-script
>>>>>>>
>>>>>>> On Sat, Jan 30, 2016 at 10:34 PM, Philipp Koehn <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> remove the IGNORE here:
>>>>>>>> [CORPUS:train1] IGNORE
>>>>>>>>
>>>>>>>> and add an IGNORE here:
>>>>>>>> [LM:nc]
>>>>>>>>
>>>>>>>> Also, your current configuration does not have a surface word
>>>>>>>> language model. You can do without one, but I would expect
>>>>>>>> better results with one.
>>>>>>>>
>>>>>>>> -phi
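In the config quoted just below, those two header changes amount to the
following sketch; the bodies of both sections stay as they are:

[CORPUS:train1]          (was: [CORPUS:train1] IGNORE)
[LM:nc] IGNORE           (was: [LM:nc])

EMS skips any section whose header carries IGNORE, so this enables the
training corpus and disables the not-yet-configured surface language
model.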
>>>>>>>> On Sat, Jan 30, 2016 at 2:28 AM, Sunayana Gawde <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Sir,
>>>>>>>>>
>>>>>>>>> Here is the corpus section of my config file:
>>>>>>>>>
>>>>>>>>> [CORPUS]
>>>>>>>>>
>>>>>>>>> ### long sentences are filtered out, since they slow down GIZA++
>>>>>>>>> # and are a less reliable source of data. set here the maximum
>>>>>>>>> # length of a sentence
>>>>>>>>> #
>>>>>>>>> max-sentence-length = 50
>>>>>>>>>
>>>>>>>>> [CORPUS:train1] IGNORE
>>>>>>>>>
>>>>>>>>> ### command to run to get raw corpus files
>>>>>>>>> #
>>>>>>>>> #get-corpus-script = /home/development/sunayana/POS-eng-kon/corpus
>>>>>>>>>
>>>>>>>>> ### raw corpus files (untokenized, but sentence aligned)
>>>>>>>>> #
>>>>>>>>> #raw-stem = $wmt12-data/training/europarl-v7.$pair-extension
>>>>>>>>>
>>>>>>>>> ### tokenized corpus files (may contain long sentences)
>>>>>>>>> #
>>>>>>>>> #tokenized-stem =
>>>>>>>>>
>>>>>>>>> ### if sentence filtering should be skipped,
>>>>>>>>> # point to the clean training data
>>>>>>>>> #
>>>>>>>>> clean-stem = $wmt12-data/train
>>>>>>>>>
>>>>>>>>> ### if corpus preparation should be skipped,
>>>>>>>>> # point to the prepared training data
>>>>>>>>> #
>>>>>>>>> #lowercased-stem =
>>>>>>>>>
>>>>>>>>> [CORPUS:nc] IGNORE
>>>>>>>>> #raw-stem = $wmt12-data/training/news-commentary-v7.$pair-extension
>>>>>>>>>
>>>>>>>>> [CORPUS:un] IGNORE
>>>>>>>>> #raw-stem = $wmt12-data/training/undoc.2000.$pair-extension
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>> And here is my LM section:
>>>>>>>>>
>>>>>>>>> # srilm
>>>>>>>>> lm-training = $srilm-dir/ngram-count
>>>>>>>>> settings = "-interpolate -kndiscount -unk"
>>>>>>>>>
>>>>>>>>> # order of the language model
>>>>>>>>> order = 5
>>>>>>>>>
>>>>>>>>> ### tool to be used for training randomized language model from scratch
>>>>>>>>> # (more commonly, a SRILM is trained)
>>>>>>>>> #
>>>>>>>>> #rlm-training = "$randlm-dir/buildlm -falsepos 8 -values 8"
>>>>>>>>>
>>>>>>>>> ### script to use for binary table format for irstlm or kenlm
>>>>>>>>> # (default: no binarization)
>>>>>>>>>
>>>>>>>>> # irstlm
>>>>>>>>> #lm-binarizer = $irstlm-dir/compile-lm
>>>>>>>>>
>>>>>>>>> # kenlm, also set type to 8
>>>>>>>>> lm-binarizer = $moses-bin-dir/build_binary
>>>>>>>>> type = 8
>>>>>>>>>
>>>>>>>>> ### script to create quantized language model format (irstlm)
>>>>>>>>> # (default: no quantization)
>>>>>>>>> #
>>>>>>>>> #lm-quantizer = $irstlm-dir/quantize-lm
>>>>>>>>>
>>>>>>>>> ### script to use for converting into randomized table format
>>>>>>>>> # (default: no randomization)
>>>>>>>>> #
>>>>>>>>> #lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"
>>>>>>>>>
>>>>>>>>> ### each language model to be used has its own section here
>>>>>>>>>
>>>>>>>>> [LM:europarl] IGNORE
>>>>>>>>>
>>>>>>>>> ### command to run to get raw corpus files
>>>>>>>>> #
>>>>>>>>> #get-corpus-script = ""
>>>>>>>>>
>>>>>>>>> ### raw corpus (untokenized)
>>>>>>>>> #
>>>>>>>>> #raw-corpus = $wmt12-data/training/europarl-v7.$output-extension
>>>>>>>>>
>>>>>>>>> ### tokenized corpus files (may contain long sentences)
>>>>>>>>> #
>>>>>>>>> #tokenized-corpus =
>>>>>>>>>
>>>>>>>>> ### if corpus preparation should be skipped,
>>>>>>>>> # point to the prepared language model
>>>>>>>>> #
>>>>>>>>> #lm =
>>>>>>>>>
>>>>>>>>> [LM:nc]
>>>>>>>>> #raw-corpus = $wmt12-data/training/news-commentary-v7.$pair-extension.$output-extension
>>>>>>>>>
>>>>>>>>> [LM:un] IGNORE
>>>>>>>>> #raw-corpus = $wmt12-data/training/undoc.2000.$pair-extension.$output-extension
>>>>>>>>>
>>>>>>>>> [LM:news] IGNORE
>>>>>>>>> #raw-corpus = $wmt12-data/training/news.$output-extension.shuffled
>>>>>>>>>
>>>>>>>>> [LM:nc=pos]
>>>>>>>>> factors = "pos"
>>>>>>>>> order = 7
>>>>>>>>> settings = "-interpolate -unk"
>>>>>>>>> clean-corpus = $wmt12-data/kn.lm
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> Here kn.lm is my language model, and the training files are
>>>>>>>>> named train.en and train.kn.
>>>>>>>>> At the beginning I have specified the path to my data files as:
>>>>>>>>> wmt12-data = /home/development/sunayana/POS-eng-kon/corpus
>>>>>>>>>
>>>>>>>>> where the corpus folder contains all the training, tune, LM and
>>>>>>>>> test files.
>>>>>>>>>
>>>>>>>>> I don't understand how to define GENERAL:get-corpus-script.
>>>>>>>>>
>>>>>>>>> Please guide me with this. Thanks
>>>>>>>>>
>>>>>>>>> On Fri, Jan 29, 2016 at 10:28 PM, Philipp Koehn <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> you are not properly specifying your training data in the
>>>>>>>>>> config file. Can you double-check, or post the [CORPUS] and
>>>>>>>>>> [LM] sections of your config file?
>>>>>>>>>>
>>>>>>>>>> -phi
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 28, 2016 at 6:04 AM, Sunayana Gawde <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello all,
>>>>>>>>>>>
>>>>>>>>>>> I am using EMS and the config.factored file from the Moses
>>>>>>>>>>> website.
>>>>>>>>>>>
>>>>>>>>>>> My train, tune and test data is POS-tagged data in the
>>>>>>>>>>> following format:
>>>>>>>>>>>
>>>>>>>>>>> In\IN Shimla\NNP Ice\NNP Skating\NNP Ring\NNP ,\, Roller\NNP
>>>>>>>>>>> Skatin\NNP Ring\NNP etc\NN .\: are\VBP major\JJ skating\NN ring\NN .\.
>>>>>>>>>>>
>>>>>>>>>>> When I run the command:
>>>>>>>>>>>
>>>>>>>>>>> nohup nice /usr/local/bin/smt/mosesdecoder-3.0/scripts/ems/experiment.perl -config config.POSen-kn &> log &
>>>>>>>>>>>
>>>>>>>>>>> I get this error in the log file:
>>>>>>>>>>>
>>>>>>>>>>> ERROR: you need to define GENERAL:get-corpus-script
>>>>>>>>>>>
>>>>>>>>>>> Please help me.
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>> Ms. Sunayana R. Gawde.
>>>>>>>>>>> DCST, Goa University.
>>>>>>>>>>> Please don't print this e-mail unless you really need to.
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
