Hi,

[CORPUS:train1]
comment out
get-corpus-script = /home/development/sunayana/POS-eng-kon/corpus
which should point to an actual script; you already specify
factorized-stem = $wmt12-data/train

[LM:nc=pos]
I had some problems with the "=" in corpus names, so it may be better
to go with
[LM:nc-pos]

What is the file "kn.lm"?
factorized-corpus = $wmt12-data/kn.lm
Did you already train a language model?
(1) if yes:
lm = $wmt12-data/kn.lm
(2) if no:
factorized-corpus = $wmt12-data/train.$output-extension

You should also have a surface word language model:
[LM:nc]
order = 5
settings = "-interpolate -unk"
factorized-corpus = $wmt12-data/train.$output-extension

[EVALUATION:test]
You should specify
factorized-input = $wmt12-data/test.en
tokenized-reference = $wmt12-data/test.kn.just-word
and not the sgm specifications.

The reference translations should not be factorized, but should have
only surface forms; the same goes for tuning:
[TUNING]
tokenized-reference = $wmt12-data/tune.kn.just-word

-phi
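Pulled together, these corrections might look like the following sketch.
It is not a complete config: the [TUNING] factorized-input line is an
assumption by symmetry with [EVALUATION:test], the nc-pos stanza reuses
the order and settings from the config quoted later in this thread, and
branch (2) above is taken, i.e. no language model has been trained yet:

[CORPUS:train1]
factorized-stem = $wmt12-data/train

[LM:nc]
order = 5
settings = "-interpolate -unk"
factorized-corpus = $wmt12-data/train.$output-extension

[LM:nc-pos]
factors = "pos"
order = 7
settings = "-interpolate -unk"
factorized-corpus = $wmt12-data/train.$output-extension

[TUNING]
factorized-input = $wmt12-data/tune.en
tokenized-reference = $wmt12-data/tune.kn.just-word

[EVALUATION:test]
factorized-input = $wmt12-data/test.en
tokenized-reference = $wmt12-data/test.kn.just-word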
On Mon, Feb 1, 2016 at 1:03 PM, Sunayana Gawde <[email protected]> wrote:

> Sir,
>
> Here is my config file:
>
> On Mon, Feb 1, 2016 at 11:29 PM, Philipp Koehn <[email protected]> wrote:
>
>> Hi,
>>
>> can you send me your full config file?
>>
>> The example factored model has a surface and a POS LM - so those are
>> the files.
>>
>> Using the same data for language modelling as for translation model
>> training is fine.
>>
>> -phi
>>
>> On Mon, Feb 1, 2016 at 12:07 PM, Sunayana Gawde <[email protected]> wrote:
>>
>>> Sir,
>>>
>>> I have already replaced "\" with "|", but it still gives me the same
>>> error.
>>>
>>> I downloaded the sample data (factored corpus) from the statmt.org
>>> website; it contains surface.lm and pos.lm.
>>>
>>> What are these files? Do I need to have them?
>>>
>>> I have my language model file, which contains the same text data as
>>> my target train file (48500 lines).
>>>
>>> On Mon, Feb 1, 2016 at 9:28 PM, Philipp Koehn <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> one thing that will likely come up: the Moses factored setup uses
>>>> the bar character "|" to separate factors, while you seem to be
>>>> using the backslash "\". So you will have to change that in your
>>>> data.
>>>>
>>>> Otherwise you seem to be on the right track - yes, you need to split
>>>> your data into train/tune/test, and your splits look reasonable (I'd
>>>> prefer a larger tune set for more stability, though).
>>>>
>>>> -phi
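A one-line way to make that separator change, as a sketch only: this
assumes GNU sed and that the backslash occurs in the data solely as the
factor separator, as in the samples in this thread, and uses the file
names mentioned below:

sed -i 's/\\/|/g' train.en train.kn tune.en tune.kn test.en test.kn

Run it on all six data files before EMS picks them up.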
>>>> On Mon, Feb 1, 2016 at 9:41 AM, Sunayana Gawde <[email protected]> wrote:
>>>>
>>>>> Sir,
>>>>>
>>>>> I figured out that I need some additional input files for factored
>>>>> models.
>>>>>
>>>>> What I had was text data of this type:
>>>>>
>>>>> In\IN Shimla\NNP Ice\NNP Skating\NNP Ring\NNP ,\, Roller\NNP
>>>>> Skatin\NNP Ring\NNP etc\NN .\: are\VBP major\JJ skating\NN ring\NN .\.
>>>>>
>>>>> and the same parallel data in Konkani, with POS tags as well.
>>>>>
>>>>> I split the whole data into train (48500), tune (500) and test
>>>>> (1000), so I have six files in total, with extensions .en and .kn.
>>>>>
>>>>> I have one more file, which is a language model in Konkani (kn.lm).
>>>>>
>>>>> So what more do I need to run a config.factored file?
>>>>>
>>>>> Your suggestions will be greatly appreciated.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Mon, Feb 1, 2016 at 3:13 PM, Sunayana Gawde <[email protected]> wrote:
>>>>>
>>>>>> Yeah, that error went away now. Thanks.
>>>>>>
>>>>>> But now I get this error:
>>>>>>
>>>>>> BUGGY CONFIG LINE (40): in : get-corpus-script
>>>>>> 1 ERROR IN CONFIG FILE at
>>>>>> /usr/local/bin/smt/mosesdecoder-3.0/scripts/ems/experiment.perl line 363,
>>>>>> <INI> line 698.
>>>>>>
>>>>>> On Mon, Feb 1, 2016 at 3:03 PM, Sunayana Gawde <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks. I made the changes, but I still get this error:
>>>>>>>
>>>>>>> ERROR: you need to define GENERAL:get-corpus-script
>>>>>>>
>>>>>>> On Sat, Jan 30, 2016 at 10:34 PM, Philipp Koehn <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> remove the IGNORE here:
>>>>>>>> [CORPUS:train1] IGNORE
>>>>>>>>
>>>>>>>> and add an IGNORE here:
>>>>>>>> [LM:nc]
>>>>>>>>
>>>>>>>> Also, your current configuration does not have a surface word
>>>>>>>> language model. You can do without one, but I would expect
>>>>>>>> better results with one.
>>>>>>>>
>>>>>>>> -phi
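In the config quoted just below, those two header changes amount to the
following sketch; the bodies of both sections stay as they are:

[CORPUS:train1]          (was: [CORPUS:train1] IGNORE)
[LM:nc] IGNORE           (was: [LM:nc])

EMS skips any section whose header carries IGNORE, so this enables the
training corpus and disables the not-yet-configured surface language
model.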
>>>>>>>> On Sat, Jan 30, 2016 at 2:28 AM, Sunayana Gawde <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Sir,
>>>>>>>>>
>>>>>>>>> Here is the corpus section of my config file:
>>>>>>>>>
>>>>>>>>> [CORPUS]
>>>>>>>>>
>>>>>>>>> ### long sentences are filtered out, since they slow down GIZA++
>>>>>>>>> # and are a less reliable source of data. set here the maximum
>>>>>>>>> # length of a sentence
>>>>>>>>> #
>>>>>>>>> max-sentence-length = 50
>>>>>>>>>
>>>>>>>>> [CORPUS:train1] IGNORE
>>>>>>>>>
>>>>>>>>> ### command to run to get raw corpus files
>>>>>>>>> #
>>>>>>>>> #get-corpus-script = /home/development/sunayana/POS-eng-kon/corpus
>>>>>>>>>
>>>>>>>>> ### raw corpus files (untokenized, but sentence aligned)
>>>>>>>>> #
>>>>>>>>> #raw-stem = $wmt12-data/training/europarl-v7.$pair-extension
>>>>>>>>>
>>>>>>>>> ### tokenized corpus files (may contain long sentences)
>>>>>>>>> #
>>>>>>>>> #tokenized-stem =
>>>>>>>>>
>>>>>>>>> ### if sentence filtering should be skipped,
>>>>>>>>> # point to the clean training data
>>>>>>>>> #
>>>>>>>>> clean-stem = $wmt12-data/train
>>>>>>>>>
>>>>>>>>> ### if corpus preparation should be skipped,
>>>>>>>>> # point to the prepared training data
>>>>>>>>> #
>>>>>>>>> #lowercased-stem =
>>>>>>>>>
>>>>>>>>> [CORPUS:nc] IGNORE
>>>>>>>>> #raw-stem = $wmt12-data/training/news-commentary-v7.$pair-extension
>>>>>>>>>
>>>>>>>>> [CORPUS:un] IGNORE
>>>>>>>>> #raw-stem = $wmt12-data/training/undoc.2000.$pair-extension
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>> And here is my LM section:
>>>>>>>>>
>>>>>>>>> # srilm
>>>>>>>>> lm-training = $srilm-dir/ngram-count
>>>>>>>>> settings = "-interpolate -kndiscount -unk"
>>>>>>>>>
>>>>>>>>> # order of the language model
>>>>>>>>> order = 5
>>>>>>>>>
>>>>>>>>> ### tool to be used for training randomized language model from scratch
>>>>>>>>> # (more commonly, a SRILM is trained)
>>>>>>>>> #
>>>>>>>>> #rlm-training = "$randlm-dir/buildlm -falsepos 8 -values 8"
>>>>>>>>>
>>>>>>>>> ### script to use for binary table format for irstlm or kenlm
>>>>>>>>> # (default: no binarization)
>>>>>>>>>
>>>>>>>>> # irstlm
>>>>>>>>> #lm-binarizer = $irstlm-dir/compile-lm
>>>>>>>>>
>>>>>>>>> # kenlm, also set type to 8
>>>>>>>>> lm-binarizer = $moses-bin-dir/build_binary
>>>>>>>>> type = 8
>>>>>>>>>
>>>>>>>>> ### script to create quantized language model format (irstlm)
>>>>>>>>> # (default: no quantization)
>>>>>>>>> #
>>>>>>>>> #lm-quantizer = $irstlm-dir/quantize-lm
>>>>>>>>>
>>>>>>>>> ### script to use for converting into randomized table format
>>>>>>>>> # (default: no randomization)
>>>>>>>>> #
>>>>>>>>> #lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"
>>>>>>>>>
>>>>>>>>> ### each language model to be used has its own section here
>>>>>>>>>
>>>>>>>>> [LM:europarl] IGNORE
>>>>>>>>>
>>>>>>>>> ### command to run to get raw corpus files
>>>>>>>>> #
>>>>>>>>> #get-corpus-script = ""
>>>>>>>>>
>>>>>>>>> ### raw corpus (untokenized)
>>>>>>>>> #
>>>>>>>>> #raw-corpus = $wmt12-data/training/europarl-v7.$output-extension
>>>>>>>>>
>>>>>>>>> ### tokenized corpus files (may contain long sentences)
>>>>>>>>> #
>>>>>>>>> #tokenized-corpus =
>>>>>>>>>
>>>>>>>>> ### if corpus preparation should be skipped,
>>>>>>>>> # point to the prepared language model
>>>>>>>>> #
>>>>>>>>> #lm =
>>>>>>>>>
>>>>>>>>> [LM:nc]
>>>>>>>>> #raw-corpus = $wmt12-data/training/news-commentary-v7.$pair-extension.$output-extension
>>>>>>>>>
>>>>>>>>> [LM:un] IGNORE
>>>>>>>>> #raw-corpus = $wmt12-data/training/undoc.2000.$pair-extension.$output-extension
>>>>>>>>>
>>>>>>>>> [LM:news] IGNORE
>>>>>>>>> #raw-corpus = $wmt12-data/training/news.$output-extension.shuffled
>>>>>>>>>
>>>>>>>>> [LM:nc=pos]
>>>>>>>>> factors = "pos"
>>>>>>>>> order = 7
>>>>>>>>> settings = "-interpolate -unk"
>>>>>>>>> clean-corpus = $wmt12-data/kn.lm
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> Here kn.lm is my language model, and the training files are
>>>>>>>>> named train.en and train.kn.
>>>>>>>>> At the beginning I have specified the path to my data files as:
>>>>>>>>> wmt12-data = /home/development/sunayana/POS-eng-kon/corpus
>>>>>>>>>
>>>>>>>>> where the corpus folder contains all the training, tune, LM and
>>>>>>>>> test files.
>>>>>>>>>
>>>>>>>>> I don't understand how to define GENERAL:get-corpus-script.
>>>>>>>>>
>>>>>>>>> Please guide me with this. Thanks
>>>>>>>>>
>>>>>>>>> On Fri, Jan 29, 2016 at 10:28 PM, Philipp Koehn <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> you are not properly specifying your training data in the
>>>>>>>>>> config file. Can you double-check, or post the [CORPUS] and
>>>>>>>>>> [LM] sections of your config file?
>>>>>>>>>>
>>>>>>>>>> -phi
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 28, 2016 at 6:04 AM, Sunayana Gawde <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello all,
>>>>>>>>>>>
>>>>>>>>>>> I am using EMS and the config.factored file from the Moses
>>>>>>>>>>> website.
>>>>>>>>>>>
>>>>>>>>>>> My train, tune and test data is POS-tagged data in the
>>>>>>>>>>> following format:
>>>>>>>>>>>
>>>>>>>>>>> In\IN Shimla\NNP Ice\NNP Skating\NNP Ring\NNP ,\, Roller\NNP
>>>>>>>>>>> Skatin\NNP Ring\NNP etc\NN .\: are\VBP major\JJ skating\NN ring\NN .\.
>>>>>>>>>>>
>>>>>>>>>>> When I run the command:
>>>>>>>>>>>
>>>>>>>>>>> nohup nice /usr/local/bin/smt/mosesdecoder-3.0/scripts/ems/experiment.perl -config config.POSen-kn &> log &
>>>>>>>>>>>
>>>>>>>>>>> I get this error in the log file:
>>>>>>>>>>>
>>>>>>>>>>> ERROR: you need to define GENERAL:get-corpus-script
>>>>>>>>>>>
>>>>>>>>>>> Please help me.
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>> Ms. Sunayana R. Gawde.
>>>>>>>>>>> DCST, Goa University.
>>>>>>>>>>> Please don't print this e-mail unless you really need to.
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
