Yes, I only ran $SCRIPTS_ROOTDIR/training/clean-corpus-n.perl on both the English and Arabic training sets. I tokenized the English side with the Moses tokenizer, using the lowercase.perl and tokenizer.perl scripts. The Arabic side was tokenized with the MADA tool, and the characters <, >, and | were all normalized into Latin characters.

I suspect that some weird character is appearing in the corpus. When that happens, the training script usually prints the sentence containing such characters and stops building the phrase table. What I have usually done is manually remove the sentence from the training data and the alignment file and continue running the training script, which then finishes successfully, as do tuning and decoding.
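Such sentences could probably be found before training rather than one failure at a time. A minimal sketch, assuming the factored Arabic side uses `|` as the factor separator with two factors per token (surface and stem, matching --translation-factors 0-0,1); the function names, the set of checks, and the file path in the usage note are illustrative assumptions, not part of Moses:

```python
import unicodedata


def bad_tokens(line, n_factors, sep="|"):
    """Return (token, reason) pairs for tokens that look malformed.

    Assumption: every whitespace-separated token must carry exactly
    n_factors non-empty factors joined by sep.
    """
    problems = []
    for tok in line.split():
        factors = tok.split(sep)
        if len(factors) != n_factors:
            reason = "expected %d factor(s), found %d" % (n_factors, len(factors))
            problems.append((tok, reason))
        elif any(f == "" for f in factors):
            problems.append((tok, "empty factor"))
        elif any(unicodedata.category(ch) == "Cc" or ch == "\ufffd" for ch in tok):
            # Cc = control characters; U+FFFD marks bytes that did not decode.
            problems.append((tok, "control or undecodable character"))
    return problems


def scan(path, n_factors):
    """Return (line_number, token, reason) for every suspicious token in a file."""
    found = []
    # errors="replace" turns invalid UTF-8 into U+FFFD instead of raising.
    with open(path, encoding="utf-8", errors="replace") as fh:
        for lineno, line in enumerate(fh, start=1):
            for tok, reason in bad_tokens(line, n_factors):
                found.append((lineno, tok, reason))
    return found
```

For example, `scan("trn.fil.ar", 2)` (a hypothetical path) returns the line numbers to inspect or remove. The same `bad_tokens` check could be applied to the target field of a phrase-table line after splitting that line on ` ||| `.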
In this case I can see that the training ended successfully, and all of the following files were generated:

419M model/aligned.0,1.ar
224M model/aligned.0.ar
256M model/aligned.0.en
236M model/aligned.grow-diag-final-and
1.2G model/extract.0-0,1.inv.sorted.gz
1.2G model/extract.0-0,1.sorted.gz
870M model/extract.0-0.o.sorted.gz
92M model/lex.0-0,1.e2f
92M model/lex.0-0,1.f2e
4.0K model/moses.ini
2.6G model/phrase-table.0-0,1.gz
931M model/reordering-table.0-0.wbe-msd-bidirectional-fe.gz

The error occurs only when the phrase table is loaded from the filtered directory:

4.0K filtered/info
260K filtered/input.1002
4.0K filtered/moses.ini
235M filtered/phrase-table.0-0,1.1.1.gz
957M filtered/reordering-table.0-0.wbe-msd-bidirectional-fe

I even tried decoding alone, without MERT, on the test set after filtering the phrase table with the filter-model-given-input.pl script, and it gave the same error.

If that is the case, is there a way to know on which phrase pair loading failed?

On Sat, Oct 18, 2014 at 10:58 AM, Hieu Hoang <[email protected]> wrote:

> the moses.ini looks ok. Did you clean your training data? Did you tokenize
> it with the moses tokenizer? Did you do anything to your phrase-table?
>
> On 18 October 2014 17:49, Mohammad Salameh <[email protected]> wrote:
>
>> Hi Hieu
>> Please find the moses.ini file attached
>> the exact commands are:
>>
>> ####TRAIN TM
>> $SCRIPTS_ROOTDIR/training/train-model.perl -root-dir $WORK
>> -external-bin-dir $MGIZA_HOME -corpus $WORK/corpus/trn.fil -f en -e ar
>> -alignment grow-diag-final-and -max-phrase-length 8 --translation-factors
>> 0-0,1 --alignment-factors 0-1 -reordering msd-bidirectional-fe -mgiza -lm
>> 0:5:$WORK/lm/ar_surf.lm &>$WORK/training.out
>>
>> ####TUNE
>> mkdir $WORK/tuning/mertA
>> SCRIPTS_ROOTDIR/training/mert-moses.pl $WORK/tuning/dev.en $WORK/tuning/
>> dev.ar $MOSES $WORK/model/moses.ini --working-dir $WORK/tuning/mertA
>> --mertdir $MOSES_HOME/bin --decoder-flags "-threads 11 -max-phrase-length
>> 8" --threads 11 &> $WORK/tuning/mertA/mert.out
>>
>> Thanks,
>> Mohammad
>>
>> On Sat, Oct 18, 2014 at 6:20 AM, Hieu Hoang <[email protected]> wrote:
>>
>>> hi mohammad
>>>
>>> On 17 October 2014 21:45, Mohammad Salameh <[email protected]> wrote:
>>>
>>>> Thanks Hieu,
>>>> I wan to exclude the <s> because I want to translate chunks of source
>>>> sentences with one model, and then add them and their score as extra
>>>> feature to a phrase table of a different model.
>>>> So I don't want the sentence boundaries to be involved in the
>>>> translation.
>>>>
>>> I understand. Moses doesn't allow you to exclude <s>, however, if you
>>> don't want the score for this, then maybe you should write a feature
>>> function to subtract it from the score. Or modify an existing language
>>> model to not score <s>
>>>
>>>> Also, I trained a factored system with --translation-factors 0-0,1.
>>>> The training process ended successfully and I do not see any error
>>>> with the training.out file.
>>>> But the tuning and decoding is ending up with Segmentation Fault error
>>>> when loading the phrase table and when it reaches 3% when loading.
>>>> I have attached the mert.out.
>>>> Would it be possible to know the reason, or which phrases in the phrase
>>>> table is causing the interruption in loading?
>>>>
>>> Can you also send the moses.ini file you used, and the EXACT command you
>>> executed.
>>>
>>>> Thanks,
>>>> Salameh
>>>>
>>>> On Fri, Oct 17, 2014 at 12:57 PM, Hieu Hoang <[email protected]>
>>>> wrote:
>>>>
>>>>> sorry, must have missed your email. Answers below
>>>>>
>>>>> On 16/10/14 20:21, Mohammad Salameh wrote:
>>>>>
>>>>> Hi,
>>>>> any answer to the above questions,
>>>>> Thanks,
>>>>> Salameh
>>>>>
>>>>> On Fri, Oct 10, 2014 at 10:11 AM, Mohammad Salameh <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi
>>>>>> I have few questions on how Moses system works
>>>>>>
>>>>>> 1) would it be possible to do a factored translation where factors
>>>>>> appear in the output but do not be part of the translation process.
>>>>>> For example, I have English surface form on source side and Arabic
>>>>>> surface and their stems on the target side. I want to translate from
>>>>>> English surface form to Arabic surface, but also see the stems
>>>>>> accompanying the surface forms in the output.
>>>>>> I have tried setting --translation-factors 0-0 , but only ended up
>>>>>> with the Arabic surface forms in the output.
>>>>>>
>>>>> I'm not sure what you mean by 'not be part of the translation
>>>>> process'. If you want to see the stem in the output but you don't
>>>>> want it in the translation table, then there needs to be some process
>>>>> that generate the stem, given the target word. Moses has a crude
>>>>> solution - it is called the generation step.
>>>>>
>>>>>> 2) when translating sentences with moses , I assume that moses adds
>>>>>> the sentence boundary markers <s> </s> automatically. Would it be
>>>>>> possible to exclude these from the translation. I need to get
>>>>>> translation scores for chunks of input sentences which does not
>>>>>> involve scores generated based on <s> and </s> from LM or phrase
>>>>>> table.
>>>>>>
>>>>> Yes, it include <s> </s>. No, you can't exclude these from the
>>>>> translation process.
>>>>>
>>>>> I'm curious to know why you want to exclude these
>>>>>
>>>>>> 3) I added additional phrases to the phrase table. Should the
>>>>>> phrase table be sorted again and is it enough to do "LC_ALL=C sort "
>>>>>> on the PT to be used properly ?
>>>>>>
>>>>> Yes, it needs to be sorted again. You must also make sure that the
>>>>> new phrases are not duplicates of existing phrases
>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> _______________________________________________
>>>>>> Moses-support mailing list
>>>>>> [email protected]
>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Moses-support mailing
>>>>> [email protected]
>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>
>>>>
>>>
>>> --
>>> Hieu Hoang
>>> Research Associate
>>> University of Edinburgh
>>> http://www.hoang.co.uk/hieu
>>>
>>
>
> --
> Hieu Hoang
> Research Associate
> University of Edinburgh
> http://www.hoang.co.uk/hieu
>
_______________________________________________ Moses-support mailing list [email protected] http://mailman.mit.edu/mailman/listinfo/moses-support
