Yes, that's the problem. The tokenizer escapes these characters. Alternatively, the escape-special-chars.perl script escapes these characters but doesn't tokenize.

On 19 Oct 2014 20:53, "Mohammad Salameh" <[email protected]> wrote:
> Hi Hieu,
> It seems I have found the error.
> The occurrence of double brackets "[[" in the phrase table is causing
> the problem.
> An example of such a phrase is:
>
> , after ||| ,|, [[NS]]|[[NS]] bEd|bEd ||| 0.333333 0.308788 0.00021796 1.11908e-05 ||| 0-0 1-2 ||| 3 4588 1 ||| |||
>
> It seems the decoder is confusing such occurrences with the Hiero rule table format.
> Although this special character is escaped by the tokenizer.perl
> script (which was only used for the English side of my training corpus),
> I thought that clean-corpus-n.perl would also escape such instances, as it is
> applied to both the source and target.
> Thanks,
> Salameh
>
> On Sun, Oct 19, 2014 at 10:41 AM, Hieu Hoang <[email protected]> wrote:
>
>> ok, it looks like your data is cleaned, specifically, that the
>> characters | < > have been escaped.
>>
>> i'm not really sure why it segfaults. Is there a disk space problem?
>>
>> You may have to run it with a debugger to find out. If you still can't
>> find the problem after a few days, please make your filtered model files
>> available for download and I'll try and debug it for you
>>
>> On 18/10/14 20:10, Mohammad Salameh wrote:
>>
>> Yes, I only used $SCRIPTS_ROOTDIR/training/clean-corpus-n.perl on
>> both the English and Arabic training sets.
>> I tokenized the English part with the Moses tokenizer, where I used
>> the lowercase.perl and tokenizer.perl scripts.
>> The Arabic part is tokenized using the MADA tool, and the characters
>> <, >, and | were all normalized into Latin characters.
>> I suspect that some weird character appears in the corpus.
>> When I have such a case, usually the training script prints the
>> sentence with such characters and stops building the phrase table.
>> What I usually did previously was to manually remove the sentence
>> from the training data and alignment file and continue running the training
>> script, which then ends successfully, along with tuning and decoding.
>>
>> In this case I can see that the training ended successfully, and all
>> the following files were generated:
>>
>> 419M model/aligned.0,1.ar
>> 224M model/aligned.0.ar
>> 256M model/aligned.0.en
>> 236M model/aligned.grow-diag-final-and
>> 1.2G model/extract.0-0,1.inv.sorted.gz
>> 1.2G model/extract.0-0,1.sorted.gz
>> 870M model/extract.0-0.o.sorted.gz
>> 92M model/lex.0-0,1.e2f
>> 92M model/lex.0-0,1.f2e
>> 4.0K model/moses.ini
>> 2.6G model/phrase-table.0-0,1.gz
>> 931M model/reordering-table.0-0.wbe-msd-bidirectional-fe.gz
>>
>> The error occurs only when the phrase table is loaded from the filtered
>> directory:
>>
>> 4.0K filtered/info
>> 260K filtered/input.1002
>> 4.0K filtered/moses.ini
>> 235M filtered/phrase-table.0-0,1.1.1.gz
>> 957M filtered/reordering-table.0-0.wbe-msd-bidirectional-fe
>>
>> I even tried decoding alone without mert on the test set after
>> filtering the phrase table using the filter-model-given-input.pl script, and
>> it gave the same error.
>> If that is the case, is there a way to know on which phrase pair
>> loading failed?
>>
>> On Sat, Oct 18, 2014 at 10:58 AM, Hieu Hoang <[email protected]> wrote:
>>
>>> the moses.ini looks ok. Did you clean your training data? Did you
>>> tokenize it with the moses tokenizer? Did you do anything to your
>>> phrase-table?
>>>
>>> On 18 October 2014 17:49, Mohammad Salameh <[email protected]> wrote:
>>>
>>>> Hi Hieu
>>>> Please find the moses.ini file attached.
>>>> The exact commands are:
>>>>
>>>> ####TRAIN TM
>>>> $SCRIPTS_ROOTDIR/training/train-model.perl -root-dir $WORK
>>>> -external-bin-dir $MGIZA_HOME -corpus $WORK/corpus/trn.fil -f en -e ar
>>>> -alignment grow-diag-final-and -max-phrase-length 8 --translation-factors
>>>> 0-0,1 --alignment-factors 0-1 -reordering msd-bidirectional-fe -mgiza -lm
>>>> 0:5:$WORK/lm/ar_surf.lm &>$WORK/training.out
>>>>
>>>> ####TUNE
>>>> mkdir $WORK/tuning/mertA
>>>> $SCRIPTS_ROOTDIR/training/mert-moses.pl $WORK/tuning/dev.en
>>>> $WORK/tuning/dev.ar $MOSES $WORK/model/moses.ini --working-dir
>>>> $WORK/tuning/mertA --mertdir $MOSES_HOME/bin --decoder-flags "-threads 11
>>>> -max-phrase-length 8" --threads 11 &> $WORK/tuning/mertA/mert.out
>>>>
>>>> Thanks,
>>>> Mohammad
>>>>
>>>> On Sat, Oct 18, 2014 at 6:20 AM, Hieu Hoang <[email protected]>
>>>> wrote:
>>>>
>>>>> hi mohammad
>>>>>
>>>>> On 17 October 2014 21:45, Mohammad Salameh <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks Hieu,
>>>>>> I want to exclude the <s> because I want to translate chunks of source
>>>>>> sentences with one model, and then add them and their scores as an extra
>>>>>> feature to a phrase table of a different model.
>>>>>> So I don't want the sentence boundaries to be involved in the
>>>>>> translation.
>>>>>>
>>>>> I understand. Moses doesn't allow you to exclude <s>, however, if
>>>>> you don't want the score for this, then maybe you should write a feature
>>>>> function to subtract it from the score. Or modify an existing language
>>>>> model to not score <s>
>>>>>
>>>>>> Also, I trained a factored system with --translation-factors
>>>>>> 0-0,1. The training process ended successfully and I do not see any error
>>>>>> in the training.out file.
>>>>>> But tuning and decoding end with a Segmentation Fault
>>>>>> error while loading the phrase table, when loading reaches 3%.
>>>>>> I have attached the mert.out.
>>>>>> Would it be possible to know the reason, or which phrases in the
>>>>>> phrase table are causing the interruption in loading?
>>>>>>
>>>>> Can you also send the moses.ini file you used, and the EXACT command
>>>>> you executed.
>>>>>
>>>>>> Thanks,
>>>>>> Salameh
>>>>>>
>>>>>> On Fri, Oct 17, 2014 at 12:57 PM, Hieu Hoang <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> sorry, must have missed your email. Answers below
>>>>>>>
>>>>>>> On 16/10/14 20:21, Mohammad Salameh wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>> any answer to the above questions?
>>>>>>> Thanks,
>>>>>>> Salameh
>>>>>>>
>>>>>>> On Fri, Oct 10, 2014 at 10:11 AM, Mohammad Salameh <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>> I have a few questions on how the Moses system works
>>>>>>>>
>>>>>>>> 1) Would it be possible to do a factored translation where
>>>>>>>> factors appear in the output but are not part of the translation
>>>>>>>> process?
>>>>>>>> For example, I have English surface forms on the source side and Arabic
>>>>>>>> surface forms and their stems on the target side. I want to translate
>>>>>>>> from English surface form to Arabic surface form, but also see the stems
>>>>>>>> accompanying the surface forms in the output.
>>>>>>>> I have tried setting --translation-factors 0-0, but only ended
>>>>>>>> up with the Arabic surface forms in the output.
>>>>>>>>
>>>>>>> I'm not sure what you mean by 'not be part of the translation
>>>>>>> process'. If you want to see the stem in the output but you don't want
>>>>>>> it in the translation table, then there needs to be some process that
>>>>>>> generates the stem, given the target word. Moses has a crude solution -
>>>>>>> it is called the generation step.
>>>>>>>
>>>>>>>> 2) When translating sentences with Moses, I assume that Moses
>>>>>>>> adds the sentence boundary markers <s> </s> automatically. Would it be
>>>>>>>> possible to exclude these from the translation? I need to get
>>>>>>>> translation scores for chunks of input sentences which do not involve
>>>>>>>> scores generated based on <s> and </s> from the LM or phrase table.
>>>>>>>>
>>>>>>> Yes, it includes <s> </s>. No, you can't exclude these from the
>>>>>>> translation process.
>>>>>>>
>>>>>>> I'm curious to know why you want to exclude these
>>>>>>>
>>>>>>>> 3) I added additional phrases to the phrase table. Should the
>>>>>>>> phrase table be sorted again, and is it enough to do "LC_ALL=C sort"
>>>>>>>> on the PT for it to be used properly?
>>>>>>>>
>>>>>>> Yes, it needs to be sorted again. You must also make sure that
>>>>>>> the new phrases are not duplicates of existing phrases
>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Moses-support mailing list
>>>>>>>> [email protected]
>>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>
>>>>> --
>>>>> Hieu Hoang
>>>>> Research Associate
>>>>> University of Edinburgh
>>>>> http://www.hoang.co.uk/hieu
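
For anyone hitting the same segfault: the question raised in this thread was how to find which phrase pairs break loading. Below is a minimal Python sketch, not part of Moses, that lists phrase-table lines whose source or target side still contains a literal "[" or "]". It assumes the usual " ||| " field separator shown in the example entry above. The escape table is modelled on my understanding of what tokenizer.perl and escape-special-chars.perl do, so treat it as an assumption and check it against your copy of the scripts. The function names are made up for illustration.

```python
import gzip

# Replacement table modelled on the escaping done by the Moses tokenizer
# (an assumption; verify against your own tokenizer.perl).
# '&' comes first so that entities inserted later are not escaped twice.
MOSES_ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape_special_chars(token):
    """Escape one plain token (a single factor value, not a factored word)."""
    for raw, escaped in MOSES_ESCAPES:
        token = token.replace(raw, escaped)
    return token


def find_bracket_entries(lines):
    """Yield (line_number, line) for entries with '[' or ']' in source or target."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        fields = line.split(" ||| ")
        if len(fields) < 2:
            continue  # not a phrase-table entry
        source, target = fields[0], fields[1]
        if any(ch in side for side in (source, target) for ch in "[]"):
            yield number, line


def scan_phrase_table(path):
    """Scan a gzipped phrase table, e.g. filtered/phrase-table.0-0,1.1.1.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from find_bracket_entries(handle)
```

Note that escape_special_chars must be applied to each factor before the factors are joined with "|"; running it over an already factored line would also escape the factor separator.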
_______________________________________________ Moses-support mailing list [email protected] http://mailman.mit.edu/mailman/listinfo/moses-support
