Re: [Moses-support] using factored models

Mohammad Salameh Sun, 19 Oct 2014 12:56:25 -0700

Hi Hieu,
It seems I have found the error.
The occurrence of double brackets "[[" in the phrase table is causing
the problem.
An example of such a phrase is:


, after ||| ,|, [[NS]]|[[NS]] bEd|bEd ||| 0.333333 0.308788 0.00021796 1.11908e-05 ||| 0-0 1-2 ||| 3 4588 1 ||| |||

Moses seems to be confusing such occurrences with the Hiero rule table format.
Although this special character seems to be escaped by the tokenizer.perl
script (which was only used for the English side of my training corpus),
I thought that clean-corpus-n.perl would also escape such instances, since
it is applied to both the source and target.
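
For anyone hitting the same crash, here is a minimal Python sketch of how one
might locate such entries in a phrase table. It assumes the usual " ||| "
field separator with the source phrase in the first field and the target
phrase in the second. The function names are hypothetical, and the bracket
escapes (&#91; and &#93;) are what I believe tokenizer.perl emits, so please
check them against your own copy of the script.

```python
import gzip

# Assumed bracket escapes (believed to match tokenizer.perl; verify locally).
BRACKET_ESCAPES = {"[": "&#91;", "]": "&#93;"}


def find_bracket_entries(lines):
    """Yield (line_number, side, token) for every token containing '[' or ']'."""
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(" ||| ")
        if len(fields) < 2:
            continue  # not a phrase-table line; skip it
        for side, phrase in (("source", fields[0]), ("target", fields[1])):
            for token in phrase.split():
                if "[" in token or "]" in token:
                    yield line_number, side, token


def escape_brackets(text):
    """Replace raw square brackets with their assumed escaped forms."""
    for raw, escaped in BRACKET_ESCAPES.items():
        text = text.replace(raw, escaped)
    return text


def scan_phrase_table(path):
    """Scan a gzipped phrase table and return all bracket hits as a list."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return list(find_bracket_entries(handle))
```

Calling scan_phrase_table on the filtered phrase table would list the line
numbers to inspect. Escaping the brackets in the training corpus before
retraining is probably safer than patching the phrase table by hand.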
Thanks,
Salameh




On Sun, Oct 19, 2014 at 10:41 AM, Hieu Hoang <[email protected]> wrote:

>  ok, it looks like your data is cleaned, specifically, that the characters
> | < > have been escaped.
>
> i'm not really sure why it segfaults. Is there a disk space problem?
>
> You may have to run it with a debugger to find out. If you still can't
> find the problem after a few days, please make your filtered model files
> available for download and I'll try and debug it for you
>
>
> On 18/10/14 20:10, Mohammad Salameh wrote:
>
> Yes, I only used $SCRIPTS_ROOTDIR/training/clean-corpus-n.perl on both
> the English and Arabic training sets.
> I tokenized the English part with the Moses tokenizer, using
> the lowercase.perl and tokenizer.perl scripts.
> The Arabic part was tokenized using the MADA tool, and the characters
> <, >, and | were all normalized into Latin characters.
>  I suspect that some weird character is appearing in the corpus.
> When I have such a case, the training script usually prints the
> sentence with such characters and stops building the phrase table.
> What I did previously was to manually remove the sentence from
> the training data and alignment file and continue running the training
> script, which then finished successfully, along with tuning and decoding.
>
>  In this case I can see that the training ended successfully, and all the
> following files were generated :
>
>  419M    model/aligned.0,1.ar
> 224M    model/aligned.0.ar
> 256M    model/aligned.0.en
> 236M    model/aligned.grow-diag-final-and
> 1.2G    model/extract.0-0,1.inv.sorted.gz
> 1.2G    model/extract.0-0,1.sorted.gz
> 870M    model/extract.0-0.o.sorted.gz
> 92M     model/lex.0-0,1.e2f
> 92M     model/lex.0-0,1.f2e
> 4.0K    model/moses.ini
> 2.6G    model/phrase-table.0-0,1.gz
> 931M    model/reordering-table.0-0.wbe-msd-bidirectional-fe.gz
>
>  The error occurs only when the phrase table is loaded from the filtered
> directory:
>
>  4.0K    filtered/info
> 260K    filtered/input.1002
> 4.0K    filtered/moses.ini
> 235M    filtered/phrase-table.0-0,1.1.1.gz
> 957M    filtered/reordering-table.0-0.wbe-msd-bidirectional-fe
>
>  I even tried decoding alone, without MERT, on the test set after
> filtering the phrase table using the filter-model-given-input.pl script,
> and it gave the same error.
> If that is the case, is there a way to know on which phrase pair
> loading failed?
>
>
> On Sat, Oct 18, 2014 at 10:58 AM, Hieu Hoang <[email protected]> wrote:
>
>> the moses.ini looks ok. Did you clean your training data? Did you
>> tokenize it with the moses tokenizer? Did you do anything to your
>> phrase-table?
>>
>> On 18 October 2014 17:49, Mohammad Salameh <[email protected]> wrote:
>>
>>> Hi Hieu,
>>> Please find the moses.ini file attached.
>>> The exact commands are:
>>>
>>>
>>>
>>>  ####TRAIN TM
>>> $SCRIPTS_ROOTDIR/training/train-model.perl -root-dir $WORK
>>> -external-bin-dir $MGIZA_HOME -corpus  $WORK/corpus/trn.fil -f en -e ar
>>> -alignment grow-diag-final-and -max-phrase-length 8 --translation-factors
>>> 0-0,1 --alignment-factors 0-1 -reordering msd-bidirectional-fe -mgiza -lm
>>> 0:5:$WORK/lm/ar_surf.lm &>$WORK/training.out
>>>
>>>  ####TUNE
>>> mkdir $WORK/tuning/mertA
>>> $SCRIPTS_ROOTDIR/training/mert-moses.pl $WORK/tuning/dev.en
>>> $WORK/tuning/dev.ar $MOSES $WORK/model/moses.ini --working-dir $WORK/tuning/mertA
>>> --mertdir $MOSES_HOME/bin  --decoder-flags "-threads 11 -max-phrase-length
>>> 8" --threads 11 &> $WORK/tuning/mertA/mert.out
>>>
>>>
>>>  Thanks,
>>> Mohammad
>>>
>>> On Sat, Oct 18, 2014 at 6:20 AM, Hieu Hoang <[email protected]> wrote:
>>>
>>>> hi mohammad
>>>>
>>>>
>>>> On 17 October 2014 21:45, Mohammad Salameh <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks Hieu,
>>>>> I want to exclude the <s> because I want to translate chunks of source
>>>>> sentences with one model, and then add them and their scores as an extra
>>>>> feature to a phrase table of a different model.
>>>>> So I don't want the sentence boundaries to be involved in the
>>>>> translation.
>>>>>
>>>>  I understand. Moses doesn't allow you to exclude <s>, however, if you
>>>> don't want the score for this, then maybe you should write a feature
>>>> function to subtract it from the score. Or modify an existing language
>>>> model to not score <s>
>>>>
>>>>>
>>>>>  Also, I trained a factored system with --translation-factors 0-0,1.
>>>>> The training process ended successfully and I do not see any errors in
>>>>> the training.out file.
>>>>> But tuning and decoding both end with a segmentation fault while
>>>>> loading the phrase table, when loading reaches 3%.
>>>>> I have attached the mert.out.
>>>>> Would it be possible to know the reason, or which phrases in the
>>>>> phrase table are causing the interruption in loading?
>>>>>
>>>>  Can you also send the moses.ini file you used, and the EXACT command
>>>> you executed.
>>>>
>>>>
>>>>>  Thanks,
>>>>> Salameh
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 17, 2014 at 12:57 PM, Hieu Hoang <[email protected]>
>>>>> wrote:
>>>>>
>>>>>>  sorry, must have missed your email. Answers below
>>>>>>
>>>>>> On 16/10/14 20:21, Mohammad Salameh wrote:
>>>>>>
>>>>>> Hi,
>>>>>> Is there any answer to the above questions?
>>>>>> Thanks,
>>>>>> Salameh
>>>>>>
>>>>>> On Fri, Oct 10, 2014 at 10:11 AM, Mohammad Salameh <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I have a few questions on how the Moses system works.
>>>>>>>
>>>>>>>  1) Would it be possible to do a factored translation where factors
>>>>>>> appear in the output but are not part of the translation process? For
>>>>>>> example, I have English surface forms on the source side and Arabic
>>>>>>> surface forms and their stems on the target side. I want to translate
>>>>>>> from English surface form to Arabic surface form, but also see the
>>>>>>> stems accompanying the surface forms in the output.
>>>>>>>  I have tried setting --translation-factors 0-0, but only ended up
>>>>>>> with the Arabic surface forms in the output.
>>>>>>>
>>>>>>   I'm not sure what you mean by 'not be part of the translation
>>>>>> process'. If you want to see the stem in the output but you don't want it
>>>>>> in the translation table, then there needs to be some process that
>>>>>> generates the stem, given the target word. Moses has a crude solution:
>>>>>> it is called the generation step.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>  2) When translating sentences with Moses, I assume that Moses
>>>>>>> adds the sentence boundary markers <s> </s> automatically. Would it be
>>>>>>> possible to exclude these from the translation? I need to get
>>>>>>> translation scores for chunks of input sentences that do not involve
>>>>>>> scores generated based on <s> and </s> from the LM or phrase table.
>>>>>>>
>>>>>>   Yes, it includes <s> </s>. No, you can't exclude these from the
>>>>>> translation process.
>>>>>>
>>>>>> I'm curious to know why you want to exclude these
>>>>>>
>>>>>>
>>>>>>>  3) I added additional phrases to the phrase table. Should the
>>>>>>> phrase table be sorted again, and is it enough to run "LC_ALL=C sort"
>>>>>>> on the PT for it to be used properly?
>>>>>>>
>>>>>>   Yes, it needs to be sorted again. You must also make sure that the
>>>>>> new phrases are not duplicates of existing phrases
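
The two requirements above (re-sort, no duplicate phrase pairs) could be
sketched in Python roughly as follows. This is only an illustration under
stated assumptions: merge_phrase_tables is a hypothetical helper, it treats
the first two " ||| " fields as the phrase-pair key, it keeps the first
occurrence of each pair, and it holds everything in memory, so it suits a
small table rather than a multi-gigabyte one. Sorting raw bytes should give
the same order as "LC_ALL=C sort".

```python
def merge_phrase_tables(existing_lines, new_lines):
    """Merge two iterables of phrase-table lines (as bytes).

    Duplicate source/target pairs are dropped (the first occurrence wins),
    and the result is sorted bytewise, like the C locale would.
    """
    seen = set()
    merged = []
    for line in list(existing_lines) + list(new_lines):
        line = line.rstrip(b"\n")
        if not line:
            continue  # ignore blank lines
        fields = line.split(b" ||| ")
        # Key on (source, target); fall back to the whole line if malformed.
        key = (fields[0], fields[1]) if len(fields) >= 2 else (line, b"")
        if key in seen:
            continue
        seen.add(key)
        merged.append(line)
    merged.sort()  # bytes compare by unsigned byte value
    return merged
```

For a full-size phrase table, an external sort (the sort command itself)
followed by a streaming duplicate check over adjacent lines would be the
more realistic route.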
>>>>>>
>>>>>>
>>>>>>>  Thanks
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Moses-support mailing list
>>>>>>> [email protected]
>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>  --
>>>> Hieu Hoang
>>>> Research Associate
>>>> University of Edinburgh
>>>> http://www.hoang.co.uk/hieu
>>>>
>>>>
>>>
>>
>>
>>  --
>> Hieu Hoang
>> Research Associate
>> University of Edinburgh
>> http://www.hoang.co.uk/hieu
>>
>>
>
>
