Yes, that's the problem

The tokenizer escapes these characters. Alternatively, the
escape-special-chars.perl script escapes these characters but doesn't tokenize.
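
To make the escaping concrete, here is a rough Python sketch of the kind of substitution involved. It is an approximation, not the Perl script itself: the replacement table reflects my understanding of the entities the Moses tokenizer emits, the name `escape_moses` is made up for this example, and the choice to leave already-escaped entities untouched is an assumption of the sketch. Check it against the scripts in your own Moses checkout before relying on it.

```python
import re

# Replacement table modelled on the Moses tokenizer's escaping.
# Assumption: verify these entities against your Moses version.
_ESCAPES = {
    "&": "&amp;",
    "|": "&#124;",
    "<": "&lt;",
    ">": "&gt;",
    "'": "&apos;",
    '"': "&quot;",
    "[": "&#91;",
    "]": "&#93;",
}

# Match a bare "&" (one that does not already start a known entity),
# or any of the other special characters.
_PATTERN = re.compile(
    r"&(?!(?:amp|lt|gt|apos|quot|#124|#91|#93);)|[|<>'\"\[\]]"
)


def escape_moses(text: str) -> str:
    """Escape characters that have a special meaning in Moses files."""
    return _PATTERN.sub(lambda match: _ESCAPES[match.group(0)], text)


if __name__ == "__main__":
    print(escape_moses("[[NS]] bEd"))
```

Running text that was not passed through the tokenizer (for example, a side tokenized with another tool) through a function like this before training would keep `[` and `]` out of the phrase table.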
On 19 Oct 2014 20:53, "Mohammad Salameh" <[email protected]> wrote:

> Hi Hieu,
> It seems I have found the error.
> The occurrence of double brackets "[[" in the phrase table is causing
> the problem.
> Examples of such phrases are:
>
> , after ||| ,|, [[NS]]|[[NS]] bEd|bEd ||| 0.333333 0.308788 0.00021796 1.11908e-05 ||| 0-0 1-2 ||| 3 4588 1 ||| |||
>
> It seems it is confusing such occurrences with the Hiero rule table format.
> Although this special character is escaped by the tokenizer.perl
> script (which was only used for the English side of my training corpus),
> I thought that clean-corpus-n.perl would also escape such instances, as it is
> applied to both the source and target.
> Thanks,
> Salameh
>
>
>
>
> On Sun, Oct 19, 2014 at 10:41 AM, Hieu Hoang <[email protected]> wrote:
>
>>  ok, it looks like your data is cleaned, specifically, that the
>> characters | < > have been escaped.
>>
>> i'm not really sure why it segfaults. Is there a disk space problem?
>>
>> You may have to run it with a debugger to find out. If you still can't
>> find the problem after a few days, please make your filtered model files
>> available for download and I'll try and debug it for you
>>
>>
>> On 18/10/14 20:10, Mohammad Salameh wrote:
>>
>> Yes, I only used $SCRIPTS_ROOTDIR/training/clean-corpus-n.perl on
>> both the English and Arabic training sets.
>> I tokenized the English part with the Moses tokenizer, using
>> the lowercase.perl and tokenizer.perl scripts.
>> The Arabic part was tokenized using the MADA tool, and the characters
>> <, > and | were all normalized into Latin characters.
>>  I suspect that some weird character is appearing in the corpus.
>> When I have such a case, the training script usually prints the
>> sentence with such characters and stops building the phrase table.
>> What I previously did was manually remove the sentence
>> from the training data and alignment file and continue running the training
>> script, which then ended successfully, along with tuning and decoding.
>>
>>  In this case I can see that the training ended successfully, and all
>> the following files were generated :
>>
>>  419M    model/aligned.0,1.ar
>> 224M    model/aligned.0.ar
>> 256M    model/aligned.0.en
>> 236M    model/aligned.grow-diag-final-and
>> 1.2G    model/extract.0-0,1.inv.sorted.gz
>> 1.2G    model/extract.0-0,1.sorted.gz
>> 870M    model/extract.0-0.o.sorted.gz
>> 92M     model/lex.0-0,1.e2f
>> 92M     model/lex.0-0,1.f2e
>> 4.0K    model/moses.ini
>> 2.6G    model/phrase-table.0-0,1.gz
>> 931M    model/reordering-table.0-0.wbe-msd-bidirectional-fe.gz
>>
>>  The error is only when the phrase table is loaded from the filtered
>> directory:
>>
>>  4.0K    filtered/info
>> 260K    filtered/input.1002
>> 4.0K    filtered/moses.ini
>> 235M    filtered/phrase-table.0-0,1.1.1.gz
>> 957M    filtered/reordering-table.0-0.wbe-msd-bidirectional-fe
>>
>>  I even tried decoding alone, without MERT, on the test set after
>> filtering the phrase table using the filter-model-given-input.pl script, and
>> it gave the same error.
>> If that is the case, is there a way to know on which phrase pair
>> loading failed?
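
One way to narrow down a failing entry without a debugger is to scan the filtered phrase table for lines that look malformed. The Python sketch below is only a heuristic: the two checks (fewer than three `|||`-separated fields, or square brackets in the source or target phrase, which Hiero-style rule tables use for non-terminals) are guesses at likely culprits, not the validation Moses itself performs. All function names are hypothetical.

```python
import gzip
from typing import Iterable, List, Optional, Tuple


def suspect_reason(line: str) -> Optional[str]:
    """Return a short reason if a phrase-table line looks suspicious."""
    fields = [field.strip() for field in line.rstrip("\n").split("|||")]
    if len(fields) < 3:
        return "fewer than 3 fields"
    for name, phrase in (("source", fields[0]), ("target", fields[1])):
        if "[" in phrase or "]" in phrase:
            return f"unescaped bracket in {name} phrase"
    return None


def scan(lines: Iterable[str], limit: int = 20) -> List[Tuple[int, str, str]]:
    """Collect up to `limit` (line number, reason, line) hits."""
    hits = []
    for number, line in enumerate(lines, start=1):
        reason = suspect_reason(line)
        if reason is not None:
            hits.append((number, reason, line.rstrip("\n")))
            if len(hits) >= limit:
                break
    return hits


def scan_file(path: str, limit: int = 20) -> List[Tuple[int, str, str]]:
    """Scan a plain or gzipped phrase table on disk."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        return scan(handle, limit)
```

For example, `scan_file("filtered/phrase-table.0-0,1.1.1.gz")` would list the first twenty suspicious entries with their line numbers, which can then be compared against the point where loading stops.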
>>
>>
>> On Sat, Oct 18, 2014 at 10:58 AM, Hieu Hoang <[email protected]> wrote:
>>
>>> the moses.ini looks ok. Did you clean your training data? Did you
>>> tokenize it with the moses tokenizer? Did you do anything to your
>>> phrase-table?
>>>
>>> On 18 October 2014 17:49, Mohammad Salameh <[email protected]> wrote:
>>>
>>>> Hi Hieu
>>>> Please find the moses.ini file attached
>>>> the exact commands are:
>>>>
>>>>
>>>>
>>>>  ####TRAIN TM
>>>> $SCRIPTS_ROOTDIR/training/train-model.perl -root-dir $WORK
>>>> -external-bin-dir $MGIZA_HOME -corpus  $WORK/corpus/trn.fil -f en -e ar
>>>> -alignment grow-diag-final-and -max-phrase-length 8 --translation-factors
>>>> 0-0,1 --alignment-factors 0-1 -reordering msd-bidirectional-fe -mgiza -lm
>>>> 0:5:$WORK/lm/ar_surf.lm &>$WORK/training.out
>>>>
>>>>  ####TUNE
>>>> mkdir $WORK/tuning/mertA
>>>> $SCRIPTS_ROOTDIR/training/mert-moses.pl $WORK/tuning/dev.en
>>>> $WORK/tuning/dev.ar $MOSES $WORK/model/moses.ini --working-dir
>>>> $WORK/tuning/mertA --mertdir $MOSES_HOME/bin  --decoder-flags "-threads 11
>>>> -max-phrase-length 8" --threads 11 &> $WORK/tuning/mertA/mert.out
>>>>
>>>>
>>>>  Thanks,
>>>> Mohammad
>>>>
>>>> On Sat, Oct 18, 2014 at 6:20 AM, Hieu Hoang <[email protected]>
>>>> wrote:
>>>>
>>>>> hi mohammad
>>>>>
>>>>>
>>>>> On 17 October 2014 21:45, Mohammad Salameh <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks Hieu,
>>>>>> I want to exclude the <s> because I want to translate chunks of source
>>>>>> sentences with one model, and then add them and their scores as an extra
>>>>>> feature to a phrase table of a different model.
>>>>>> So I don't want the sentence boundaries to be involved in the
>>>>>> translation.
>>>>>>
>>>>>  I understand. Moses doesn't allow you to exclude <s>. However, if
>>>>> you don't want the score for this, then maybe you should write a feature
>>>>> function to subtract it from the score, or modify an existing language
>>>>> model to not score <s>.
>>>>>
>>>>>>
>>>>>>  Also, I trained a factored system with  --translation-factors
>>>>>> 0-0,1. The training process ended successfully and I do not see any error
>>>>>> with the training.out file.
>>>>>> But tuning and decoding both end with a segmentation fault
>>>>>> while loading the phrase table, when loading reaches 3%.
>>>>>> I have attached the mert.out.
>>>>>> Would it be possible to know the reason, or which phrases in the
>>>>>> phrase table are causing the interruption in loading?
>>>>>>
>>>>>  Can you also send the moses.ini file you used, and the EXACT command
>>>>> you executed.
>>>>>
>>>>>
>>>>>>  Thanks,
>>>>>> Salameh
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 17, 2014 at 12:57 PM, Hieu Hoang <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>>  sorry, must have missed your email. Answers below
>>>>>>>
>>>>>>> On 16/10/14 20:21, Mohammad Salameh wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>> Any answers to the above questions?
>>>>>>> Thanks,
>>>>>>> Salameh
>>>>>>>
>>>>>>> On Fri, Oct 10, 2014 at 10:11 AM, Mohammad Salameh <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>> I have a few questions on how the Moses system works.
>>>>>>>>
>>>>>>>>  1) Would it be possible to do a factored translation where
>>>>>>>> factors appear in the output but are not part of the translation
>>>>>>>> process? For example, I have English surface forms on the source side
>>>>>>>> and Arabic surface forms and their stems on the target side. I want to
>>>>>>>> translate from English surface forms to Arabic surface forms, but also
>>>>>>>> see the stems accompanying the surface forms in the output.
>>>>>>>>  I have tried setting --translation-factors 0-0, but only ended
>>>>>>>> up with the Arabic surface forms in the output.
>>>>>>>>
>>>>>>>   I'm not sure what you mean by 'not be part of the translation
>>>>>>> process'. If you want to see the stem in the output but you don't want
>>>>>>> it in the translation table, then there needs to be some process that
>>>>>>> generates the stem, given the target word. Moses has a crude solution:
>>>>>>> it is called the generation step.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>  2) When translating sentences with Moses, I assume that Moses
>>>>>>>> adds the sentence boundary markers <s> </s> automatically. Would it be
>>>>>>>> possible to exclude these from the translation? I need to get
>>>>>>>> translation scores for chunks of input sentences that do not involve
>>>>>>>> scores generated based on <s> and </s> from the LM or phrase table.
>>>>>>>>
>>>>>>>   Yes, it includes <s> </s>. No, you can't exclude these from the
>>>>>>> translation process.
>>>>>>>
>>>>>>> I'm curious to know why you want to exclude these
>>>>>>>
>>>>>>>
>>>>>>>>  3) I added additional phrases to the phrase table. Should the
>>>>>>>> phrase table be sorted again, and is it enough to run "LC_ALL=C sort"
>>>>>>>> on the PT for it to be used properly?
>>>>>>>>
>>>>>>>   Yes, it needs to be sorted again. You must also make sure that
>>>>>>> the new phrases are not duplicates of existing phrases
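
Both requirements (byte-order sorting and no duplicate phrase pairs) can be checked together. The sketch below is a hypothetical Python helper, not part of Moses: it sorts lines by raw byte value, which is the ordering `LC_ALL=C sort` uses, and reports any source/target pair that occurs more than once. It assumes the table fits in memory and that fields are separated by `|||`.

```python
from typing import Iterable, List, Tuple


def phrase_pair_key(line: bytes) -> Tuple[bytes, bytes]:
    """Return the (source, target) pair of a phrase-table line."""
    fields = line.split(b"|||")
    if len(fields) < 2:
        raise ValueError(f"not a phrase-table line: {line!r}")
    return fields[0].strip(), fields[1].strip()


def sort_and_check(
    lines: Iterable[bytes],
) -> Tuple[List[bytes], List[Tuple[bytes, bytes]]]:
    """Sort lines in byte order and list duplicated phrase pairs."""
    cleaned = [line.rstrip(b"\n") for line in lines if line.strip()]
    ordered = sorted(cleaned)  # bytes compare by byte value, as in the C locale
    seen = set()
    duplicates = []
    for line in ordered:
        key = phrase_pair_key(line)
        if key in seen:
            duplicates.append(key)
        else:
            seen.add(key)
    return ordered, duplicates


if __name__ == "__main__":
    table = [
        b"b ||| y ||| 0.1\n",
        b"a ||| x ||| 0.2\n",
        b"a ||| x ||| 0.3\n",
    ]
    ordered, duplicates = sort_and_check(table)
    print(ordered)
    print(duplicates)
```

Working on bytes rather than decoded strings is deliberate: it keeps the ordering independent of the locale and of how non-ASCII text would otherwise be collated.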
>>>>>>>
>>>>>>>
>>>>>>>>  Thanks
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Moses-support mailing list
>>>>>>>> [email protected]
>>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>  --
>>>>> Hieu Hoang
>>>>> Research Associate
>>>>> University of Edinburgh
>>>>> http://www.hoang.co.uk/hieu
>>>>>
>>>>>
>>>>
>>>
>>>
>>>  --
>>> Hieu Hoang
>>> Research Associate
>>> University of Edinburgh
>>> http://www.hoang.co.uk/hieu
>>>
>>>
>>
>>
>
