Yes, the only cleaning I did was running
$SCRIPTS_ROOTDIR/training/clean-corpus-n.perl on both the English and the
Arabic training sets.
I tokenized the English side with the Moses tokenizer, using the
tokenizer.perl and lowercase.perl scripts.
The Arabic side was tokenized with the MADA tool, and the characters
<, >, and | were normalized into Latin characters.
I suspect that some weird character is still appearing in the corpus.
When I have had such a case before, the training script would usually print
the sentence containing those characters and stop building the phrase table.
What I did then was manually remove the sentence from the training data and
the alignment file and continue running the training script, which then
finished successfully, followed by tuning and decoding.
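
To look for such characters before training, a check along these lines might
help. It is only a sketch: it assumes the factored side is whitespace-tokenized
with factors joined by | (surface|stem here), and the function name
find_bad_tokens is made up for illustration, not part of Moses.

```python
import unicodedata


def find_bad_tokens(lines, n_factors=2):
    """Return (line_no, token, reason) for tokens that look unsafe.

    Assumption: every token should consist of exactly n_factors
    non-empty factors separated by '|'.
    """
    problems = []
    for line_no, line in enumerate(lines, start=1):
        for token in line.split():
            factors = token.split("|")
            if len(factors) != n_factors:
                reason = "expected %d factors, found %d" % (n_factors, len(factors))
                problems.append((line_no, token, reason))
            elif "" in factors:
                problems.append((line_no, token, "empty factor"))
            elif any(unicodedata.category(ch) == "Cc" for ch in token):
                # Cc = control characters such as \x07
                problems.append((line_no, token, "control character"))
    return problems
```

It could be run over the factored Arabic side of the cleaned corpus (opened as
UTF-8 text, one sentence per line); the reported line numbers are the sentences
to inspect in both language files.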

In this case I can see that the training ended successfully, and all of the
following files were generated:

419M    model/aligned.0,1.ar
224M    model/aligned.0.ar
256M    model/aligned.0.en
236M    model/aligned.grow-diag-final-and
1.2G    model/extract.0-0,1.inv.sorted.gz
1.2G    model/extract.0-0,1.sorted.gz
870M    model/extract.0-0.o.sorted.gz
92M     model/lex.0-0,1.e2f
92M     model/lex.0-0,1.f2e
4.0K    model/moses.ini
2.6G    model/phrase-table.0-0,1.gz
931M    model/reordering-table.0-0.wbe-msd-bidirectional-fe.gz

The error occurs only when the phrase table is loaded from the filtered
directory:

4.0K    filtered/info
260K    filtered/input.1002
4.0K    filtered/moses.ini
235M    filtered/phrase-table.0-0,1.1.1.gz
957M    filtered/reordering-table.0-0.wbe-msd-bidirectional-fe

I even tried decoding the test set on its own, without MERT, after filtering
the phrase table with the filter-model-given-input.pl script, and it gave the
same error.
If a weird character is the cause, is there a way to find out on which phrase
pair the loading failed?
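
One thing I could try is scanning the phrase table for malformed entries. The
sketch below assumes the usual "source ||| target ||| scores ..." text layout,
with 1 source factor and 2 target factors as in my 0-0,1 setup. The names
check_phrase_table and scan_file are made up for illustration, and a clean
result would not rule out other causes of the segmentation fault.

```python
import gzip


def check_phrase_table(lines, src_factors=1, tgt_factors=2):
    """Yield (line_no, reason) for entries that look malformed."""
    expected_scores = None
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(" ||| ")
        if len(fields) < 3:
            yield line_no, "fewer than three fields"
            continue
        sides = (("source", fields[0], src_factors),
                 ("target", fields[1], tgt_factors))
        for side, text, n in sides:
            tokens = text.split()
            if not tokens:
                yield line_no, side + " phrase is empty"
            for tok in tokens:
                parts = tok.split("|")
                if len(parts) != n or "" in parts:
                    yield line_no, "%s token %r does not have %d factors" % (side, tok, n)
        scores = fields[2].split()
        try:
            [float(s) for s in scores]
        except ValueError:
            yield line_no, "non-numeric score"
            continue
        if expected_scores is None:
            # Take the first parseable entry as the reference score count.
            expected_scores = len(scores)
        elif len(scores) != expected_scores:
            yield line_no, "score count differs from earlier entries"


def scan_file(path, limit=20):
    """Print the first few problems found in a gzipped phrase table."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for i, (line_no, reason) in enumerate(check_phrase_table(f)):
            if i >= limit:
                break
            print(line_no, reason)
```

For example, scan_file("filtered/phrase-table.0-0,1.1.1.gz") would list the
first 20 suspicious line numbers in the filtered table.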

On Sat, Oct 18, 2014 at 10:58 AM, Hieu Hoang <[email protected]> wrote:

> the moses.ini looks ok. Did you clean your training data? Did you tokenize
> it with the moses tokenizer? Did you do anything to your phrase-table?
>
> On 18 October 2014 17:49, Mohammad Salameh <[email protected]> wrote:
>
>> Hi Hieu
>> Please find the moses.ini file attached
>> the exact commands are:
>>
>>
>>
>> ####TRAIN TM
>> $SCRIPTS_ROOTDIR/training/train-model.perl -root-dir $WORK
>> -external-bin-dir $MGIZA_HOME -corpus  $WORK/corpus/trn.fil -f en -e ar
>> -alignment grow-diag-final-and -max-phrase-length 8 --translation-factors
>> 0-0,1 --alignment-factors 0-1 -reordering msd-bidirectional-fe -mgiza -lm
>> 0:5:$WORK/lm/ar_surf.lm &>$WORK/training.out
>>
>> ####TUNE
>> mkdir $WORK/tuning/mertA
>> $SCRIPTS_ROOTDIR/training/mert-moses.pl $WORK/tuning/dev.en
>> $WORK/tuning/dev.ar $MOSES $WORK/model/moses.ini --working-dir $WORK/tuning/mertA
>> --mertdir $MOSES_HOME/bin  --decoder-flags "-threads 11 -max-phrase-length
>> 8" --threads 11 &> $WORK/tuning/mertA/mert.out
>>
>>
>> Thanks,
>> Mohammad
>>
>> On Sat, Oct 18, 2014 at 6:20 AM, Hieu Hoang <[email protected]> wrote:
>>
>>> hi mohammad
>>>
>>>
>>> On 17 October 2014 21:45, Mohammad Salameh <[email protected]> wrote:
>>>
>>>> Thanks Hieu,
>>>> I want to exclude the <s> because I want to translate chunks of source
>>>> sentences with one model, and then add them and their scores as an extra
>>>> feature to the phrase table of a different model.
>>>> So I don't want the sentence boundaries to be involved in the
>>>> translation.
>>>>
>>> I understand. Moses doesn't allow you to exclude <s>, however, if you
>>> don't want the score for this, then maybe you should write a feature
>>> function to subtract it from the score. Or modify an existing language
>>> model to not score <s>
>>>
>>>>
>>>> Also, I trained a factored system with --translation-factors 0-0,1.
>>>> The training process ended successfully and I do not see any errors in the
>>>> training.out file.
>>>> But tuning and decoding both end with a Segmentation Fault while loading
>>>> the phrase table, when loading reaches 3%.
>>>> I have attached the mert.out.
>>>> Would it be possible to know the reason, or which phrases in the phrase
>>>> table are causing the interruption in loading?
>>>>
>>> Can you also send the moses.ini file you used, and the EXACT command you
>>> executed.
>>>
>>>
>>>> Thanks,
>>>> Salameh
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Oct 17, 2014 at 12:57 PM, Hieu Hoang <[email protected]>
>>>> wrote:
>>>>
>>>>>  sorry, must have missed your email. Answers below
>>>>>
>>>>> On 16/10/14 20:21, Mohammad Salameh wrote:
>>>>>
>>>>> Hi,
>>>>> any answer to the above questions,
>>>>> Thanks,
>>>>> Salameh
>>>>>
>>>>> On Fri, Oct 10, 2014 at 10:11 AM, Mohammad Salameh <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi
>>>>>> I have few questions on how Moses system works
>>>>>>
>>>>>>  1) Would it be possible to do a factored translation where factors
>>>>>> appear in the output but are not part of the translation process? For
>>>>>> example, I have English surface forms on the source side and Arabic
>>>>>> surface forms and their stems on the target side. I want to translate
>>>>>> from English surface forms to Arabic surface forms, but also see the
>>>>>> stems accompanying the surface forms in the output.
>>>>>>  I have tried setting --translation-factors 0-0, but only ended up
>>>>>> with the Arabic surface forms in the output.
>>>>>>
>>>>>   I'm not sure what you mean by 'not be part of the translation
>>>>> process'. If you want to see the stem in the output but you don't want it
>>>>> in the translation table, then there needs to be some process that
>>>>> generates the stem, given the target word. Moses has a crude solution - it
>>>>> is called the generation step.
>>>>>
>>>>>
>>>>>>
>>>>>>  2) When translating sentences with Moses, I assume that Moses adds
>>>>>> the sentence boundary markers <s> </s> automatically. Would it be
>>>>>> possible to exclude these from the translation? I need to get
>>>>>> translation scores for chunks of input sentences that do not involve
>>>>>> scores generated based on <s> and </s> from the LM or phrase table.
>>>>>>
>>>>>   Yes, it includes <s> </s>. No, you can't exclude these from the
>>>>> translation process.
>>>>>
>>>>> I'm curious to know why you want to exclude these
>>>>>
>>>>>
>>>>>>  3) I added additional phrases to the phrase table. Should the
>>>>>> phrase table be sorted again, and is it enough to do "LC_ALL=C sort" on
>>>>>> the PT for it to be used properly?
>>>>>>
>>>>>   Yes, it needs to be sorted again. You must also make sure that the
>>>>> new phrases are not duplicates of existing phrases
>>>>>
>>>>>
>>>>>>  Thanks
>>>>>>
>>>>>> _______________________________________________
>>>>>> Moses-support mailing list
>>>>>> [email protected]
>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Hieu Hoang
>>> Research Associate
>>> University of Edinburgh
>>> http://www.hoang.co.uk/hieu
>>>
>>>
>>
>
>
> --
> Hieu Hoang
> Research Associate
> University of Edinburgh
> http://www.hoang.co.uk/hieu
>
>
