OK, it looks like your data is clean; specifically, the characters | < > have been escaped.

I'm not really sure why it segfaults. Is there a disk space problem?

You may have to run it under a debugger to find out. If you still can't find the problem after a few days, please make your filtered model files available for download and I'll try to debug it for you.
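
Before reaching for a debugger, it may be worth scanning the filtered phrase table for malformed entries. The following is only a minimal sketch, not a Moses tool. It assumes the usual text layout of "source ||| target ||| scores ||| ..." and, for a --translation-factors 0-0,1 model, one factor per source token and two "|"-separated factors per target token. The names check_line and scan are hypothetical helpers invented for this example, and a clean report does not prove the table is loadable.

```python
import gzip
import sys

DELIM = " ||| "


def check_line(line, src_factors=1, tgt_factors=2):
    """Return a description of the first problem found, or None if the line looks fine."""
    if "\ufffd" in line:
        # Appears when the file is decoded with errors="replace" and bytes are not valid UTF-8.
        return "invalid UTF-8 bytes (or a literal replacement character)"
    fields = line.rstrip("\n").split(DELIM)
    if len(fields) < 3:
        return "expected at least 3 '|||'-separated fields, got %d" % len(fields)
    for side, text, n in (("source", fields[0], src_factors),
                          ("target", fields[1], tgt_factors)):
        tokens = [t for t in text.split(" ") if t]
        if not tokens:
            return "empty %s phrase" % side
        for tok in tokens:
            parts = tok.split("|")
            # A stray or missing '|' changes the factor count of the token.
            if len(parts) != n or "" in parts:
                return "%s token %r does not have %d non-empty factor(s)" % (side, tok, n)
    scores = fields[2].split()
    if not scores:
        return "no scores"
    try:
        [float(s) for s in scores]
    except ValueError:
        return "non-numeric score in %r" % fields[2]
    return None


def scan(path, src_factors=1, tgt_factors=2, limit=20):
    """Collect up to `limit` (line_number, message) pairs for suspicious lines."""
    opener = gzip.open if path.endswith(".gz") else open
    problems = []
    with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
        for lineno, line in enumerate(fh, 1):
            msg = check_line(line, src_factors, tgt_factors)
            if msg:
                problems.append((lineno, msg))
                if len(problems) >= limit:
                    break
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    for lineno, msg in scan(sys.argv[1]):
        print("line %d: %s" % (lineno, msg))
```

If saved as, say, check_pt.py (a made-up filename), it could be run as "python check_pt.py filtered/phrase-table.0-0,1.1.1.gz". Since loading fails at about 3%, any problem reported in the first few percent of the file would be the first thing to look at.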

On 18/10/14 20:10, Mohammad Salameh wrote:
Yes, I only used $SCRIPTS_ROOTDIR/training/clean-corpus-n.perl on both the English and Arabic training sets. I tokenized the English part with the Moses tokenizer, using the lowercase.perl and tokenizer.perl scripts. The Arabic part was tokenized with the MADA tool, and the characters <, >, and | were all normalized into Latin characters.
I suspect that some odd character is appearing in the corpus.
When I have had such a case before, the training script would usually print the sentence containing such characters and stop building the phrase table. What I did then was manually remove the sentence from the training data and the alignment file and continue running the training script, which afterwards completed successfully, as did tuning and decoding.

In this case I can see that the training ended successfully, and all of the following files were generated:

419M    model/aligned.0,1.ar
224M    model/aligned.0.ar
256M    model/aligned.0.en
236M    model/aligned.grow-diag-final-and
1.2G    model/extract.0-0,1.inv.sorted.gz
1.2G    model/extract.0-0,1.sorted.gz
870M    model/extract.0-0.o.sorted.gz
92M     model/lex.0-0,1.e2f
92M     model/lex.0-0,1.f2e
4.0K    model/moses.ini
2.6G    model/phrase-table.0-0,1.gz
931M    model/reordering-table.0-0.wbe-msd-bidirectional-fe.gz

The error occurs only when the phrase table is loaded from the filtered directory:

4.0K    filtered/info
260K    filtered/input.1002
4.0K    filtered/moses.ini
235M    filtered/phrase-table.0-0,1.1.1.gz
957M    filtered/reordering-table.0-0.wbe-msd-bidirectional-fe

I even tried decoding alone, without MERT, on the test set after filtering the phrase table with the filter-model-given-input.pl script, and it gave the same error. If that is the case, is there a way to know on which phrase pair loading failed?


On Sat, Oct 18, 2014 at 10:58 AM, Hieu Hoang wrote:

    the moses.ini looks ok. Did you clean your training data? Did you
    tokenize it with the moses tokenizer? Did you do anything to your
    phrase-table?

    On 18 October 2014 17:49, Mohammad Salameh wrote:

        Hi Hieu
        Please find the moses.ini file attached
        the exact commands are:



        ####TRAIN TM
        $SCRIPTS_ROOTDIR/training/train-model.perl -root-dir $WORK \
            -external-bin-dir $MGIZA_HOME -corpus $WORK/corpus/trn.fil \
            -f en -e ar -alignment grow-diag-final-and -max-phrase-length 8 \
            --translation-factors 0-0,1 --alignment-factors 0-1 \
            -reordering msd-bidirectional-fe -mgiza \
            -lm 0:5:$WORK/lm/ar_surf.lm &>$WORK/training.out

        ####TUNE
        mkdir $WORK/tuning/mertA
        $SCRIPTS_ROOTDIR/training/mert-moses.pl $WORK/tuning/dev.en \
            $WORK/tuning/dev.ar $MOSES $WORK/model/moses.ini \
            --working-dir $WORK/tuning/mertA --mertdir $MOSES_HOME/bin \
            --decoder-flags "-threads 11 -max-phrase-length 8" \
            --threads 11 &> $WORK/tuning/mertA/mert.out


        Thanks,
        Mohammad

        On Sat, Oct 18, 2014 at 6:20 AM, Hieu Hoang wrote:

            hi mohammad


            On 17 October 2014 21:45, Mohammad Salameh wrote:

                Thanks Hieu,
                I want to exclude the <s> because I want to translate
                chunks of source sentences with one model, and then
                add them and their scores as an extra feature to a
                phrase table of a different model.
                So I don't want the sentence boundaries to be involved
                in the translation.

            I understand. Moses doesn't allow you to exclude <s>,
            however, if you don't want the score for this, then maybe
            you should write a feature function to subtract it from
            the score. Or modify an existing language model to not
            score <s>


                Also, I trained a factored system with
                --translation-factors 0-0,1. The training process
                ended successfully and I do not see any errors in the
                training.out file.
                But tuning and decoding both end with a
                Segmentation Fault when loading the phrase table,
                once loading reaches 3%.
                I have attached the mert.out.
                Would it be possible to know the reason, or which
                phrases in the phrase table are causing the
                interruption in loading?

            Can you also send the moses.ini file you used, and the
            EXACT command you executed.

                Thanks,
                Salameh






                On Fri, Oct 17, 2014 at 12:57 PM, Hieu Hoang wrote:

                    sorry, must have missed your email. Answers below

                    On 16/10/14 20:21, Mohammad Salameh wrote:
                    Hi,
                    Any answers to the above questions?
                    Thanks,
                    Salameh

                    On Fri, Oct 10, 2014 at 10:11 AM, Mohammad Salameh wrote:

                        Hi
                        I have a few questions on how the Moses system works.

                        1) Would it be possible to do a factored
                        translation where factors appear in the
                        output but are not part of the translation
                        process? For example, I have English surface
                        forms on the source side and Arabic surface
                        forms and their stems on the target side. I
                        want to translate from English surface forms
                        to Arabic surface forms, but also see the
                        stems accompanying the surface forms in the
                        output. I have tried setting
                        --translation-factors 0-0, but only ended up
                        with the Arabic surface forms in the output.

                    I'm not sure what you mean by 'not be part of the
                    translation process'. If you want to see the stem
                    in the output but you don't want it in the
                    translation table, then there needs to be some
                    process that generates the stem, given the target
                    word. Moses has a crude solution: it is called
                    the generation step.



                        2) When translating sentences with Moses, I
                        assume that Moses adds the sentence boundary
                        markers <s> </s> automatically. Would it be
                        possible to exclude these from the
                        translation? I need to get translation scores
                        for chunks of input sentences that do not
                        involve scores generated from <s> and
                        </s> by the LM or phrase table.

                    Yes, it includes <s> </s>. No, you can't exclude
                    these from the translation process.

                    I'm curious to know why you want to exclude these.

                        3) I added additional phrases to the phrase
                        table. Should the phrase table be sorted
                        again, and is it enough to run "LC_ALL=C sort"
                        on the PT for it to be used properly?

                    Yes, it needs to be sorted again. You must also
                    make sure that the new phrases are not duplicates
                    of existing phrases
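
Both conditions in the answer above (sorted order, no duplicate phrase pairs) can be checked mechanically. What follows is a minimal sketch under stated assumptions rather than a Moses utility: it assumes "sorted" means the plain byte order produced by "LC_ALL=C sort" on whole lines, and that a duplicate is two lines sharing the same first two "|||"-separated fields (source and target). The function name find_issues is hypothetical.

```python
import gzip
import sys

DELIM = b" ||| "


def find_issues(lines):
    """Yield (line_number, message) for out-of-order lines and repeated phrase pairs.

    `lines` is any iterable of byte strings, one phrase-table line each.
    Duplicates are only detected when adjacent, which holds once the file is
    byte-sorted: all lines sharing the same "source ||| target ||| " prefix
    then sit next to each other.
    """
    prev_line = None
    prev_key = None
    for lineno, raw in enumerate(lines, 1):
        line = raw.rstrip(b"\n")
        fields = line.split(DELIM)
        key = tuple(fields[:2])
        # Python compares bytes objects byte by byte, like sort in the C locale.
        if prev_line is not None and line < prev_line:
            yield lineno, "out of byte order relative to previous line"
        if len(fields) >= 2 and key == prev_key:
            yield lineno, "duplicate phrase pair"
        prev_line, prev_key = line, key


if __name__ == "__main__" and len(sys.argv) > 1:
    path = sys.argv[1]
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as fh:
        for lineno, msg in find_issues(fh):
            print("line %d: %s" % (lineno, msg))
```

If the script reports out-of-order lines, the duplicate check should be repeated after re-sorting, because duplicates in an unsorted file are not necessarily adjacent.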


                        Thanks

                        _______________________________________________
                        Moses-support mailing list
                        [email protected]
                        <mailto:[email protected]>
                        http://mailman.mit.edu/mailman/listinfo/moses-support









-- Hieu Hoang
            Research Associate
            University of Edinburgh
            http://www.hoang.co.uk/hieu







