Hi Jian

The logic looks correct to me. If the domains file has been provided, we then need to check if the sentence is in-domain. If the domains file is not provided, then all sentences are considered out-of-domain.
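To make that branch concrete, here is a minimal Python sketch of how I read the condition. It is not the actual mml-score.perl code: the function and parameter names are hypothetical, and it assumes that check_sentence_filtered returns true for out-of-domain sentences (the ones to be filtered) and that in-domain sentences receive the 99999 placeholder score.

```python
SENTINEL = 99999.0  # placeholder score reported in this thread for in-domain sentences


def score_sentence(i, domains_given, is_out_of_domain, cross_entropy_diff):
    """Return the placeholder for in-domain sentences, otherwise the real score.

    All names are hypothetical; the condition mirrors
    if (defined($filter_domains) && !&check_sentence_filtered($i)).
    """
    if domains_given and not is_out_of_domain(i):
        # Domains file present and sentence is in-domain: skip real scoring.
        return SENTINEL
    # No domains file (everything counts as out-of-domain) or an
    # out-of-domain sentence: compute the cross-entropy difference.
    return cross_entropy_diff(i)
```

With no domains file, the first test is false for every sentence, so every sentence gets a real score.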

The fact that all scores are 99999 means that the MML filter is seeing all your sentences as in-domain. It could be that something went wrong during corpus preprocessing, or during the creation of the domains file (/home/mml/mml-test/experiment/model/domains.1). Do the lengths in the domains file match the lengths of your in and out corpora?
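This would also explain the IndexError. The sketch below is a simplified reconstruction based only on the traceback and the "Retaining at least 0 entries and ignoring 2075137" log line, not a copy of mml-filter.py. The way ignore_count and count are derived here is an assumption.

```python
SENTINEL = 99999.0  # placeholder score for sentences treated as in-domain


def threshold(scores, proportion):
    """Pick the cut-off score, skipping the placeholder entries.

    Assumption: ignore_count is the number of placeholder scores and
    count is the share of the remaining entries to retain.
    """
    ignore_count = sum(1 for s in scores if s == SENTINEL)
    count = int((len(scores) - ignore_count) * proportion)
    ranked = sorted(scores, reverse=True)  # placeholders sort to the front
    # If every score is the placeholder, ignore_count == len(scores) and
    # this index falls one past the end of the list.
    return ranked[ignore_count + count]


if __name__ == "__main__":
    try:
        threshold([SENTINEL] * 5, 0.9)
    except IndexError as err:
        print("IndexError:", err)
```

If every line is scored 99999, there is nothing left after the ignored entries, so the index is out of range regardless of the proportion.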

cheers - Barry

On 25/01/14 03:29, jian zhang wrote:
Hi Barry, I do not understand the line *if (defined($filter_domains) && !&check_sentence_filtered($i))* in mml-score.perl, before computing the bilingual cross-entropy difference. Should it not be *if (!defined($filter_domains) && !&check_sentence_filtered($i))*?

Regards,

Jian Zhang




On Fri, Jan 24, 2014 at 10:27 PM, jian zhang <[email protected] <mailto:[email protected]>> wrote:

    Hi Barry,

    All the scores are 99999 in that file.

    Thanks,


    Jian


    On Fri, Jan 24, 2014 at 3:51 PM, Barry Haddow
    <[email protected] <mailto:[email protected]>>
    wrote:

        Hi Jian

        This is a bit suspect:


        2014-01-24 14:17:26,276 Retaining at least 0 entries and ignoring 2075137

        Are the scores in this file sensible (or are they all the same?)

        /home/mml/mml-test/experiment/training/corpus-mml-score.1

        cheers - Barry


        On 24/01/14 14:53, jian zhang wrote:

            Hi,

            I got an IndexError: list index out of range at the
            TRAINING_mml-filter-before-wa step.

            I have read the post at
            https://www.mail-archive.com/[email protected]/msg08767.html,
            however I still cannot figure out what is wrong.

            The full error is

            general:strategy = Score
            general:source_language = fr
            general:target_language = en
            general:input_stem = /home/mml/mml-test/experiment/training/corpus.1
            general:output_stem = /home/mml/mml-test/experiment/training/corpus-mml.1
            general:domain_file = /home/mml/mml-test/experiment/model/domains.1
            general:domain_file_out = /home/mml/mml-test/experiment/training/corpus-mml.1
            score:score_file = /home/mml/mml-test/experiment/training/corpus-mml-score.1
            score:proportion = 0.9

            2014-01-24 14:17:26,276 Retaining at least 0 entries and ignoring 2075137
            Traceback (most recent call last):
              File "/home/tools/mosesdecoder/scripts/ems/support/mml-filter.py", line 156, in <module>
                main()
              File "/home/tools/mosesdecoder/scripts/ems/support/mml-filter.py", line 111, in main
                strategy = strategy_class(config)
              File "/home/tools/mosesdecoder/scripts/ems/support/mml-filter.py", line 72, in __init__
                [float(line[:-1]) for line in open(self.score_file)], reverse=True)[ignore_count + count]
            IndexError: list index out of range

            And my ems configuration file has:

            #################################################################
            # PARALLEL CORPUS PREPARATION:
            # create a tokenized, sentence-aligned corpus, ready for training

            [CORPUS]

            #in-domain parallel corpus
            [CORPUS:in]
            clean-stem = $training-in-domain-corpus

            [CORPUS:out]
            #out-domain parallel corpus
            clean-stem = $training-out-domain-corpus


            #################################################################
            # LANGUAGE MODEL TRAINING
            [LM]
            [LM:lm]
            type = 8
            lm = $language-model
            #################################################################
            # MODIFIED MOORE LEWIS FILTERING

            [MML]

            lm-training = $srilm-dir/ngram-count
            lm-settings = "-interpolate -kndiscount -unk"
            lm-binarizer = $moses-src-dir/bin/build_binary
            lm-query = $moses-src-dir/bin/query
            order = 5

            ### in-/out-of-domain source/target corpora to train the 4 language models
            #
            # in-domain parallel corpus
            indomain-stem = [CORPUS:in:clean-split-stem]

            # out-of-domain parallel corpus
            outdomain-stem = [CORPUS:out:clean-split-stem]

            # settings: number of lines sampled from the corpora to train each language model on
            settings = "--line-count 100000"

            #################################################################
            # TRANSLATION MODEL TRAINING
            [TRAINING]
            script = $moses-script-dir/training/train-model.perl
            training-options = "-mgiza -mgiza-cpus 12 -sort-buffer-size 16G -sort-compress gzip -sort-parallel 12 -cores 12"
            parallel = yes
            alignment-symmetrization-method = grow-diag-final-and
            lexicalized-reordering = msd-bidirectional-fe
            score-settings = "--GoodTuring"
            include-word-alignment-in-rules = yes

            #space separated all out-of domain corpora to be filtered
            mml-filter-corpora = out
            mml-before-wa = "-proportion 0.9"

            #####################################################

            Thanks.


            Jian Zhang


            _______________________________________________
            Moses-support mailing list
            [email protected] <mailto:[email protected]>
            http://mailman.mit.edu/mailman/listinfo/moses-support



-- The University of Edinburgh is a charitable body, registered in
        Scotland, with registration number SC005336.

-- Jian Zhang
        Centre for Next Generation Localisation (CNGL)
        <http://www.cngl.ie/index.html>
        Dublin City University <http://www.dcu.ie/>





