Hi Jian
The logic looks correct to me. If the domains file has been provided, we
then need to check if the sentence is in-domain. If the domains file is
not provided, then all sentences are considered out-of-domain.
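To make that condition concrete, here is a minimal Python sketch of the logic as I described it. It is not the Perl from mml-score.perl: the names IGNORE_SCORE, is_in_domain and compute_score are hypothetical, and the 99999 sentinel is simply taken from the scores reported in this thread.

```python
IGNORE_SCORE = 99999.0  # assumed sentinel, taken from the scores seen in this thread


def score_sentence(index, domains_file_given, is_in_domain, compute_score):
    """Return the sentinel for in-domain sentences, otherwise a real score.

    All parameter names are hypothetical; compute_score stands in for the
    bilingual cross-entropy difference.
    """
    if domains_file_given and is_in_domain(index):
        # In-domain sentences are never candidates for filtering.
        return IGNORE_SCORE
    # Either no domains file was given (everything counts as out-of-domain)
    # or this sentence is out-of-domain: compute the real score.
    return compute_score(index)


def score_corpus(size, domains_file_given, is_in_domain, compute_score):
    """Score every sentence index in range(size)."""
    return [score_sentence(i, domains_file_given, is_in_domain, compute_score)
            for i in range(size)]
```

Under these assumptions, a corpus where is_in_domain is true for every index yields nothing but the sentinel, which is the symptom seen here.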
The fact that all scores are 99999 means that the MML filter is treating
all of your sentences as in-domain. Something may have gone wrong
during corpus preprocessing, or during the creation of the domains file
(/home/mml/mml-test/experiment/model/domains.1). Do the lengths in the
domains file match the lengths of your in-domain and out-of-domain corpora?
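One way to run that comparison is sketched below. It assumes each line of the domains file holds a cumulative end line number followed by a domain name (for example "100 in"); please verify that against your own file before trusting it. The function names are hypothetical, and the corpus line counts have to be supplied by you.

```python
def read_domain_ends(lines):
    """Parse 'END_LINE NAME' entries (assumed format) into (name, end) pairs."""
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        entries.append((fields[1], int(fields[0])))
    return entries


def domain_sizes(entries):
    """Turn cumulative end line numbers into per-domain sentence counts."""
    sizes = {}
    previous_end = 0
    for name, end in entries:
        sizes[name] = end - previous_end
        previous_end = end
    return sizes


def find_mismatches(sizes, corpus_line_counts):
    """Return {name: (size_in_domains_file, actual_line_count)} where they differ."""
    return {name: (sizes.get(name), count)
            for name, count in corpus_line_counts.items()
            if sizes.get(name) != count}
```

Pass the open domains file to read_domain_ends, then compare against a dictionary of line counts for your "in" and "out" corpora. An empty result means the lengths agree.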
cheers - Barry
On 25/01/14 03:29, jian zhang wrote:
Hi Barry, I do not understand the line *if (defined($filter_domains) &&
!&check_sentence_filtered($i))* in mml-score.perl, which comes before
the bilingual cross-entropy difference is computed.
Should it not be *if (!defined($filter_domains) &&
!&check_sentence_filtered($i))*?
Regards,
Jian Zhang
On Fri, Jan 24, 2014 at 10:27 PM, jian zhang <[email protected]> wrote:
Hi Barry,
All the scores are 99999 in that file.
Thanks,
Jian
On Fri, Jan 24, 2014 at 3:51 PM, Barry Haddow <[email protected]> wrote:
Hi Jian
This is a bit suspect:
2014-01-24 14:17:26,276 Retaining at least 0 entries and ignoring 2075137
Are the scores in this file sensible (or are they all the same?)
/home/mml/mml-test/experiment/training/corpus-mml-score.1
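The reason I ask: the index used in the traceback is ignore_count + count, so if every line is ignored and count is 0, the index equals the length of the list and is out of range. Below is a rough Python sketch of that arithmetic. It is not the code in mml-filter.py; the way count is derived from the proportion, the 99999 sentinel and the function names are all assumptions made for illustration.

```python
IGNORE_SCORE = 99999.0  # assumed sentinel value for sentences that are never filtered


def threshold_index(scores, proportion, ignore_score=IGNORE_SCORE):
    """Compute the position of the threshold in the descending score list."""
    ignore_count = sum(1 for s in scores if s == ignore_score)
    # Assumption: the proportion applies only to the non-ignored sentences.
    count = int(proportion * (len(scores) - ignore_count))
    return ignore_count + count


def find_threshold(scores, proportion, ignore_score=IGNORE_SCORE):
    """Return the threshold score, with an explicit error instead of IndexError."""
    index = threshold_index(scores, proportion, ignore_score)
    ranked = sorted(scores, reverse=True)
    if index >= len(ranked):
        raise ValueError(
            "threshold index %d is past the end of %d scores; "
            "are all scores equal to the ignore value?" % (index, len(ranked)))
    return ranked[index]
```

With a score list made up only of the sentinel, threshold_index returns the list length, which matches "Retaining at least 0 entries and ignoring 2075137" followed by the IndexError.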
cheers - Barry
On 24/01/14 14:53, jian zhang wrote:
Hi,
I got the error "IndexError: list index out of range" at the
TRAINING_mml-filter-before-wa step.
I have read the post at
https://www.mail-archive.com/[email protected]/msg08767.html,
but I still cannot figure out what is wrong.
The full error is:
general:strategy = Score
general:source_language = fr
general:target_language = en
general:input_stem = /home/mml/mml-test/experiment/training/corpus.1
general:output_stem = /home/mml/mml-test/experiment/training/corpus-mml.1
general:domain_file = /home/mml/mml-test/experiment/model/domains.1
general:domain_file_out = /home/mml/mml-test/experiment/training/corpus-mml.1
score:score_file = /home/mml/mml-test/experiment/training/corpus-mml-score.1
score:proportion = 0.9
2014-01-24 14:17:26,276 Retaining at least 0 entries and ignoring 2075137
Traceback (most recent call last):
  File "/home/tools/mosesdecoder/scripts/ems/support/mml-filter.py", line 156, in <module>
    main()
  File "/home/tools/mosesdecoder/scripts/ems/support/mml-filter.py", line 111, in main
    strategy = strategy_class(config)
  File "/home/tools/mosesdecoder/scripts/ems/support/mml-filter.py", line 72, in __init__
    [float(line[:-1]) for line in open(self.score_file)], reverse=True)[ignore_count + count]
IndexError: list index out of range
And my ems configuration file has:
#################################################################
# PARALLEL CORPUS PREPARATION:
# create a tokenized, sentence-aligned corpus, ready for training
[CORPUS]
#in-domain parallel corpus
[CORPUS:in]
clean-stem = $training-in-domain-corpus
[CORPUS:out]
#out-domain parallel corpus
clean-stem = $training-out-domain-corpus
#################################################################
# LANGUAGE MODEL TRAINING
[LM]
[LM:lm]
type = 8
lm = $language-model
#################################################################
# MODIFIED MOORE LEWIS FILTERING
[MML]
lm-training = $srilm-dir/ngram-count
lm-settings = "-interpolate -kndiscount -unk"
lm-binarizer = $moses-src-dir/bin/build_binary
lm-query = $moses-src-dir/bin/query
order = 5
### in-/out-of-domain source/target corpora to train the 4 language models
#
# in-domain parallel corpus
indomain-stem = [CORPUS:in:clean-split-stem]
# out-of-domain parallel corpus
outdomain-stem = [CORPUS:out:clean-split-stem]
# settings: number of lines sampled from the corpora to train each language model on
settings = "--line-count 100000"
#################################################################
# TRANSLATION MODEL TRAINING
[TRAINING]
script = $moses-script-dir/training/train-model.perl
training-options = "-mgiza -mgiza-cpus 12 -sort-buffer-size 16G -sort-compress gzip -sort-parallel 12 -cores 12"
parallel = yes
alignment-symmetrization-method = grow-diag-final-and
lexicalized-reordering = msd-bidirectional-fe
score-settings = "--GoodTuring"
include-word-alignment-in-rules = yes
# space-separated list of all out-of-domain corpora to be filtered
mml-filter-corpora = out
mml-before-wa = "-proportion 0.9"
#####################################################
Thanks.
Jian Zhang
_______________________________________________
Moses-support mailing list
[email protected] <mailto:[email protected]>
http://mailman.mit.edu/mailman/listinfo/moses-support
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
--
Jian Zhang
Centre for Next Generation Localisation (CNGL)
<http://www.cngl.ie/index.html>
Dublin City University <http://www.dcu.ie/>