Thanks for the pointer. No apologies necessary. I'm betting it's one of
those things that another set of eyeballs will spot immediately. Here's
the sequence of events that replicates (I hope) the demo described on the
web site.

start: Tokenize French
gzip -cd corpora/wmt08/training/news-commentary08.fr-en.fr.gz \
  | bin/tokenizer.perl -l fr > demo/corpus/news-commentary.tok.fr
finish: Tokenize French

start: Tokenize English
gzip -cd corpora/wmt08/training/news-commentary08.fr-en.en.gz \
  | bin/tokenizer.perl -l en > demo/corpus/news-commentary.tok.en
finish: Tokenize English
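(For anyone following along: roughly speaking, the tokenizer just puts spaces around punctuation before splitting. Here's a crude Python approximation -- tokenizer.perl additionally handles language-specific nonbreaking prefixes, abbreviations, and so on, so this is only a sketch:)

```python
import re

def crude_tokenize(line):
    """Put spaces around punctuation, then split on whitespace --
    a rough stand-in for tokenizer.perl (minus abbreviation handling,
    nonbreaking prefixes, etc.)."""
    line = re.sub(r"([.,!?;:()\"])", r" \1 ", line)
    return line.split()

tokens = crude_tokenize("Hello, world.")
```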

start: Limit sentence length
moses-scripts/scripts-20090913-1332/training/clean-corpus-n.perl \
  demo/corpus/news-commentary.tok fr en demo/corpus/news-commentary.clean 1 40
finish: Limit sentence length
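(The length filter in that step amounts to roughly the following -- a minimal Python sketch, assuming whitespace tokenization; the real clean-corpus-n.perl also checks the length ratio between the two sides, among other things:)

```python
def clean_pair(src, tgt, min_len=1, max_len=40):
    """Keep a sentence pair only if both sides have a token count
    within [min_len, max_len] -- the core of the length filter."""
    for sent in (src, tgt):
        n = len(sent.split())
        if n < min_len or n > max_len:
            return False
    return True

pairs = [
    ("bonjour le monde", "hello world"),
    ("", "orphan line"),           # zero-length source: dropped
    ("mot " * 50, "word " * 50),   # over 40 tokens: dropped
]
kept = [p for p in pairs if clean_pair(*p)]
```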

start: Lowercase French training data
bin/lowercase.perl < demo/corpus/news-commentary.clean.fr \
  > demo/corpus/news-commentary.clean.lowercased.fr
finish: Lowercase French training data

start: Lowercase English training data
bin/lowercase.perl < demo/corpus/news-commentary.clean.en \
 > demo/corpus/news-commentary.clean.lowercased.en
finish: Lowercase English training data

start: Lowercase all English training data
bin/lowercase.perl < demo/corpus/news-commentary.tok.en \
 > demo/lm/news-commentary.lowercased.en
finish: Lowercase all English training data

start: Lowercase all French training data
bin/lowercase.perl < demo/corpus/news-commentary.tok.fr \
 > demo/lm/news-commentary.lowercased.fr
finish: Lowercase all French training data

start: Build trigram model for English
bin/i686/ngram-count -order 3 -interpolate -kndiscount -unk \
  -text demo/lm/news-commentary.lowercased.en -lm demo/lm/news-commentary.lm
finish: Build trigram model for English

start: Build trigram model for French
bin/i686/ngram-count -order 3 -interpolate -kndiscount -unk \
  -text demo/lm/news-commentary.lowercased.fr -lm demo/lm/news-commentary.fr.lm
finish: Build trigram model for French
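(The counting half of ngram-count can be pictured like this -- a toy Python sketch of raw trigram counts only; SRILM's interpolated Kneser-Ney smoothing and ARPA output are of course far more involved:)

```python
from collections import Counter

def trigram_counts(sentences):
    """Count trigrams with <s>/</s> sentence-boundary markers,
    as an n-gram trainer tallies them before smoothing."""
    counts = Counter()
    for sent in sentences:
        tokens = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for i in range(len(tokens) - 2):
            counts[tuple(tokens[i:i + 3])] += 1
    return counts

counts = trigram_counts(["the cat sat", "the cat ran"])
```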

start: Train the translation model
moses-scripts/scripts-20090913-1332/training/train-factored-phrase-model.perl \
  -scripts-root-dir /home/jkolen/trans/moses-scripts/scripts-20090913-1332/ \
  -root-dir demo -corpus demo/corpus/news-commentary.clean.lowercased \
  -f fr -e en -alignment grow-diag-final-and \
  -reordering msd-bidirectional-fe \
  -lm 0:3:/home/jkolen/trans/demo/lm/news-commentary.lm
finish: Train the translation model
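(As an aside, the -lm argument packs the factor index, n-gram order, and file path into one colon-separated string; splitting it apart looks like this. A sketch under the assumption of a plain factor:order:path spec -- Moses also accepts an optional trailing LM-type field, which this ignores:)

```python
def parse_lm_spec(spec):
    """Split a Moses-style LM spec 'factor:order:path' into parts.
    Uses split(':', 2) so the path itself is kept whole."""
    factor, order, path = spec.split(":", 2)
    return int(factor), int(order), path

factor, order, path = parse_lm_spec(
    "0:3:/home/jkolen/trans/demo/lm/news-commentary.lm")
```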


On Mon, Sep 14, 2009 at 7:07 AM, John Burger <[email protected]> wrote:

> John Kolen wrote:
>
>> Yes, the output log is reporting many zero-length sentences. I must have
>> something misconfigured upstream.
>
> I find the clean-corpus-n.perl script included with the Moses distribution
> to be useful here.  I have a target in my Makefile that looks like this:
>
> LENGTHLIMIT=40
> %.clean.fr %.clean.en: %.en %.fr
>         ./moses-scripts/scripts/training/clean-corpus-n.perl $* fr en $*.clean \
>                 1 $(LENGTHLIMIT)
>
> If you don't use Makefiles, the equivalent command is something like this:
>
>  clean-corpus-n.perl data fr en data.clean 1 40
>
> This creates data.clean.en and .fr from data.en and .fr, filtering out
> pairs if either segment has length less than 1 (which solves your problem)
> or more than 40.  The script will also optionally take care of lowercasing
> the data, although we do that elsewhere.
>
> (Apologies if you already know about this.)
>
> - JB
>
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
