John Kolen wrote:

> Yes, the output log is reporting many zero-length sentences. I must
> have something misconfigured upstream.

I find the clean-corpus-n.perl script included with the Moses  
distribution to be useful here.  I have a target in my Makefile that  
looks like this:

LENGTHLIMIT=40
%.clean.fr %.clean.en: %.en %.fr
        ./moses-scripts/scripts/training/clean-corpus-n.perl $* fr en $*.clean \
                1 $(LENGTHLIMIT)

If you don't use Makefiles, the equivalent command would be something like:

   clean-corpus-n.perl data fr en data.clean 1 40

This creates data.clean.en and data.clean.fr from data.en and data.fr,
dropping every sentence pair in which either segment is shorter than 1
word (which solves your problem) or longer than 40 words.  The script
can also optionally lowercase the data, although we do that elsewhere.
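
In case it helps to see the length filter spelled out, here is a rough
Python sketch of the rule described above.  It is not the Perl script
itself: the function name clean_pairs is made up for illustration, and
it assumes "length" means the number of whitespace-separated tokens.
The real script may apply further checks that are not modelled here.

```python
def clean_pairs(src_lines, tgt_lines, min_len=1, max_len=40):
    """Keep only sentence pairs where BOTH sides have min_len..max_len tokens.

    Illustrative sketch only: clean_pairs is a hypothetical helper, and
    length is assumed to be the whitespace-separated token count.
    """
    kept = []
    # zip walks the two sides in parallel, one sentence pair at a time
    for src, tgt in zip(src_lines, tgt_lines):
        n_src = len(src.split())  # an empty or blank line has 0 tokens
        n_tgt = len(tgt.split())
        if min_len <= n_src <= max_len and min_len <= n_tgt <= max_len:
            kept.append((src.strip(), tgt.strip()))
    return kept
```

With min_len=1, a pair is dropped as soon as one side is empty, which
is what removes the zero-length sentences from both files at once and
keeps the two sides aligned line by line.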

(Apologies if you already know about this.)

- JB
