John Kolen wrote:
> Yes, the output log is reporting many zero length sentences. I must
> have something misconfigured up stream.
I find the clean-corpus-n.perl script included with the Moses
distribution to be useful here. I have a target in my Makefile that
looks like this:
LENGTHLIMIT=40

%.clean.fr %.clean.en: %.en %.fr
	./moses-scripts/scripts/training/clean-corpus-n.perl $* fr en $*.clean \
		1 $(LENGTHLIMIT)
If you don't use Makefiles, the equivalent command would be something
like this:

    clean-corpus-n.perl data fr en data.clean 1 40
This creates data.clean.en and data.clean.fr from data.en and data.fr,
dropping any pair in which either segment is shorter than 1 word (which
solves your problem) or longer than 40 words. The script can also
optionally lowercase the data, although we do that elsewhere.
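In case it helps to see what the length filter amounts to, here is a
rough Python sketch. It is not the Perl script itself: clean_pairs is a
name I made up for illustration, and it only applies the minimum and
maximum length limits on whitespace-separated tokens. As far as I
recall, the real script does a bit more than that (for example, it also
looks at the length ratio between the two sides).

```python
def clean_pairs(src_lines, tgt_lines, min_len=1, max_len=40):
    """Keep only sentence pairs where BOTH sides have between
    min_len and max_len whitespace-separated tokens (inclusive).

    Hypothetical helper for illustration; not part of Moses.
    """
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        src_tokens = src.split()
        tgt_tokens = tgt.split()
        # Drop the pair if either side is too short (e.g. empty) or too long.
        if all(min_len <= len(toks) <= max_len
               for toks in (src_tokens, tgt_tokens)):
            kept.append((" ".join(src_tokens), " ".join(tgt_tokens)))
    return kept


if __name__ == "__main__":
    fr = ["bonjour le monde", "", "une phrase"]
    en = ["hello world", "an orphaned line", "a sentence"]
    for src, tgt in clean_pairs(fr, en):
        print(src, "|||", tgt)
```

The important point is that a pair is dropped as a unit, so the two
output files stay aligned line for line.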
(Apologies if you already know about this.)
- JB