Mirko Plitt wrote:

> To close the loop on this one, in case anyone else runs into this.
>
> Turns out the reordering table contained a handful of offending lines
> which triggered the abort:
>
> ^K ||| ^K ||| 0.818182 0.0909091 0.0909091 0.818182 0.0909091 0.0909091
> ^K ||| désactivés ||| 0.6 0.2 0.2 0.6 0.2 0.2
> ^K ||| en ||| 0.2 0.2 0.6 0.2 0.2 0.6
> ^K ||| la ||| 0.714286 0.142857 0.142857 0.714286 0.142857 0.142857

Based on recent experiences with corrupted data in the UN Chinese-English
corpus, I now have something in my data prep pipeline that strips out any
lines, on either side, with any ASCII "control" characters.  I do this in
Python, but something like the following would work with Perl:

   perl -ne 'print m/[\000-\010\013\016-\037\177]/ ? "\n" : $_;'

(Control-K is \013.)  This replaces any lines containing such characters
with an empty line.  I run the Python equivalent of this on both sides of
my parallel data, separately.  Later, the clean-corpus-n.perl script in
the Moses training pipeline strips out the entire pair, since one side
has zero tokens.
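
A rough Python equivalent (just a sketch, not the exact script; it filters
one side of the corpus from stdin to stdout) would be:

   import re
   import sys

   # Same character class as the Perl one-liner above:
   # \x00-\x08, \x0b (Control-K), \x0e-\x1f, and \x7f (DEL).
   CONTROL = re.compile(r'[\x00-\x08\x0b\x0e-\x1f\x7f]')

   for line in sys.stdin:
       # Emit an empty line in place of any line containing a control
       # character, so clean-corpus-n.perl later drops the whole pair.
       sys.stdout.write('\n' if CONTROL.search(line) else line)

Run it separately on each side (file names here are only examples):

   python strip_control.py < corpus.fr > corpus.clean.fr
   python strip_control.py < corpus.en > corpus.clean.en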

Note that this works for ASCII or UTF-8 data, but something else may be
appropriate for other character encodings.

- John D. Burger
   MITRE


