Mirko Plitt wrote:
> To close the loop on this one, in case anyone else runs into this.
>
> It turns out the reordering table contained a handful of offending
> lines which triggered the abort:
>
> ^K ||| ^K ||| 0.818182 0.0909091 0.0909091 0.818182 0.0909091 0.0909091
> ^K ||| désactivés ||| 0.6 0.2 0.2 0.6 0.2 0.2
> ^K ||| en ||| 0.2 0.2 0.6 0.2 0.2 0.6
> ^K ||| la ||| 0.714286 0.142857 0.142857 0.714286 0.142857 0.142857
Based on recent experience with corrupted data in the UN Chinese-English corpus, I now have a step in my data prep pipeline that strips out any lines, on either side, containing any ASCII "control" characters. I do this in Python, but something like the following would work in Perl:

    perl -ne 'print m/[\000-\010\013\016-\037\177]/ ? "\n" : $_;'

(Control-K is \013.) This replaces any line containing such characters with an empty line. I run the Python equivalent of this on both sides of my parallel data, separately. Later, the clean-corpus-n.perl script in the Moses training pipeline strips out the entire sentence pair, since one side has zero tokens.

Note that this works for ASCII or UTF-8 data, but something else may be appropriate for other character encodings.

- John D. Burger
  MITRE

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
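For anyone who prefers Python, here is a minimal sketch of an equivalent stdin-to-stdout filter. It is an assumption of what John's Python step might look like, not his actual code; the character class mirrors the Perl one-liner above (it deliberately leaves tab, newline, form feed, and carriage return alone):

    import re
    import sys

    # ASCII control characters except tab (\011), LF (\012), FF (\014),
    # and CR (\015) -- the same class as the Perl regex above.
    CONTROL_CHARS = re.compile(r"[\000-\010\013\016-\037\177]")

    def blank_control_lines(lines):
        """Replace any line containing a control character with an
        empty line, so that a later pass (e.g. clean-corpus-n.perl)
        drops the whole sentence pair."""
        for line in lines:
            yield "\n" if CONTROL_CHARS.search(line) else line

    if __name__ == "__main__":
        for out in blank_control_lines(sys.stdin):
            sys.stdout.write(out)

Run it once per side of the parallel corpus (e.g. python strip_controls.py < corpus.fr > corpus.clean.fr), keeping the two sides line-aligned so the pair can be removed together afterwards.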
