Mirko Plitt wrote:
> You're right about the null byte, and given the time I've been
> spending on this training I'm definitely interested in any shortcut
> that would avoid my having to start from scratch!
>
> The data I'm training on is not Chinese UN data but a pretty large
> dump of Microsoft software strings in English and French.

I'll include the list in case this is of general interest.

I don't have this written down anywhere, but I =think= this is what
I've done the few times this has bitten me. First go into
working-dir/model and delete everything but the following:

    aligned.grow-diag-final-and
    aligned.0.fr
    aligned.0.en
    lex.0-0.n2f
    lex.0-0.f2n

(Or move the rest into a subdir if you're paranoid.)

Now run this fragment of Perl:

    perl -i.BAD -pe 's/[\000]/NULLBYTE/g;' aligned.0* lex.0*

This will replace every null byte in the four files matched by those
patterns (aligned.grow-diag-final-and is left untouched), saving each
old version as <file>.BAD. (This may be overkill, for instance if only
the foreign side has the problem.)

Now restart the moses training script with the same invocation as
before, but tell it to start at step 5:

    train-factored-phrase-model.perl ... --first-step 5

This should skip all the corpus munging etc., most importantly the
time-consuming GIZA steps.

Hope this works out for you!

- John Burger
  MITRE
_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support
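
The message notes that replacing null bytes in all four files may be overkill if only one side is affected. As a minimal sketch, not part of the original message, the following shell fragment reports how many null bytes each file contains, so you can see which ones actually need the Perl rewrite. The helper name count_nulls is hypothetical, and the fragment assumes it is run from inside the model directory.

```shell
# Hypothetical helper: count the null bytes in a file by comparing
# its size with and without them.
count_nulls() {
    total=$(wc -c < "$1")
    stripped=$(tr -d '\000' < "$1" | wc -c)
    echo $((total - stripped))
}

# Report on the four files named in the message; missing files are skipped.
for f in aligned.0.fr aligned.0.en lex.0-0.n2f lex.0-0.f2n; do
    if [ -f "$f" ]; then
        echo "$f: $(count_nulls "$f") null byte(s)"
    fi
done
```

A file that reports 0 could presumably be left out of the Perl command, although rewriting it anyway should be harmless.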
