Mirko Plitt wrote:

> You're right about the null byte, and given the time I've been  
> spending on this training I'm definitely interested in any shortcut  
> that would avoid my having to start from scratch!
>
> The data I'm training on is not Chinese UN data but a pretty large  
> dump of Microsoft software strings in English and French.

I'll include the list in case this is of general interest.

I don't have this written down anywhere, but I =think= this is what
I've done the few times this has bitten me.  First go into
working-dir/model and delete everything but the following:

   aligned.grow-diag-final-and
   aligned.0.fr
   aligned.0.en
   lex.0-0.n2f
   lex.0-0.f2n

(or move the rest into a subdir if you're paranoid.)
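
For the paranoid route, something along these lines should do it. This is
only a sketch: the function name stash_others and the subdirectory name
"old" are made up for illustration, and it assumes you have already cd'ed
into working-dir/model.

```shell
# Move everything in the current directory except the five files listed
# above into a subdirectory.  "old" is an arbitrary name for this sketch.
stash_others() {
    mkdir -p old
    for f in *; do
        case "$f" in
            old|aligned.grow-diag-final-and|aligned.0.fr|aligned.0.en|lex.0-0.n2f|lex.0-0.f2n)
                ;;  # keep these where they are
            *)
                # Guard against an unmatched glob in an empty directory.
                if [ -e "$f" ]; then
                    mv -- "$f" old/
                fi
                ;;
        esac
    done
}
```

Run it as "stash_others" from inside the model directory; nothing is
deleted, so the moved files can be put back if the restart goes wrong.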

Now run this fragment of Perl:

   perl -i.BAD -pe 's/[\000]/NULLBYTE/g;' aligned.0* lex.0*

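This will replace every null byte in the four aligned.0.* and lex.0-0.*
files with the literal string NULLBYTE, saving each original alongside
it with a .BAD suffix.  (aligned.grow-diag-final-and is not touched.
Fixing all four may be overkill, for instance if only the foreign side
has the problem.)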
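
If you want to see which files actually contain null bytes, before or
after the replacement, a check like the following might help. It is a
sketch, not part of Moses: the function names count_nulls and report_nulls
are invented here, and it relies only on standard tr and wc.

```shell
# Count null bytes in one file: delete everything except NUL, count the rest.
count_nulls() {
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Print "count<TAB>filename" for each existing file named on the command line.
report_nulls() {
    for f in "$@"; do
        [ -f "$f" ] || continue   # skip unmatched globs and missing files
        printf '%s\t%s\n' "$(count_nulls "$f")" "$f"
    done
}
```

From inside the model directory, "report_nulls aligned.0* lex.0*" lists a
count per file; after the Perl fix every count should be 0.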

Now restart the moses training script with the same invocation as  
before, but tell it to start at step 5:

   train-factored-phrase-model.perl ... --first-step 5

This should skip all the corpus munging etc., most importantly the  
time-consuming GIZA steps.

Hope this works out for you!

- John Burger
   MITRE
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
