Dear Moses Community,

I want to train Moses with byte-pair encoding (BPE) tokenization (https://github.com/rsennrich/subword-nmt). I plan to do it "by hand", without the EMS.
Is there any problem with this idea? Would it be OK to simply apply BPE after tokenization, truecasing, etc., and then continue with the rest of the usual steps? Are there any gotchas I should take into account? The only potential pitfall I have identified so far is that I have to clean the corpus with clean-corpus-n.perl after applying BPE, so as not to reach the maximum fertility of 9 for mgiza.

Any success or failure experiences with similar setups are also very welcome.

Thanks,
Noe.
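
To make the cleaning concern concrete, here is a minimal Python sketch of the kind of length and ratio filtering I have in mind for the corpus after BPE has been applied. It is not the actual logic of clean-corpus-n.perl, only an illustration: the function names, the length limits of 1 and 80 tokens, and the assumption that the ratio limit is 9 are mine, and the "@@" markers in the example are simply the subword separators that BPE leaves in the text.

```python
def keep_pair(src, tgt, min_len=1, max_len=80, max_ratio=9.0):
    """Return True if a sentence pair passes length and ratio filtering.

    Lengths are counted in whitespace-separated tokens, so every BPE
    subword unit (e.g. "un@@") counts as one token.
    All thresholds are illustrative assumptions.
    """
    n_src = len(src.split())
    n_tgt = len(tgt.split())
    # Drop pairs where either side is too short (including empty lines).
    if n_src < min_len or n_tgt < min_len:
        return False
    # Drop pairs where either side is too long.
    if n_src > max_len or n_tgt > max_len:
        return False
    # Drop pairs whose length ratio exceeds the limit in either direction.
    if n_src > max_ratio * n_tgt or n_tgt > max_ratio * n_src:
        return False
    return True


def clean_parallel(src_lines, tgt_lines, **thresholds):
    """Filter two aligned lists of sentences, keeping them in parallel."""
    return [
        (s.strip(), t.strip())
        for s, t in zip(src_lines, tgt_lines)
        if keep_pair(s.strip(), t.strip(), **thresholds)
    ]
```

The point is that BPE can turn a short sentence into many subword tokens, so a pair that passed the ratio check before segmentation may fail it afterwards, which is why I would run the cleaning step after BPE rather than before.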
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
