Hi Ryan,

Mathias is correct about handling this in preprocessing and not modifying the SMT model. However, his suspicion that “removing spaces is even less of a problem” is not the case.
If spaces were almost never used in Thai (as in Chinese, for example), removing them would be trivial. However, Thai has over 50 “proper” uses for spaces, and word delimiter is not one of them. Add to that the disinclination of Thais to follow the rules (depending on the text type), and restoring proper spacing becomes quite a complex task.

I see from your email that you’re with St Steven’s International School. I’m based in Bangkok. I’d be happy to meet you and discuss strategies offline if you like.

Tom

> On Dec 19, 2017, at 12:00 AM, [email protected] wrote:
>
> Date: Mon, 18 Dec 2017 09:30:55 +0100
> From: Mathias Müller <[email protected]>
> Subject: Re: [Moses-support] handling no-space languages in the decoder
> To: Ryan Coughlin <[email protected]>
> Cc: [email protected]
>
> Hi Ryan
>
> Conceptually, the easiest way is to regard segmentation as a preprocessing
> (and postprocessing) step that the core model has nothing to do with. You
> should not bother to modify the decoder itself in this case.
>
> You will need a light wrapper for the Moses decoder. If you have a way to
> segment the training data, you can do the same right before translation for
> Thai-English and vice versa. For instance, a simple shell script.
>
> I suspect that removing spaces is even less of a problem.
>
> Regards
> Mathias
>
>> On 17 Dec 2017, at 13:08, Ryan Coughlin <[email protected]> wrote:
>>
>> Hi all,
>>
>> I'm trying to use Moses to handle Thai-English translation. As far as I
>> know, this never has been done.
>>
>> Thai is a language without spacing between words. Running a
>> word-segmentation script to put spaces in between words is rather trivial.
>> When training, I've pre-segmented the sentences with spaces between the
>> words and the training seems to go OK.
>>
>> My problem is with the decoder. Is there a way to modify it so that a Thai
>> sentence without spaces will be segmented to a sentence with spaces and then
>> decoded to a proper English sentence. And the reverse would be an English
>> sentence would be input and the Thai no space sentence would be output. Does
>> that make sense? Sorry for the noob question.
>>
>> Thank you for any and all help that you may give me.
>>
>> take care,
>> Ryan
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
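
For illustration only, the wrapper idea discussed in this thread (segment before decoding, fix spacing after decoding, leave the decoder untouched) could be sketched in Python roughly as follows. Every name here is hypothetical: `greedy_segment` is a toy longest-match segmenter standing in for a real Thai word segmenter, `external_decoder` simply pipes one line through whatever decoder command the caller supplies (no particular Moses flags are assumed), and `strip_all_spaces` is the naive postprocessing step.

```python
import subprocess
from typing import Callable, List, Sequence, Set


def greedy_segment(text: str, lexicon: Set[str]) -> List[str]:
    """Toy longest-match segmenter (a stand-in for a real Thai segmenter).

    At each position, take the longest lexicon entry that matches;
    characters not covered by the lexicon become single-character tokens.
    """
    longest = max([len(word) for word in lexicon] + [1])
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            if size == 1 or text[i:i + size] in lexicon:
                tokens.append(text[i:i + size])
                i += size
                break
    return tokens


def external_decoder(command: Sequence[str]) -> Callable[[str], str]:
    """Wrap a decoder that reads one sentence on stdin and writes one on stdout.

    The command line is supplied by the caller; nothing about its
    arguments is assumed here.
    """
    def decode(line: str) -> str:
        result = subprocess.run(
            list(command),
            input=line + "\n",
            capture_output=True,
            text=True,
            encoding="utf-8",
            check=True,
        )
        return result.stdout.strip()
    return decode


def strip_all_spaces(text: str) -> str:
    """Naive postprocessing: drops every space, including legitimate ones."""
    return text.replace(" ", "")


def thai_to_english(sentence: str,
                    segment: Callable[[str], List[str]],
                    decode: Callable[[str], str]) -> str:
    """Preprocess (segment into space-separated tokens), then decode."""
    return decode(" ".join(segment(sentence)))


def english_to_thai(sentence: str,
                    decode: Callable[[str], str],
                    respace: Callable[[str], str]) -> str:
    """Decode, then postprocess the spacing of the Thai output."""
    return respace(decode(sentence))
```

Keeping `respace` as a separate, swappable function reflects the point made above: `strip_all_spaces` would also delete the spaces that Thai uses legitimately, so a more careful spacing-restoration step can replace it without touching the rest of the wrapper.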
