Plain tokenized text is good enough. It may even work as a tokenizer(?) if none is available. There is no specific notion of "infix themes", though. The segmentation is purely frequency-based, with no linguistic motivation, but it may just work.
It's easy enough: just run it and take a look at the results. Even if it looks strange to you, it may be worth doing a test training anyway. As I said, for Russian->English I get a nice improvement on patent data.

On 01.02.2016 19:30, Michael Joyner wrote:
> So how does that work? It just takes all the words from the corpus
> and guesses "infix themes"? Or do I have to supply pre-tagged data?
>
> On Mon, Feb 1, 2016 at 9:04 AM, Rico Sennrich <rico.sennr...@gmx.ch> wrote:
>
> Hi Mike,
>
> here's a link to the tool Marcin mentioned:
> https://github.com/rsennrich/subword-nmt
>
> I haven't tried it on phrase-based MT myself, but feel free to
> give it a try.
>
> You could also try other unsupervised morpheme segmenters like
> Morfessor: https://github.com/aalto-speech/morfessor
>
> I don't know if there are any segmentation methods specific to
> Cherokee.
>
> best wishes,
> Rico
>
> On 01.02.2016 13:31, Marcin Junczys-Dowmunt wrote:
>>
>> Hi Mike,
>>
>> Maybe take a look at Rico's tool for handling unknown words in
>> neural machine translation. I have been playing around with that
>> for Russian-English and standard phrase-based SMT with some
>> success. I am just not sure if your small corpora will be enough
>> to learn useful segmentations, though.
>>
>> It's an unsupervised method for word segmentation. For
>> Russian-English I created a code dictionary of the 100,000
>> most-frequent segments per language. Unseen tokens will get
>> segmented. The segmentation is not necessarily similar to a
>> linguistically correct segmentation, though. You will probably
>> want to try smaller numbers.
>>
>> Best,
>>
>> Marcin
>>
>> On 2016-02-01 14:12, Michael Joyner wrote:
>>
>>> I am trying to use Moses with Cherokee using the New Testament
>>> and Genesis as the primary corpus. I am feeding it the WEB and
>>> BBE as source English texts at the moment.
>>>
>>> As Cherokee uses bound pronouns, has no articles, and has almost
>>> no preposition analogues (these features are mostly verb
>>> infixes), is there a technique for corpus adjustment that can be
>>> done to improve the phrase mapping between Cherokee and English?
>>>
>>> I am currently doing Cherokee => English.
>>>
>>> Thanks,
>>> Mike
>>>
>>> --
>>> WEB: World English Bible (Public Domain)
>>> BBE: Basic English Bible (Public Domain)
>>>
>>> * Learn the Cherokee language: http://jalagigawoni.gnomio.com/

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
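[Editor's note: Marcin's description above (purely frequency-based merges, a code dictionary of the N most-frequent segments, unseen tokens still getting segmented) matches the byte-pair-encoding idea behind subword-nmt. The following is a minimal illustrative sketch of that idea, not subword-nmt's actual code; the toy corpus and the merge count of 3 are made up for the example.]

```python
from collections import Counter

def learn_bpe(corpus_tokens, num_merges):
    """Learn BPE merge operations from a tokenized corpus.

    Each word starts as a sequence of characters plus an end-of-word
    marker; the most frequent adjacent symbol pair is merged repeatedly.
    The learned merge list plays the role of the "code dictionary":
    num_merges bounds how many segment types the model knows.
    """
    vocab = Counter(corpus_tokens)
    words = {w: tuple(w) + ("</w>",) for w in vocab}
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for w, freq in vocab.items():
            syms = words[w]
            for i in range(len(syms) - 1):
                pairs[(syms[i], syms[i + 1])] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # rewrite every word with the new merged symbol
        for w in words:
            syms, out, i = words[w], [], 0
            while i < len(syms):
                if i < len(syms) - 1 and (syms[i], syms[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            words[w] = tuple(out)
    return merges, words

def apply_bpe(word, merges):
    """Segment a (possibly unseen) token by replaying the learned merges."""
    syms = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(syms):
            if i < len(syms) - 1 and syms[i] == a and syms[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(syms[i])
                i += 1
        syms = out
    return syms

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, words = learn_bpe(corpus, 3)
# the unseen word "lowest" is still segmented into known subwords
print(apply_bpe("lowest", merges))  # ['l', 'o', 'w', 'est</w>']
```

This is why no pre-tagged data is needed, and also why the resulting segments need not correspond to real morphemes: merges are chosen by corpus frequency alone.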