I tried both, with very poor results. Cherokee morphology is tied closely to its sounds, and the language is written in a syllabary, which hides some of this from the machine.
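To make the syllabary point concrete: each syllabary character stands for a whole syllable, so a segmenter that works on characters cannot place a boundary between a consonant and the vowel that follows it. Below is a minimal sketch, not what I actually ran, that romanizes syllabary text using the syllable spelled out in each character's Unicode name (e.g. "CHEROKEE LETTER TSA" becomes "tsa"). The function names `syllables` and `romanize` are made up for illustration, and the Unicode names are not a standard Cherokee orthography, so treat the output as a rough preprocessing aid only.

```python
import unicodedata


def syllables(word):
    """Return one romanized unit per character of `word`.

    Cherokee characters are replaced by the last word of their Unicode
    name, lowercased (assumption: that is a usable romanization).
    Anything else is passed through unchanged.
    """
    units = []
    for ch in word:
        name = unicodedata.name(ch, "")
        if name.startswith("CHEROKEE "):
            units.append(name.split()[-1].lower())
        else:
            units.append(ch)
    return units


def romanize(text, sep=""):
    """Romanize whitespace-separated words; `sep` joins units inside a word."""
    return " ".join(sep.join(syllables(w)) for w in text.split())


if __name__ == "__main__":
    word = "\u13E3\u13B3\u13A9"  # the word for "Cherokee" in syllabary
    print(romanize(word))           # tsalagi
    print(romanize(word, sep="-"))  # tsa-la-gi
```

With `sep=""` the mapping is lossy (the character for a bare "s" followed by "a" is indistinguishable from the single character "sa"), so the original text should be kept alongside the romanized copy if the segmentation has to be mapped back.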
But I came up with something that may help: a small program that does some simple infix guessing and splitting for the most relevant infixes (pronouns, benefactive, etc.).

In case someone finds what I have done useful:
https://github.com/mjoyner-vbservices-net/CherokeeAffixSplitter

It would be better to take some known valid verb entries and generate the needed permutations to split against, but hopefully this will be enough to help.

On Mon, Feb 1, 2016 at 3:07 PM, Marcin Junczys-Dowmunt <[email protected]> wrote:
> Plain tokenized text is good enough. It may even work as a tokenizer(?)
> if none is available. There is no specific notion of "infix themes"
> though. The segmentation is purely frequency-based, no linguistic
> motivation there, but it may just work.
>
> It's easy enough, just run it and take a look at the results. Even if it
> looks strange to you, it may be worth doing a test training anyway. As I
> said, for Russian->English I get a nice improvement for patent data.
>
> On 01.02.2016 19:30, Michael Joyner wrote:
> > So how does that work?
> >
> > Does it just take all the words from the corpus and guess "infix themes"?
> > Or do I have to supply pre-tagged data?
> >
> > On Mon, Feb 1, 2016 at 9:04 AM, Rico Sennrich <[email protected]> wrote:
> >
> >     Hi Mike,
> >
> >     here's a link to the tool Marcin mentioned:
> >     https://github.com/rsennrich/subword-nmt
> >
> >     I haven't tried it on phrase-based MT myself, but feel free to
> >     give it a try.
> >
> >     You could also try other unsupervised morpheme segmenters like
> >     morfessor: https://github.com/aalto-speech/morfessor
> >
> >     I don't know if there are any segmentation methods specific to
> >     Cherokee.
> >
> >     best wishes,
> >     Rico
> >
> >     On 01.02.2016 13:31, Marcin Junczys-Dowmunt wrote:
> >>
> >> Hi Mike,
> >>
> >> Maybe take a look at Rico's tool for handling unknown words in
> >> neural machine translation.
> >> I have been playing around with that
> >> for Russian-English and standard phrase-based SMT with some
> >> success. I am just not sure if your small corpora will be enough
> >> to learn useful segmentations though.
> >>
> >> It's an unsupervised method for word segmentation. For
> >> Russian-English I created a code dictionary of the 100,000
> >> most-frequent segments per language. Unseen tokens will get
> >> segmented. The segmentation is not necessarily similar to a
> >> linguistically correct segmentation, though. You will probably want
> >> to try smaller numbers.
> >>
> >> Best,
> >>
> >> Marcin
> >>
> >> On 2016-02-01 14:12, Michael Joyner wrote:
> >>
> >>> I am trying to use Moses with Cherokee using the New Testament
> >>> and Genesis as primary corpus. I am feeding it the WEB, BBE as
> >>> source English texts at the moment.
> >>>
> >>> As Cherokee uses bound pronouns and no articles and has almost
> >>> nil preposition analogues (these features are mostly verb
> >>> infixes), is there a technique for corpus adjustment that can be
> >>> done to improve the phrase mapping between Cherokee and English?
> >>>
> >>> I am currently doing Cherokee => English.
> >>>
> >>> Thanks, Mike
> >>> --
> >>>
> >>> WEB: World English Bible (Public Domain)
> >>> BBE: Basic English Bible (Public Domain)

--
- Learn to the Cherokee language: http://jalagigawoni.gnomio.com/
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
