Re: [Moses-support] Polysynthetic languages?

Michael Joyner Sat, 06 Feb 2016 14:18:11 -0800

I tried both... very poor results. Cherokee morphology is closely tied to
its sound changes, and the language is written in a syllabary, which hides
some of these boundaries from the machine.


However, I came up with something I hope will help: a small program that
does some simple infix guessing and splitting for the most relevant infixes
(pronouns, benefactive, etc.).

In case someone might find what I have done useful:

https://github.com/mjoyner-vbservices-net/CherokeeAffixSplitter

It would be better if I were to take some known valid verb entries and
generate the needed permutations to split against, but, hopefully this will
be enough to help.
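
As a rough illustration of that idea (not the code in the repository
above), here is a minimal Python sketch that generates affix + stem
permutations from a list of known stems and then splits corpus tokens
against the resulting table. Every prefix, infix, and stem below is a
made-up placeholder rather than a real Cherokee morpheme, and the "@@"
joiner and the function names are assumptions chosen for the example.

```python
from itertools import product

# Hypothetical placeholder morphemes, NOT real Cherokee forms.
PRONOUN_PREFIXES = ["ka", "ti"]
INFIXES = ["", "el"]            # "" means no infix is present
KNOWN_STEMS = ["mora", "suna"]


def build_split_table(prefixes, infixes, stems):
    """Map every generated surface form to its affix-separated form."""
    table = {}
    for prefix, infix, stem in product(prefixes, infixes, stems):
        surface = prefix + infix + stem
        parts = [prefix + "@@"]
        if infix:
            parts.append(infix + "@@")
        parts.append(stem)
        # If two permutations collide on one surface form, keep the first.
        table.setdefault(surface, " ".join(parts))
    return table


def split_line(line, table):
    """Replace each known surface form with its split; leave other tokens alone."""
    return " ".join(table.get(token, token) for token in line.split())


TABLE = build_split_table(PRONOUN_PREFIXES, INFIXES, KNOWN_STEMS)
print(split_line("kamora tielsuna foo", TABLE))
```

Tokens that are not in the table pass through unchanged, so the output
can still be fed to Moses as ordinary tokenized text.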


On Mon, Feb 1, 2016 at 3:07 PM, Marcin Junczys-Dowmunt <[email protected]>
wrote:

> Plain tokenized text is good enough. It may even work as a tokenizer(?)
> if none is available. There is no specific notion of "infix themes"
> though. The segmentation is purely frequency-based, no linguistic
> motivation there, but it may just work.
>
> It's easy enough: just run it and take a look at the results. Even if it
> looks strange to you, it may be worth doing a test training anyway. As I
> said, for Russian->English I get a nice improvement on patent data.
>
> On 01.02.2016 19:30, Michael Joyner wrote:
> > So how does that work?
> >
> > Does it just take all the words from the corpus and guess "infix
> > themes"? Or do I have to supply pre-tagged data?
> >
> > On Mon, Feb 1, 2016 at 9:04 AM, Rico Sennrich <[email protected]> wrote:
> >
> >     Hi Mike,
> >
> >     here's a link to the tool Marcin mentioned:
> >     https://github.com/rsennrich/subword-nmt
> >
> >     I haven't tried it on phrase-based MT myself, but feel free to
> >     give it a try.
> >
> >     You could also try other unsupervised morpheme segmenters like
> >     morfessor: https://github.com/aalto-speech/morfessor
> >
> >     I don't know if there are any segmentation methods specific to
> >     Cherokee.
> >
> >     best wishes,
> >     Rico
> >
> >
> >     On 01.02.2016 13:31, Marcin Junczys-Dowmunt wrote:
> >>
> >>     Hi Mike,
> >>
> >>     Maybe take a look at Rico's tool for handling unknown words in
> >>     neural machine translation. I have been playing around with that
> >>     for Russian-English and standard phrase-based SMT with some
> >>     success. I am just not sure if your small corpora will be enough
> >>     to learn useful segmentations though.
> >>
> >>     It's an unsupervised method for word segmentation. For
> >>     Russian-English I created a code dictionary of the 100,000
> >>     most-frequent segments per language. Unseen tokens will get
> >>     segmented. The segmentation is not necessarily similar to a
> >>     linguistically correct segmentation, though. You will probably want
> >>     to try smaller numbers.
> >>
> >>     Best,
> >>
> >>     Marcin
> >>
> >>     On 2016-02-01 14:12, Michael Joyner wrote:
> >>
> >>>     I am trying to use Moses with Cherokee, using the New Testament
> >>>     and Genesis as the primary corpus. I am feeding it the WEB and
> >>>     BBE as source English texts at the moment.
> >>>
> >>>     Cherokee uses bound pronouns, has no articles, and has almost
> >>>     no preposition analogues (these features are mostly verb
> >>>     infixes). Is there a technique for corpus adjustment that can be
> >>>     used to improve the phrase mapping between Cherokee and English?
> >>>
> >>>     I am currently doing Cherokee => English.
> >>>     Thanks, Mike
> >>>     --
> >>>
> >>>     WEB: World English Bible (Public Domain)
> >>>     BBE: Basic English Bible (Public Domain)
> >>>
> >>>       * Learn to the Cherokee language: http://jalagigawoni.gnomio.com/
> >>>
> >>>
> >>>     _______________________________________________
> >>>     Moses-support mailing list
> >>>     [email protected]  <mailto:[email protected]>
> >>>     http://mailman.mit.edu/mailman/listinfo/moses-support
> >>
> >
> >
>
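
For readers who want to see what the "purely frequency-based"
segmentation described above amounts to, here is a toy Python sketch of
byte-pair-encoding style merge learning, the general technique behind
subword-nmt. It is not the subword-nmt code itself: the function names,
the tie-breaking rule, and the tiny English word counts are assumptions
made for illustration, and the end-of-word marker that real
implementations use is left out for brevity.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Return symbols with every adjacent occurrence of pair fused into one symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_merges(word_counts, num_merges):
    """Learn up to num_merges merge operations from a word -> frequency dict."""
    # Each word starts out as a sequence of single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken by the pair itself so
        # that the result is deterministic (an arbitrary choice).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab = {tuple(merge_pair(list(s), best)): c for s, c in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a (possibly unseen) word by replaying the merges in learned order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy counts, invented for this example.
counts = {"walked": 5, "walking": 4, "talked": 3, "talking": 2}
merges = learn_merges(counts, 4)
print(merges)
print(segment("stalked", merges))
```

The number of merges plays the role of the dictionary size mentioned in
the thread: fewer merges leave words split into smaller pieces, which is
why a smaller setting may suit a small corpus.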



-- 

   - Learn to the Cherokee language: http://jalagigawoni.gnomio.com/

