Plain tokenized text is good enough as input. It might even work as a
tokenizer if none is available, though I am not sure about that. There is
no specific notion of "infix themes", though. The segmentation is purely
frequency-based with no linguistic motivation, but it may just work.
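
To give a rough idea of what "purely frequency-based" means, here is a minimal sketch of byte-pair-encoding-style merge learning, the general technique behind Rico's tool. This is not the tool's actual code: the function name learn_merges, the "</w>" end-of-word marker, and the tie-breaking rule are my own assumptions for illustration.

```python
from collections import Counter


def learn_merges(words, num_merges):
    """Learn up to num_merges symbol-pair merges from an iterable of tokens."""
    # Each word becomes a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        # Pick the most frequent pair; ties are broken by comparing the
        # pairs themselves (an arbitrary choice made for determinism).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    toy_corpus = "low low low lower lowest".split()
    print(learn_merges(toy_corpus, 3))
```

Nothing in there knows about morphemes, pronouns or infixes. It only counts which neighbouring symbols co-occur most often, which is why the resulting segments may or may not line up with anything linguistic.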

It's easy enough: just run it and take a look at the results. Even if the
output looks strange to you, it may be worth doing a test training anyway. As I
said, for Russian->English I get a nice improvement on patent data.
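
When you look at the output, it helps to know roughly what happens to a token that was never seen as a whole. The sketch below replays a list of learned merges, in order, on a single word. The function name segment, the "</w>" marker, and the merge list are hypothetical and only for illustration; the real tool has its own code file format and output conventions.

```python
def segment(word, merges):
    """Split a word into segments by replaying merges in learned order."""
    # Start from single characters plus an end-of-word marker.
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            # Fuse the pair wherever it occurs; copy everything else as-is.
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


if __name__ == "__main__":
    # Hypothetical merge list, as if learned from some training corpus.
    merges = [("o", "w"), ("l", "ow"), ("e", "r")]
    print(segment("slower", merges))
```

A word made only of frequent material ends up as a few large pieces, while a rare or unseen word falls apart into smaller ones, down to single characters in the worst case.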

On 01.02.2016 19:30, Michael Joyner wrote:
> So how does that work?
>
> it just takes all the words from the corpus and guesses "infix themes" 
> ? Or do I have to supply pre-tagged data?
>
> On Mon, Feb 1, 2016 at 9:04 AM, Rico Sennrich <rico.sennr...@gmx.ch> wrote:
>
>     Hi Mike,
>
>     here's a link to the tool Marcin mentioned:
>     https://github.com/rsennrich/subword-nmt
>
>     I haven't tried it on phrase-based MT myself, but feel free to
>     give it a try.
>
>     You could also try other unsupervised morpheme segmenters like
>     morfessor: https://github.com/aalto-speech/morfessor
>
>     I don't know if there are any segmentation methods specific to
>     Cherokee.
>
>     best wishes,
>     Rico
>
>
>     On 01.02.2016 13:31, Marcin Junczys-Dowmunt wrote:
>>
>>     Hi Mike,
>>
>>     Maybe take a look at Rico's tool for handling unknown words in
>>     neural machine translation. I have been playing around with that
>>     for Russian-English and standard phrase-based SMT with some
>>     success. I am just not sure if your small corpora will be enough
>>     to learn useful segmentations though.
>>
>>     It's an unsupervised method for word segmentation. For
>>     Russian-English I created a code dictionary of the 100,000
>>     most-frequent segments per language. Unseen tokens will get
>>     segmented. The segmentation is not necessarily similar to a
>>     linguistically correct segmentation, though. You will probably want
>>     to try smaller numbers.
>>
>>     Best,
>>
>>     Marcin
>>
>>     On 2016-02-01 14:12, Michael Joyner wrote:
>>
>>>     I am trying to use Moses with Cherokee using the New Testament
>>>     and Genesis as primary corpus. I am feeding it the WEB, BBE as
>>>     source English texts at the moment.
>>>
>>>     As Cherokee uses bound pronouns, has no articles, and has almost
>>>     no preposition analogues (these features are mostly verb
>>>     infixes), is there a technique for corpus adjustment that can be
>>>     used to improve the phrase mapping between Cherokee and English?
>>>
>>>     I am currently doing Cherokee => English.
>>>     Thanks, Mike
>>>     -- 
>>>
>>>     WEB: World English Bible (Public Domain)
>>>     BBE: Basic English Bible (Public Domain)
>>>
>>>       * Learn to the Cherokee language: http://jalagigawoni.gnomio.com/
>>>
>>>
>>>     _______________________________________________
>>>     Moses-support mailing list
>>>     Moses-support@mit.edu  <mailto:Moses-support@mit.edu>
>>>     http://mailman.mit.edu/mailman/listinfo/moses-support
>
> -- 
>
>   * Learn to the Cherokee language: http://jalagigawoni.gnomio.com/

