Are you familiar with the KhmerOS project on Sourceforge.net?

http://sourceforge.net/projects/khmer/?source=directory

At one time, it included an implementation of Moses through our DoMY distribution. There was a parallel corpus and, if I'm not mistaken, a tokenizer. A quick look at the project shows it has changed, so you might have to dig deeper. Let me know if you can't find anything, and I'll try again.

Tom



On 11/09/2014 08:08 PM, Hieu Hoang wrote:
There is no specific Khmer tokenizer in Moses, so the tokenizer uses the English scheme.

Each language tokenizer needs a file in
   scripts/share/nonbreaking_prefixes
You should create your own for Khmer. If you do, please share it with us.
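
As a rough sketch of how a Khmer file could be started (not an official Moses procedure): the files in that directory are named nonbreaking_prefix.<lang>, so one option is to copy the English list to nonbreaking_prefix.km and then edit it by hand. The helper name seed_prefix_file and the choice of English as the template are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def seed_prefix_file(moses_root, lang="km", template="en"):
    """Copy an existing prefix list to nonbreaking_prefix.<lang> as a starting point.

    Hypothetical helper: it only creates the file. The Khmer abbreviations
    themselves still have to be added by hand afterwards.
    """
    prefix_dir = Path(moses_root) / "scripts" / "share" / "nonbreaking_prefixes"
    src = prefix_dir / f"nonbreaking_prefix.{template}"
    dst = prefix_dir / f"nonbreaking_prefix.{lang}"
    # Never overwrite a list that someone has already written.
    if not dst.exists():
        shutil.copyfile(src, dst)
    return dst
```

After editing the new file, the tokenizer can be pointed at it with the language option, for example tokenizer.perl -l km.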

If this is still not good enough, you should write your own program to tokenize Khmer.
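
A minimal sketch of what such a program might look like, assuming that separating punctuation is the main requirement. It only puts spaces around a few Khmer punctuation signs (KHAN U+17D4, BARIYOOSAN U+17D5, CAMNUC PII KUUH U+17D6) and some ASCII punctuation. It does not segment Khmer words, which would need a dictionary or a trained segmenter. The function names and the punctuation set are illustrative choices, not part of Moses.

```python
import re

# Khmer signs treated as separate tokens (an assumed, deliberately small set):
# U+17D4 KHAN, U+17D5 BARIYOOSAN, U+17D6 CAMNUC PII KUUH.
KHMER_PUNCT = "\u17d4\u17d5\u17d6"
ASCII_PUNCT = '!?,;:()"'
PUNCT_RE = re.compile("([" + KHMER_PUNCT + ASCII_PUNCT + "])")


def tokenize_line(line):
    """Put spaces around punctuation and collapse runs of whitespace."""
    spaced = PUNCT_RE.sub(r" \1 ", line)
    return " ".join(spaced.split())


def tokenize_stream(src, dst):
    """Tokenize every line of a text stream, writing one output line per input line."""
    for line in src:
        dst.write(tokenize_line(line) + "\n")
```

To use it on a corpus, open the input and output files with encoding="utf-8" and pass them to tokenize_stream, so that the Khmer text is read and written as UTF-8 regardless of the system locale.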


On 9 November 2014 02:29, Sovath-MITE-319 <[email protected]> wrote:

    Dear Mr. Hieu Hoang,

    Thank you very much for your quick reply. I got it working with
    your tips.

    However, I have been working with Khmer Unicode (UTF-8), and I seem
    to have a problem with the tokenizer: the output does not render
    properly.


    Do you have any tips on how to get Moses to work with Unicode (UTF-8;
    I mean Khmer Unicode)?


    My Best Regards,

    Sovath Chhinh

    On Tue, Nov 4, 2014 at 1:10 AM, Hieu Hoang <[email protected]> wrote:
    > I think there are differences between versions of IRSTLM. Maybe try
    >    --text yes
    >    --text
    >    -text yes
    >    -text
    > Also, Moses comes with the script
    >    scripts/generic/trainlm-irst2.perl
    > which runs IRSTLM for you. You just need to give it the text file.
    >
    > Also, you might want to look at KenLM's lmplz command, which also
    > creates an LM.
    >
    > On 30 October 2014 15:19, Sovath-MITE-319 <[email protected]> wrote:
    >>
    >> Dear Sir,
    >>
    >> I am a student from Royal University of Phnom Penh, Cambodia.
    >>
    >> I am pursuing a Master's degree in Computer Science, and my thesis
    >> is on a parallel corpus from Khmer to English.
    >>
    >> I had no problem installing Moses or the other tools.
    >>
    >> At step number 5, however, I am stuck and can't find any resource
    >> to fix the problem.
    >> I found one post describing the same problem
    >> (http://comments.gmane.org/gmane.comp.nlp.moses.user/9924),
    >> but it does not seem to have a solution. I am not sure whether
    >> something needs to be configured before running step number 5.
    >>
    >> PS: The step I have an issue with:
    >>
    >>  mkdir ~/lm
    >>  cd ~/lm
    >>  ~/irstlm/bin/add-start-end.sh  \
    >>    < ~/corpus/news-commentary-v8.fr-en.true.en \
    >>    > news-commentary-v8.fr-en.sb.en
    >>  export IRSTLM=$HOME/irstlm; ~/irstlm/bin/build-lm.sh \
    >>    -i news-commentary-v8.fr-en.sb.en       \
    >>    -t ./tmp  -p -s improved-kneser-ney -o news-commentary-v8.fr-en.lm.en
    >>  ~/irstlm/bin/compile-lm  \
    >>    --text yes \
    >>    news-commentary-v8.fr-en.lm.en.gz \
    >>    news-commentary-v8.fr-en.arpa.en
    >>
    >> Looking forward to your support.
    >>
    >> Best Regards,
    >> Sovath
    >> _______________________________________________
    >> Moses-support mailing list
    >> [email protected] <mailto:[email protected]>
    >> http://mailman.mit.edu/mailman/listinfo/moses-support
    >
    >
    >
    >
    > --
    > Hieu Hoang
    > Research Associate
    > University of Edinburgh
    > http://www.hoang.co.uk/hieu
    >




--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu



_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
