Re: [Moses-support] about Morph tagging
Thank you very much!

BTW, I'm studying Morphisto now, which is a morphological analyzer for German.
http://code.google.com/p/morphisto/

I may also use the relevant HFST tools as morphological analyzers for other languages.

Best Regards
Henry

-----Original Message-----
From: Francis Tyers [mailto:fty...@prompsit.com]
Sent: 20 October 2010 18:13
To: JiaHongwei
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] about Morph tagging

You could use the morphological analysers from the Apertium project.
http://wiki.apertium.org/wiki/Using_an_lttoolbox_dictionary
http://wiki.apertium.org/wiki/Lttoolbox
http://wiki.apertium.org/wiki/HFST

Fran

On Wed, 20 Oct 2010 at 17:58 +0800, JiaHongwei wrote:

Hi,

I need to train a Moses model with POS tags and morphological information for languages such as German, Spanish, French and Italian. With TreeTagger I can get POS tags in the format 'form pos lemma', but I want each token processed further into the format 'form pos lemma morph'. So the job is to take 'form pos lemma' as input and produce 'form pos lemma morph' as output. Could you recommend a way or a tool to do this automatically or in a pipeline?

Thanks in advance!

Best Regards
Henry

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
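The 'form pos lemma' to 'form pos lemma morph' step asked about above can be sketched as a small filter. This is only an illustration under stated assumptions: the input is assumed to be tab-separated with one token per line, `toy_morph` and its lookup table are hypothetical stand-ins for a real analyser (Morphisto, lttoolbox, HFST), and the tag strings are invented. The last helper joins the four columns with `|`, which is the factor separator Moses uses for factored input.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical stand-in for a real morphological analyser.
# The entries and their tag strings are invented for illustration only.
TOY_ANALYSES = {
    ("Häuser", "NN", "Haus"): "Neut.Nom.Pl",
    ("geht", "VVFIN", "gehen"): "3.Sg.Pres.Ind",
}


def toy_morph(form: str, pos: str, lemma: str) -> str:
    """Return a morph tag for the token, or '-' when none is known."""
    return TOY_ANALYSES.get((form, pos, lemma), "-")


def add_morph(
    lines: Iterable[str],
    analyse: Callable[[str, str, str], str] = toy_morph,
) -> Iterator[str]:
    """Turn tab-separated 'form pos lemma' lines into 'form pos lemma morph'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            yield line  # pass blank lines through unchanged
            continue
        form, pos, lemma = line.split("\t")
        yield "\t".join((form, pos, lemma, analyse(form, pos, lemma)))


def to_factored(rows: Iterable[str]) -> str:
    """Join four-column rows into one factored sentence, factors separated by '|'."""
    return " ".join("|".join(row.split("\t")) for row in rows if row.strip())
```

To plug in a real analyser, one would replace `toy_morph` with a function that queries it and picks one analysis per token; how to disambiguate between several analyses is left open here.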
Re: Re: [Moses-support] about Morph tagging
Just so you know, you can compile SFST transducers with HFST, in case you don't want to install many different tools :)

Fran
Re: [Moses-support] Word alignment information in binary phrase table
Thanks Christof, I think a lot of people will find this feature very useful. I've checked it in:
http://mosesdecoder.svn.sourceforge.net/viewvc/mosesdecoder?view=revision&revision=3637

On 22/10/2010 00:01, Christof Pintaske wrote:

Hi,

train-model.perl with the parameter -phrase-word-alignment adds word-for-word alignment information to the phrase table. Unfortunately this information gets lost when the textual phrase table is converted into a binary format with processPhraseTable.

processPhraseTable -alignment-info was meant to store the alignment information in the binary table as well. This functionality has been broken since the format of the word alignment information changed, and currently no word alignment information is stored in binary phrase tables. Being required to use the textual file limits the size of the phrase table to what fits in the server's memory.

The attached patch provides the missing changes. It stores new-style alignment information with the target candidates in the phrase-table.binphr.tgtdata.wa file and reads it back correspondingly. (It doesn't split the alignment information into source and target alignment as the old implementation/format did. It keeps it in a format supported by TargetPhrase::SetAlignmentInfo(std::string).)

I tested the change with valgrind for both moses and processPhraseTable on a smaller Moses translation system without any complaints. Both the translation and the alignment file produced with
moses -use-alignment-info -print-alignment-info -T File
are identical, regardless of whether the phrase table is text or binary. The patch should not change the behavior for phrase tables without word alignment.

I hope you find the patch useful and that it can be committed to the repo. Of course, please let me know if any modifications are necessary or desirable.

best regards
Christof
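To illustrate the difference between keeping the alignment as one string and splitting it by side, here is a rough sketch. It assumes the alignment string is a space-separated list of source-target index pairs such as `0-0 1-2`; that layout and both function names are assumptions made for illustration and are not taken from the patch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_alignment(text: str) -> List[Tuple[int, int]]:
    """Parse an assumed 'src-tgt src-tgt ...' string into index pairs."""
    pairs = []
    for token in text.split():
        src, tgt = token.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def split_by_side(
    pairs: List[Tuple[int, int]],
) -> Tuple[Dict[int, List[int]], Dict[int, List[int]]]:
    """Derive separate per-source and per-target views from the combined pairs."""
    by_source: Dict[int, List[int]] = defaultdict(list)
    by_target: Dict[int, List[int]] = defaultdict(list)
    for src, tgt in pairs:
        by_source[src].append(tgt)
        by_target[tgt].append(src)
    return dict(by_source), dict(by_target)
```

Storing only the combined string, as the message describes, means the two per-side views can always be derived on demand instead of being kept as separate data.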
Re: [Moses-support] KenLM distributed with Moses
Thanks Ken. Nice work. Is there a way to train the ARPA-formatted LM with KenLM, or do we need to train with another tool, such as SRILM, or convert IRSTLM output to full ARPA format?

Thanks again,
Tom

On Mon, 18 Oct 2010 20:31:38 -0400, Kenneth Heafield mo...@kheafield.com wrote:

Hi Moses,

Introducing kenlm in Moses trunk. You no longer need to download a separate language model to use Moses; it's distributed with Moses and compiled in by default on UNIX. This is threadsafe language model inference code that returns the same probabilities as SRI (up to floating point rounding). It loads ARPA files in 2/3 the time SRI takes and uses less memory too.

Using kenlm is simple: in your [lmodel-file] section, change the first digit to 8. For example,
0 0 2 foo.arpa
changes to
8 0 2 foo.arpa

For even faster loading, use the binary format:
kenlm/build_binary foo.arpa foo.binary
then simply provide the binary filename in your moses.ini, e.g. 8 0 2 foo.binary; it auto-detects binary files using magic bytes at the beginning.

The code is ready for use and provides correct results. Inference is slower than it should be due to inefficiencies in the Moses-side wrapper code (it does a vocab lookup for all 5 words every time). I'm working on it, and once this is done I'll post some benchmarks against SRI and IRST.

The binary format is subject to change, but it contains a version number, so on the rare occasions when it does change, new versions will tell you to rebuild your binary files. Windows is currently not supported (it uses mmap), though I welcome contributions using #ifdef and CreateFileMapping.

Have fun and let me know about your experiences with it.

Ken
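The moses.ini change described in the announcement (set the first field of each [lmodel-file] entry to 8) could be scripted roughly as follows. This is a sketch, not part of Moses or KenLM: the function name `use_kenlm` is made up, and it assumes each language model entry consists of exactly four whitespace-separated fields, as in the `0 0 2 foo.arpa` example.

```python
def use_kenlm(ini_text: str) -> str:
    """Set the first field of every [lmodel-file] entry to 8."""
    out = []
    in_lm_section = False
    for line in ini_text.splitlines():
        stripped = line.strip()
        # Track which [section] of the ini file we are in.
        if stripped.startswith("[") and stripped.endswith("]"):
            in_lm_section = stripped == "[lmodel-file]"
            out.append(line)
            continue
        fields = stripped.split()
        # Assumed entry layout: four fields, the first one numeric.
        if in_lm_section and len(fields) == 4 and fields[0].isdigit():
            fields[0] = "8"
            out.append(" ".join(fields))
        else:
            out.append(line)  # everything else is copied unchanged
    return "\n".join(out)
```

Only lines inside the [lmodel-file] section are touched, so numeric settings in other sections stay as they are. Pointing an entry at a file built with kenlm/build_binary would be a separate edit of the filename field.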
Re: [Moses-support] KenLM distributed with Moses
KenLM is inference-only. It cannot create ARPA files. So you'll need to use your favorite toolkit to generate the ARPA.
Re: [Moses-support] KenLM distributed with Moses
Thanks, Ken.

Tom