Hi Tom,

There used to be a freely available Chinese word segmenter provided by the LDC as well. Unfortunately, things keep disappearing from the web.
https://web.archive.org/web/20130907032401/http://projects.ldc.upenn.edu/Chinese/LDC_ch.htm

For Arabic, I think that many academic research groups used to work with MADA. But it seems like you'll need a special license for commercial use.
http://www1.cs.columbia.edu/~rambow/software-downloads/MADA_Distribution.html
https://secure.nouvant.com/columbia/technology/cu14012/license/492

Or you could try MorphTagger/Segmenter, a segmentation tool for Arabic SMT.
http://www.hltpr.rwth-aachen.de/~mansour/MorphSegmenter/
It may not be maintained any more. You can contact Saab Mansour to ask about it.

Saab has published a couple of papers about this, some of which report comparisons of different Arabic segmentation strategies for SMT.
http://www.hltpr.rwth-aachen.de/publications/download/687/Mansour-IWSLT-2010.pdf
http://www.hltpr.rwth-aachen.de/publications/download/808/Mansour-LREC-2012.pdf
http://link.springer.com/article/10.1007%2Fs10590-011-9102-0

Cheers,
Matthias

On Sat, 2015-12-19 at 01:19 +0800, Dingyuan Wang wrote:
> Hi Tom,
>
> As far as I know, the following are widely-used and open-source Chinese
> tokenizers:
>
> * https://github.com/fxsjy/jieba
> * http://sourceforge.net/projects/zpar/
> * https://github.com/NLPchina/ansj_seg
>
> And this proprietary one:
>
> * http://ictclas.nlpir.org/
>
> (Disclaimer: I am one of the developers of jieba, and I personally use
> this.)
>
> --
> Dingyuan Wang
> On 2015-12-19 at 00:51, "Tom Hoar" <[email protected]> wrote:
>
> > I'm looking for Chinese and Arabic tokenizers. We've been using
> > Stanford's for a while but it has drawbacks. The Chinese mode loads its
> > statistical models very slowly. The Arabic mode stems the resulting
> > tokens. The coup de grace is that their latest jar update (9 days ago)
> > was compiled to run only with Java 1.8.
> >
> > So, with the exception of Stanford, what choices are available for
> > Chinese and Arabic that you're finding worthwhile?
> >
> > Thanks!
> > Tom
> > _______________________________________________
> > Moses-support mailing list
> > [email protected]
> > http://mailman.mit.edu/mailman/listinfo/moses-support

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
