Hi Tom,

As far as I know, the following are widely used, open-source Chinese tokenizers:
* https://github.com/fxsjy/jieba
* http://sourceforge.net/projects/zpar/
* https://github.com/NLPchina/ansj_seg

And this proprietary one:

* http://ictclas.nlpir.org/

(Disclaimer: I am one of the developers of jieba, and I personally use it; a quick usage sketch follows below the quoted message.)

--
Dingyuan Wang

On Dec 19, 2015, at 00:51, "Tom Hoar" <[email protected]> wrote:

> I'm looking for Chinese and Arabic tokenizers. We've been using
> Stanford's for a while, but it has drawbacks. The Chinese mode loads its
> statistical models very slowly. The Arabic mode stems the resulting
> tokens. The coup de grace is that their latest jar update (9 days ago)
> was compiled to run only with Java 1.8.
>
> So, with the exception of Stanford, what choices are available for
> Chinese and Arabic that you're finding worthwhile?
>
> Thanks!
> Tom
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
