Hi Tom,

There used to be a freely available Chinese word segmenter provided by
the LDC as well. Unfortunately, things keep disappearing from the web.
https://web.archive.org/web/20130907032401/http://projects.ldc.upenn.edu/Chinese/LDC_ch.htm

For Arabic, I think that many academic research groups used to work with
MADA. But it seems like you'll need a special license for commercial
use.
http://www1.cs.columbia.edu/~rambow/software-downloads/MADA_Distribution.html
https://secure.nouvant.com/columbia/technology/cu14012/license/492

Or you could try MorphTagger/Segmenter, a segmentation tool for Arabic SMT.
http://www.hltpr.rwth-aachen.de/~mansour/MorphSegmenter/
It may not be maintained any more. You can contact Saab Mansour to ask
about it.

Saab has published a couple of papers about this, some of which report
comparisons of different Arabic segmentation strategies for SMT.
http://www.hltpr.rwth-aachen.de/publications/download/687/Mansour-IWSLT-2010.pdf
http://www.hltpr.rwth-aachen.de/publications/download/808/Mansour-LREC-2012.pdf
http://link.springer.com/article/10.1007%2Fs10590-011-9102-0

Cheers,
Matthias


On Sat, 2015-12-19 at 01:19 +0800, Dingyuan Wang wrote:
> Hi Tom,
> 
> As far as I know, the following are widely used, open-source Chinese
> tokenizers:
> 
> * https://github.com/fxsjy/jieba
> * http://sourceforge.net/projects/zpar/
> * https://github.com/NLPchina/ansj_seg
> 
> And this proprietary one:
> 
> * http://ictclas.nlpir.org/
> 
> (Disclaimer: I am one of the developers of jieba, and I personally use
> this.)
> 
> --
> Dingyuan Wang
> On 2015-12-19 at 00:51, "Tom Hoar" <[email protected]> wrote:
> 
> > I'm looking for Chinese and Arabic tokenizers. We've been using
> > Stanford's for a while but it has drawbacks. The Chinese mode loads its
> > statistical models very slowly. The Arabic mode stems the resulting
> > tokens. The coup de grace is that their latest jar update (9 days ago)
> > was compiled to run only with Java 1.8.
> >
> > So, with the exception of Stanford, what choices are available for
> > Chinese and Arabic that you're finding worthwhile?
> >
> > Thanks!
> > Tom
> > _______________________________________________
> > Moses-support mailing list
> > [email protected]
> > http://mailman.mit.edu/mailman/listinfo/moses-support
> >



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

