2012/6/15 xinfan meng <[email protected]>:
> The docs tell you that you can customize and define a preprocessor to first
> segment the text if needed, e.g. in Chinese or Japanese. However, sklearn
> does not provide such a preprocessor. To see how you can implement one,
> the best way is to take a look at the code. I think the text processing
> pipeline is pretty clear, thanks to Olivier's work.

+1, there are plenty of Chinese word segmenters around:

https://www.google.com/search?q=chinese+word+segmentation+python

I haven't used any of them so I cannot make a recommendation. There is
also a lecture touching on this problem in the Coursera NLP class:

  https://class.coursera.org/nlp/lecture/preview

and finally the nltk documentation also talks a bit about the problem here:

  http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#word-segmentation

but does not seem to provide ready-to-use models for Chinese.
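For reference, a minimal sketch of how a segmenter can be plugged into sklearn's `CountVectorizer` via its `tokenizer` parameter. The `segment` function below (naive character bigrams) is a hypothetical stand-in: in practice you would replace it with a call to a real Chinese word segmenter.

```python
from sklearn.feature_extraction.text import CountVectorizer

def segment(text):
    # Hypothetical stand-in for a real Chinese word segmenter:
    # emit overlapping character bigrams as pseudo-words.
    return [text[i:i + 2] for i in range(len(text) - 1)]

# Pass the segmenter as the tokenizer; disable lowercasing since it
# is meaningless for Chinese text.
vectorizer = CountVectorizer(tokenizer=segment, lowercase=False)
X = vectorizer.fit_transform(["我爱机器学习"])

print(sorted(vectorizer.vocabulary_))
print(X.shape)
```

The same `tokenizer` hook would accept any callable returning a list of tokens, so swapping in a proper segmenter is a one-line change.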

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
