2012/6/15 xinfan meng <[email protected]>:
> The docs tell you that you can customize and define a preprocessor to first
> segment the text if needed, e.g. in Chinese or Japanese. However, sklearn
> does not provide such a preprocessor. To see how you can implement one,
> the best way is to take a look at the code. I think the text processing
> pipeline is pretty clear, thanks to Olivier's work.
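To make the "implement one yourself" part concrete, here is a minimal sketch of how a segmenter can be plugged into the pipeline. The character-bigram function below is just a toy stand-in for a real Chinese segmenter (a common dictionary-free baseline), not a recommendation; the `tokenizer` parameter of CountVectorizer is the hook sklearn exposes for this.

```python
# Toy stand-in for a real Chinese word segmenter: emit overlapping
# character bigrams, a common baseline when no dictionary is available.
def bigram_segment(text):
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return chars
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

# With scikit-learn, any segmenter plugs in as the `tokenizer` callable
# (it receives the raw document string and returns a list of tokens):
#
#   from sklearn.feature_extraction.text import CountVectorizer
#   vectorizer = CountVectorizer(tokenizer=bigram_segment)
#   X = vectorizer.fit_transform(documents)
```

Swapping `bigram_segment` for one of the real segmenters mentioned below should be a one-line change.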
+1, there are plenty of Chinese word segmenters around:
https://www.google.com/search?q=chinese+word+segmentation+python

I haven't used any of them so I cannot make a recommendation. There is
also a word about this problem in the Coursera NLP class:
https://class.coursera.org/nlp/lecture/preview

Finally, the NLTK documentation talks a bit about the problem here:
http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#word-segmentation
but it does not seem to provide ready-to-use models for Chinese.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
