The docs say that you can customize the vectorizer by defining your own
preprocessor or tokenizer to segment the text first when needed, e.g. for
Chinese or Japanese. However, sklearn does not ship such a segmenter itself.
To see how to implement one, the best way is to read the source code. I think
the text processing pipeline is pretty clear, thanks to Olivier's work.
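
As a rough illustration, here is a minimal sketch of plugging a segmenter into
the vectorizer through the `tokenizer` argument of `CountVectorizer`. The
segmenter is only a toy stand-in (forward maximum matching against a tiny
hand-made word list); `TOY_VOCAB`, `segment` and `make_vectorizer` are
hypothetical names made up for this example, and for real data you would
replace `segment` with a proper segmentation tool for your language.

```python
# -*- coding: utf-8 -*-

# Hypothetical toy dictionary; a real segmenter would use a much larger
# lexicon or a statistical model.
TOY_VOCAB = {"北京", "大学", "北京大学", "研究", "语言学"}


def segment(text, vocabulary=TOY_VOCAB, max_len=4):
    """Split text into words by forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    found in the vocabulary; fall back to a single character otherwise.
    Whitespace is skipped; punctuation ends up as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += length
                break
    return tokens


def make_vectorizer():
    """Build a CountVectorizer that uses segment() instead of the default
    regular-expression tokenizer."""
    # Imported here so that segment() can be tried without scikit-learn.
    from sklearn.feature_extraction.text import CountVectorizer

    # token_pattern=None only states that the default pattern is unused,
    # since the custom tokenizer replaces it.
    return CountVectorizer(tokenizer=segment, token_pattern=None)
```

Usage would then look like `X = make_vectorizer().fit_transform(docs)`, where
`docs` is a list of unicode strings. Passing a callable as `tokenizer` keeps
the rest of the pipeline (preprocessing, n-gram extraction) unchanged.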
On Wed, Jun 13, 2012 at 7:22 PM, Dinesh B Vadhia
<[email protected]> wrote:
> Hi! In the docs, under "Customizing the vectorizer classes":
>
> http://scikit-learn.org/dev/modules/feature_extraction.html#customizing-the-vectorizer-classes-
>
> it says, "Customizing the vectorizer can be very useful to handle Asian
> languages that do not use an explicit word separator such as the whitespace
> for instance."
>
> I want to process a large volume of Asian language summary docs (in
> different Asian languages) and want to understand how SciKit can be
> utilised for achieving the above. Does someone on the list have experience
> in this area or is there an open-source module available? Thank-you.
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Best Wishes
--------------------------------------------
Meng Xinfan(蒙新泛)
Institute of Computational Linguistics
Department of Computer Science & Technology
School of Electronic Engineering & Computer Science
Peking University
Beijing, 100871
China