Hi Chee Wu,

If you're looking for a Java-based solution, it may be worth looking at LibSVM. You can use this open-source package to train a Support Vector Machine (SVM) classifier, which can then classify the documents that Nutch crawls for you. In general, the more training documents you have, the better the accuracy. Keep in mind that the training documents must be carefully hand-picked to minimize misclassification. LibSVM supports both two-class and multi-class classification tasks.
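For reference, LibSVM reads its training data as sparse text, one document per line, in the form `<label> <index>:<value> ...` with feature indices in ascending order. Below is a minimal sketch of how crawled text could be turned into such lines. It assumes plain bag-of-words term counts as features (in practice TF-IDF or length-normalized weights are more common), and the class `LibSvmFormat` and its method `toLine` are hypothetical names made up for illustration, not part of LibSVM or Nutch.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: converts raw text into lines of LibSVM's sparse input format. */
final class LibSvmFormat {
    // Maps each distinct term to a feature index; indices start at 1.
    private final Map<String, Integer> vocabulary = new HashMap<>();

    /** Builds one line "<label> <index>:<count> ..." for a document with a numeric class label. */
    String toLine(int label, String text) {
        // TreeMap keeps the feature indices sorted in ascending order.
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with a separator
            }
            Integer index = vocabulary.get(token);
            if (index == null) {
                index = vocabulary.size() + 1; // next unused index
                vocabulary.put(token, index);
            }
            counts.merge(index, 1, Integer::sum); // raw term count (an assumption, see above)
        }
        StringBuilder line = new StringBuilder(Integer.toString(label));
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            line.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return line.toString();
    }
}
```

With a single `LibSvmFormat` instance, `toLine(1, "Nutch crawls nutch")` gives `1 1:2 2:1`, and a following `toLine(-1, "crawls pages")` gives `-1 2:1 3:1`. The same instance has to be reused for the training documents and for the pages to be classified, so that each term keeps the same feature index.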
--
Regards,
~ Ashish Saharia ~

-----Original Message-----
From: chee wu [mailto:[EMAIL PROTECTED]
Sent: Sunday, February 04, 2007 7:29 PM
To: [email protected]
Subject: Any successful experiences for text classification ?

Hi,
I am trying to divide all the crawled web pages into predefined categories. Has anybody successfully implemented classification based on Nutch? I did find some threads talking about this, but none of them are clear enough. Below are some possible solutions mentioned in past threads:
1. Using SVM-Light, but it seems to be a C-based program?
2. Can I do this with Carrot2?
3. Other open source software packages like Rainbow or IBM UIMA?
I want to research the three options above in more depth. Which one should I study first? Any other hints or experiences are also welcome!
Thanks
-Chee

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
