Hi,

I'd like to use Nutch to crawl parts of the web and automatically classify the fetched documents before indexing them. I've already looked into how to achieve this and have read about different classification techniques such as Bayes, SVM and so on. I've also run some offline classification tests with several libraries. My impression is that the best approach would be a pre-classification (has interesting content / doesn't have interesting content) using something similar to CRM114, followed by a "fine-grained" multi-class classification of the interesting documents using SVM or something similar to LingPipe.
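To make the two-stage idea concrete, here is a minimal sketch in plain Java. It does not use any Nutch, CRM114 or LingPipe API; the class `TwoStageClassifier`, its `demo()` factory and the keyword rules are hypothetical stand-ins for the trained classifiers, just to show the control flow I have in mind.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

/** Two-stage classification sketch. All names are hypothetical, none is a Nutch API. */
class TwoStageClassifier {

    // Stage 1: binary pre-filter (interesting / not interesting).
    private final Predicate<String> isInteresting;
    // Stage 2: multi-class classifier, only applied to interesting text.
    private final Function<String, String> categorizer;

    TwoStageClassifier(Predicate<String> isInteresting,
                       Function<String, String> categorizer) {
        this.isInteresting = isInteresting;
        this.categorizer = categorizer;
    }

    /** Returns the category of an interesting document, or empty if it should be dropped. */
    Optional<String> classify(String text) {
        if (!isInteresting.test(text)) {
            return Optional.empty();            // not interesting: caller discards it
        }
        return Optional.of(categorizer.apply(text));
    }

    /** Toy keyword rules standing in for already trained models (assumption). */
    static TwoStageClassifier demo() {
        Predicate<String> preFilter =
            t -> t.toLowerCase().contains("search") || t.toLowerCase().contains("crawler");
        Function<String, String> multiClass =
            t -> t.toLowerCase().contains("crawler") ? "crawling" : "search";
        return new TwoStageClassifier(preFilter, multiClass);
    }

    public static void main(String[] args) {
        TwoStageClassifier classifier = demo();
        List<String> docs = List.of(
            "how to tune a web crawler",
            "full-text search with an inverted index",
            "cheap holiday offers");
        for (String doc : docs) {
            String action = classifier.classify(doc)
                .map(category -> "index with category=" + category)
                .orElse("discard");
            System.out.println(action + " <- " + doc);
        }
    }
}
```

An empty result means "discard"; a present value is the category that should be stored as meta-information alongside the indexed document. Where exactly such a call would live inside Nutch is the open question.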
My question is: which extension point is appropriate for such a plugin or extension, and how can I prevent uninteresting documents from being indexed at all?

To illustrate my approach, I'd like to apply the following actions step by step:

1. Fetch a new document from the web.
2. Pre-classify the document (interesting / not interesting) with an already trained filter/classifier. If positive, go to step 3; if negative, go to step 4.
3. Classify the interesting document using an already trained classifier with multi-class capabilities and index it with meta-information about the document's class/category. Go to step 5.
4. Throw away the document's content and its URL, forget it, and don't index it. Go to step 5.
5. Fetch the next document (go to step 1).

Which are the best points to "hook in" with such a classification, and how do I tell Nutch to throw a document away completely and not index it?

I would be very grateful if somebody could provide some hints on this or (even better) a field report on how this can be achieved.

Thank you very much in advance,
Bastian

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers