Hi,
I'd like to use Nutch to crawl parts of the web and automatically
classify the fetched documents before indexing them. I've already done
some investigation into how to achieve this and have read about different
classification techniques such as naive Bayes, SVM, and so on. I've also
run some offline classification tests with several libraries and think
the best approach would be a binary pre-classification (has interesting
content / doesn't have interesting content) using something similar to
CRM114, followed by a fine-grained multi-class classification of the
interesting documents using an SVM or something similar to Lingpipe.
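To make this a bit more concrete, here is a minimal sketch of the
two-stage cascade I have in mind; the two classifier interfaces are only
placeholders for whatever libraries end up being used, not real APIs:

// Placeholder interfaces for the trained models; the real APIs depend on
// the libraries chosen (CRM114 via an external process, an SVM package,
// Lingpipe, ...).
public interface BinaryClassifier {
  boolean isInteresting(String text);
}

public interface MultiClassifier {
  String classify(String text);  // returns a category label
}

public class TwoStageClassifier {
  private final BinaryClassifier preFilter;
  private final MultiClassifier fineGrained;

  public TwoStageClassifier(BinaryClassifier preFilter,
                            MultiClassifier fineGrained) {
    this.preFilter = preFilter;
    this.fineGrained = fineGrained;
  }

  // Returns the category of an interesting document,
  // or null if the document should be dropped entirely.
  public String categorize(String text) {
    if (!preFilter.isInteresting(text)) {
      return null;                       // not interesting -> discard
    }
    return fineGrained.classify(text);   // interesting -> fine-grained class
  }
}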
My question now is: which extension point is appropriate for such a
plugin or extension, and how can I prevent documents that are not
interesting from being indexed at all?
To illustrate my approach, I'd like to apply the following actions
step by step:
1: Fetch a new document from the web
2: Pre-classify the document (interesting, not interesting) with an
already trained filter/classifier - positive: goto 3, negative: goto 4
3: Classify the interesting document using an already trained
multi-class classifier and index it together with meta-information
about the document's class/category, goto 5
4: Throw the document's content and the URL away, forget it, don't index
it, goto 5
5: Fetch the next document (goto 1)
What are the best points to "hook in" with such a classification, and
how do I tell Nutch to throw a document away completely and not index it?
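To make the question more concrete, here is a very rough sketch of how I
imagine steps 2-4 mapping onto an IndexingFilter plugin, based on my
(possibly wrong) reading of the Nutch 1.x API; the exact interface
differs between Nutch versions, and ClassifyingIndexingFilter as well as
the TwoStageClassifier from the sketch above are only my own placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class ClassifyingIndexingFilter implements IndexingFilter {

  private Configuration conf;
  private TwoStageClassifier classifier;  // placeholder, see sketch above

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                              CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    String text = parse.getText();                  // extracted plain text
    String category = classifier.categorize(text);  // steps 2 and 3
    if (category == null) {
      return null;  // step 4: my hope is that returning null drops the document
    }
    doc.add("category", category);  // step 3: store the class as an index field
    return doc;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // load / initialize the trained models here
  }

  public Configuration getConf() {
    return conf;
  }
}

If this is the right direction, I assume the plugin would be registered
against the "org.apache.nutch.indexer.IndexingFilter" extension point in
its plugin.xml and enabled via the plugin.includes property, but I'm not
sure whether returning null really makes Nutch discard the document
completely or whether the content/URL has to be removed somewhere else
as well.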
I would be very grateful if somebody could provide some hints on this
or (even better) a field report on how this can be achieved.
Thank you very much in advance
Bastian