Hi,
I'd like to use Nutch to crawl parts of the web and automatically
classify the fetched documents before indexing them. I've already done
some investigation into how to achieve this and have read about different
classification techniques such as naive Bayes, SVM, and so on. I've also
run some offline classification tests with several libraries and think
the best approach would be a binary pre-classification (has interesting
content / doesn't have interesting content) using something similar to
CRM114, followed by a fine-grained multi-class classification of the
interesting documents using an SVM or something similar to Lingpipe.
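To make this a bit more concrete, here is a minimal sketch of the
two-stage cascade I have in mind; the two classifier interfaces are only
placeholders for whatever libraries end up being used, not real APIs:

// Placeholder interfaces for the trained models; the real APIs depend on
// the libraries chosen (CRM114 via an external process, an SVM package,
// Lingpipe, ...).
public interface BinaryClassifier {
  boolean isInteresting(String text);
}

public interface MultiClassifier {
  String classify(String text);  // returns a category label
}

public class TwoStageClassifier {
  private final BinaryClassifier preFilter;
  private final MultiClassifier fineGrained;

  public TwoStageClassifier(BinaryClassifier preFilter,
                            MultiClassifier fineGrained) {
    this.preFilter = preFilter;
    this.fineGrained = fineGrained;
  }

  // Returns the category of an interesting document,
  // or null if the document should be dropped entirely.
  public String categorize(String text) {
    if (!preFilter.isInteresting(text)) {
      return null;                       // not interesting -> discard
    }
    return fineGrained.classify(text);   // interesting -> fine-grained class
  }
}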
My question now is: which extension point is appropriate for such a
plugin or extension, and how can I prevent documents that are not
interesting from being indexed at all?
To illustrate my approach, I'd like to apply the following actions
step by step:
1: Fetch a new document from the web
2: Pre-classify the document (interesting, not interesting) with an
already trained filter/classifier - positive: goto 3, negative: goto 4
3: Classify the interesting document using an already trained
multi-class classifier and index it together with meta-information
about the document's class/category, goto 5
4: Throw the document's content and the URL away, forget it, don't index
it, goto 5
5: Fetch the next document (goto 1)
What are the best points to "hook in" with such a classification, and
how do I tell Nutch to throw a document away completely and not index it?
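To make the question more concrete, here is a very rough sketch of how I
imagine steps 2-4 mapping onto an IndexingFilter plugin, based on my
(possibly wrong) reading of the Nutch 1.x API; the exact interface
differs between Nutch versions, and ClassifyingIndexingFilter as well as
the TwoStageClassifier from the sketch above are only my own placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class ClassifyingIndexingFilter implements IndexingFilter {

  private Configuration conf;
  private TwoStageClassifier classifier;  // placeholder, see sketch above

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                              CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    String text = parse.getText();                  // extracted plain text
    String category = classifier.categorize(text);  // steps 2 and 3
    if (category == null) {
      return null;  // step 4: my hope is that returning null drops the document
    }
    doc.add("category", category);  // step 3: store the class as an index field
    return doc;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // load / initialize the trained models here
  }

  public Configuration getConf() {
    return conf;
  }
}

If this is the right direction, I assume the plugin would be registered
against the "org.apache.nutch.indexer.IndexingFilter" extension point in
its plugin.xml and enabled via the plugin.includes property, but I'm not
sure whether returning null really makes Nutch discard the document
completely or whether the content/URL has to be removed somewhere else
as well.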
I would be very grateful if somebody could provide some hints on this
or (even better) a field report on how this can be achieved.
Thank you very much in advance
Bastian