Hello,

For a vertical search engine I am experimenting with mechanisms in Nutch to restrict the crawl to documents relevant to a specific topic. (I am still a newbie with Nutch, so please bear with me.) The evaluation of the documents should be done by a plugin which gets the already parsed raw text and then decides, via some sophisticated (yet to be written) mechanism, whether the document is relevant. As far as I can see, there are several points in the crawl process where this could be done:
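To make the discussion concrete, here is the kind of placeholder I am toying with for that decision; the class name, keyword set and threshold are purely my own invented stand-in for the real mechanism, not anything from Nutch:

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the "sophisticated mechanism": scores the
// parsed raw text by the fraction of topic keywords it contains and
// applies a fixed threshold.
public class TopicRelevance {
    private final Set<String> keywords;
    private final double threshold;

    public TopicRelevance(Set<String> keywords, double threshold) {
        this.keywords = keywords;
        this.threshold = threshold;
    }

    // Fraction of topic keywords that occur in the parsed text.
    public double score(String parsedText) {
        String text = parsedText.toLowerCase(Locale.ROOT);
        long hits = keywords.stream().filter(text::contains).count();
        return keywords.isEmpty() ? 0.0 : (double) hits / keywords.size();
    }

    public boolean isRelevant(String parsedText) {
        return score(parsedText) >= threshold;
    }
}
```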

a) parse plugin (to be called after the document was parsed with parse-html etc.): if the text does not match, ignore it, do not index it, and do not write the URL and its extracted outlinks into the crawldb. This has the effect that the crawl may not be very extensive, and we may never reach some relevant pages because the pages leading to them were discarded.
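To illustrate this effect to myself I wrote a toy simulation (the link graph and relevance labels below are made up): if an irrelevant page is dropped entirely, as in strategy a), its outlinks are never expanded, so a relevant page sitting behind it is never reached.

```java
import java.util.*;

// Toy crawl simulation: a link graph plus relevance labels.
// With followIrrelevant=false (strategy a), outlinks of irrelevant
// pages are not expanded, so relevant pages behind them stay unreached.
public class CrawlSim {
    public static Set<String> crawl(Map<String, List<String>> links,
                                    Set<String> relevant, String seed,
                                    boolean followIrrelevant) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>(List.of(seed));
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;
            // Strategy a): do not expand outlinks of irrelevant pages.
            if (!followIrrelevant && !relevant.contains(url)) continue;
            frontier.addAll(links.getOrDefault(url, List.of()));
        }
        return visited;
    }
}
```

With a chain A -> B -> C where only A and C are relevant, strategy a) stops at B and never sees C, while strategy b) still reaches it.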

b) index filter (extension point IndexingFilter): just do not index the document. The URL still gets written to the crawldb, however.

or a combination of these. So far, any objections or comments?

Now my questions:
1) I have already managed to implement a very basic index filter plugin; however, the only way I have found to keep a document from being indexed is to throw an IndexingException. What is the recommended approach?

2) For solution a), I am looking for a way to daisy-chain the parse filters: I would like my new parse filter to be called after parsing HTML, PDF etc. Putting them in parse-plugins.xml is exclusive, right? So how do I do this?

3) How do I change the score of the URLs in the crawldb? Some URLs should perhaps not be ignored completely, but marked as comparatively irrelevant.
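For 3), what I have in mind is roughly the following (again my own sketch, not Nutch code): instead of a binary keep/drop decision, derive a damping factor for the page's score from the relevance value, so irrelevant pages sink in priority rather than disappear.

```java
// Hypothetical mapping from a relevance value in [0,1] to a crawldb
// score: irrelevant pages are damped, not dropped entirely.
public class ScoreAdjust {
    public static float adjust(float crawlScore, double relevance) {
        // Keep at least 10% of the original score, so borderline pages
        // are marked as comparatively irrelevant rather than ignored.
        double clamped = Math.max(0.0, Math.min(1.0, relevance));
        return (float) (crawlScore * (0.1 + 0.9 * clamped));
    }
}
```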

Thanks for any help.

Sybille
