Hello,
for a vertical search engine I am experimenting with mechanisms in
Nutch to restrict the crawl to only documents relevant to a specific
topic. (I am still a newbie with Nutch, so please bear with me.) The
evaluation of the documents should be done by a plugin which gets the
already parsed raw text and then decides, via some sophisticated (yet
to be written) mechanism, whether the document is relevant. As far as
I can see, there are several points in the crawl process where this
could be done:
a) a parse filter (called after the document has been parsed by
parse-html etc.): if the text does not match, ignore the document, do
not index it, and do not write its URL or the extracted URLs into the
crawldb. The effect is that the crawl may not be very extensive, and
we may never reach some relevant pages because the pages leading there
were discarded.
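To make the idea concrete: the "sophisticated mechanism" is still to be done, so as a placeholder I imagine something trivially simple like keyword counting, which the filter would call on the parsed text. All names here are mine, not Nutch API:

```java
import java.util.Arrays;
import java.util.List;

// Placeholder relevance test: count hits of topic keywords in the
// parsed raw text and apply a simple threshold. To be replaced by
// the real mechanism later.
public class RelevanceCheck {
    private static final List<String> TOPIC_KEYWORDS =
            Arrays.asList("nutch", "crawler", "search engine");
    private static final int MIN_HITS = 2;

    public static boolean isRelevant(String parsedText) {
        String text = parsedText.toLowerCase();
        int hits = 0;
        for (String keyword : TOPIC_KEYWORDS) {
            if (text.contains(keyword)) {
                hits++;
            }
        }
        return hits >= MIN_HITS;
    }
}
```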
b) an index filter (extension point IndexingFilter): simply do not
index the document; its URL still gets written to the crawldb.
Or a combination of these. So far, any objections or comments?
Now my questions:
1) I have already managed to implement a very basic indexing filter
plugin; however, the only way I have found to keep a document from
being indexed is to throw an IndexingException. What is the
recommended approach?
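For reference, this is roughly what my filter does now, but simplified: the types below are local stand-ins I wrote for this sketch, not the real Nutch extension-point signatures.

```java
// Simplified stand-ins for the Nutch types -- the real
// IndexingFilter extension point has its own signatures.
class Document {
    final String text;
    Document(String text) { this.text = text; }
}

class IndexingException extends Exception {
    IndexingException(String msg) { super(msg); }
}

public class TopicIndexingFilter {
    // Throwing is the only way I have found so far to keep a
    // document from being indexed -- hence question 1.
    public Document filter(Document doc) throws IndexingException {
        // placeholder relevance test, see above
        if (!doc.text.toLowerCase().contains("nutch")) {
            throw new IndexingException("document not relevant to topic");
        }
        return doc;
    }
}
```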
2) For solution a), I am looking for a way to daisy-chain the parse
filters: I would like my new parse filter to be called after parsing
HTML, PDF etc. Putting a plugin into parse-plugins.xml is exclusive,
right? So how do I do this?
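By "exclusive" I mean that, if I read the default file correctly, parse-plugins.xml maps each content type to a parser plugin, roughly like this (abridged):

```xml
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/pdf">
    <plugin id="parse-pdf" />
  </mimeType>
</parse-plugins>
```

so I do not see where a second, topic-checking filter would fit into that mapping.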
3) How do I change the score of the URLs in the crawldb? Some URLs
should perhaps not be ignored completely, but rather marked as
comparatively irrelevant.
Thanks for any help.
Sybille