Hello,
for a vertical search engine I am experimenting with mechanisms in
Nutch to restrict the crawl to only documents relevant to a specific
topic. (I am still a newbie with Nutch, so please bear with me.) The
evaluation of the documents should be done by a plugin which gets the
already parsed raw text and then decides, via some sophisticated (yet
to be written) mechanism, whether the document is relevant. As far as
I can see, there are several points in the crawl process where this
could be done:
a) a parse filter (called after the document has been parsed by
parse-html etc.): if the text does not match, ignore the document, do
not index it, and do not write its URL or the extracted URLs into the
crawldb. The effect is that the crawl may not be very extensive, and
we may never reach some relevant pages because the pages leading there
were discarded.
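To make the idea concrete: the "sophisticated mechanism" is still to be done, so as a placeholder I imagine something trivially simple like keyword counting, which the filter would call on the parsed text. All names here are mine, not Nutch API:

```java
import java.util.Arrays;
import java.util.List;

// Placeholder relevance test: count hits of topic keywords in the
// parsed raw text and apply a simple threshold. To be replaced by
// the real mechanism later.
public class RelevanceCheck {
    private static final List<String> TOPIC_KEYWORDS =
            Arrays.asList("nutch", "crawler", "search engine");
    private static final int MIN_HITS = 2;

    public static boolean isRelevant(String parsedText) {
        String text = parsedText.toLowerCase();
        int hits = 0;
        for (String keyword : TOPIC_KEYWORDS) {
            if (text.contains(keyword)) {
                hits++;
            }
        }
        return hits >= MIN_HITS;
    }
}
```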
b) an index filter (extension point IndexingFilter): simply do not
index the document; its URL still gets written to the crawldb.
Or a combination of these. So far, any objections or comments?
Now my questions:
1) I have already managed to implement a very basic indexing filter
plugin; however, the only way I have found to keep a document from
being indexed is to throw an IndexingException. What is the
recommended approach?
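For reference, this is roughly what my filter does now, but simplified: the types below are local stand-ins I wrote for this sketch, not the real Nutch extension-point signatures.

```java
// Simplified stand-ins for the Nutch types -- the real
// IndexingFilter extension point has its own signatures.
class Document {
    final String text;
    Document(String text) { this.text = text; }
}

class IndexingException extends Exception {
    IndexingException(String msg) { super(msg); }
}

public class TopicIndexingFilter {
    // Throwing is the only way I have found so far to keep a
    // document from being indexed -- hence question 1.
    public Document filter(Document doc) throws IndexingException {
        // placeholder relevance test, see above
        if (!doc.text.toLowerCase().contains("nutch")) {
            throw new IndexingException("document not relevant to topic");
        }
        return doc;
    }
}
```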
2) For solution a), I am looking for a way to daisy-chain the parse
filters: I would like my new parse filter to be called after parsing
HTML, PDF etc. Putting a plugin into parse-plugins.xml is exclusive,
right? So how do I do this?
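By "exclusive" I mean that, if I read the default file correctly, parse-plugins.xml maps each content type to a parser plugin, roughly like this (abridged):

```xml
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/pdf">
    <plugin id="parse-pdf" />
  </mimeType>
</parse-plugins>
```

so I do not see where a second, topic-checking filter would fit into that mapping.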
3) How do I change the score of the URLs in the crawldb? Some URLs
should perhaps not be ignored completely, but rather marked as
comparatively irrelevant.
Thanks for any help.
Sybille