Re: focused crawls -- where to add parse filter

Dennis Kubes Sun, 18 Feb 2007 19:13:22 -0800

Brian Whitman wrote:

How about an outlink filter that works during parse? In ParseOutputFormat,
it will take the parse text, parse data (etc.) of the source page and
the destination url then will either return "filter this outlink" or
"let it through".

Write an HtmlParseFilter that sets an attribute in the ParseData MetaData based on whether the page contains what you are looking for. Then write another MR job that runs after the crawl/index cycle. This job would need to update the CrawlDatum MetaData based on your priority calculation (inlinks and contains text, etc.). Then hack the Generator class around line 160 to change the sort value that it is using based on the CrawlDatum MetaData. I would make using this new sort value an option that you can turn on and off by using different configuration values.

Hi Doğacan, Dennis:

Thanks for the ideas. I spent some time mentally planning out how to implement both of these ideas by looking at the source. I'm still newish to Nutch so excuse my naiveté.

Do either of these approaches let me get at the analyzed/indexed contents of the page text so that I can perform Lucene queries for filtering? What I could tell of the HtmlParseFilter or Parse in general is that it gets me at the parse tree, which I could do regexp queries on -- but I'd rather it be all in Lucene and be influenced by the relative ranking of terms amongst all documents. I am envisioning machine generated queries from our classifiers that might be hundreds of tokens long with boost values per term, and a score threshold. So I'd need to act on the documents post-index. Unless I'm reading your suggestions incorrectly, neither of them let me at that?

You could drop the HtmlParseFilter part and simply write the post-crawl/index MR job to update the CrawlDatum based on your Lucene queries. You would still need to write the second part that does the generation based on a different sort value.
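
A rough sketch of what that post-index step computes, written in plain Python rather than against the Nutch or Lucene APIs. Everything named here is an assumption for illustration: `score_documents`, `update_crawl_metadata`, and the `focus.priority` metadata key are hypothetical, and the scoring is a toy tf-idf with per-term boosts standing in for a real Lucene query, not Lucene's actual formula.

```python
import math
from collections import Counter


def score_documents(docs, boosted_query):
    """Score each document against a boosted term query.

    docs: {url: page text}; boosted_query: {term: boost}.
    Toy tf-idf stand-in for a Lucene query, so that a term's weight
    depends on how common it is across all documents.
    """
    tokenized = {url: Counter(text.lower().split()) for url, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency of each query term across the whole collection.
    df = {t: sum(1 for c in tokenized.values() if t in c) for t in boosted_query}
    scores = {}
    for url, counts in tokenized.items():
        total = 0.0
        for term, boost in boosted_query.items():
            if df[term] == 0:
                continue  # term appears nowhere, contributes nothing
            idf = 1.0 + math.log(n_docs / df[term])
            total += boost * counts[term] * idf
        scores[url] = total
    return scores


def update_crawl_metadata(crawl_db, scores, threshold, key="focus.priority"):
    """Write the score into each entry's metadata; below threshold becomes 0.0."""
    for url, metadata in crawl_db.items():
        s = scores.get(url, 0.0)
        metadata[key] = s if s >= threshold else 0.0
    return crawl_db


docs = {
    "u1": "jazz guitar jazz",
    "u2": "guitar shop",
    "u3": "weather report",
}
query = {"jazz": 2.0, "guitar": 1.0}  # hypothetical machine-generated query
scores = score_documents(docs, query)
crawl_db = update_crawl_metadata({"u1": {}, "u2": {}, "u3": {}}, scores, threshold=2.0)
```

In a real setup the scoring half would be the Lucene query against the index and the update half would be the MR job over the CrawlDb; the sketch only shows how a score threshold turns into a per-URL value the generator can later sort on.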

I am currently looking at PruneIndexTool -- could a modification of this work? I could run it after a crawl/index cycle but before invertlinks and the next generate. The one issue I see is that PruneIndexTool claims not to affect the WebDB. Does this mean that even though the Lucene doc will be gone, the link and outlinks will remain in the WebDB and will be fetched anyway?

That is correct. You will need to alter the CrawlDb to affect what is generated and hence fetched.
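
To make the point concrete, here is a toy model in plain Python, not Nutch code. It assumes the fetch list is built only from CrawlDb entries, ranked by a sort value, and that the index is a separate structure. The names `generate`, `sort_value`, and the `focus.priority` key are hypothetical, and `use_priority` stands in for the kind of configuration switch suggested earlier in the thread.

```python
def sort_value(entry, use_priority, key="focus.priority"):
    """Pick the value the generator sorts on for one CrawlDb entry."""
    if use_priority and key in entry["metadata"]:
        return float(entry["metadata"][key])
    return entry["score"]  # fall back to the regular score


def generate(crawl_db, top_n, use_priority=False):
    """Return the top_n URLs to fetch, highest sort value first."""
    ranked = sorted(
        crawl_db.items(),
        key=lambda item: sort_value(item[1], use_priority),
        reverse=True,
    )
    return [url for url, _ in ranked[:top_n]]


crawl_db = {
    "http://a.example/": {"score": 1.0, "metadata": {"focus.priority": 0.1}},
    "http://b.example/": {"score": 0.5, "metadata": {"focus.priority": 0.9}},
    "http://c.example/": {"score": 0.7, "metadata": {}},
}
# The index after pruning: a.example has been removed from it,
# but generate() never consults the index.
pruned_index = {"http://b.example/", "http://c.example/"}

default_list = generate(crawl_db, top_n=2)
focused_list = generate(crawl_db, top_n=2, use_priority=True)
```

With the default sort, `a.example` is still selected even though it was pruned from the index; only when the CrawlDb metadata drives the sort does it drop out of the fetch list.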

If I should instead be looking harder at your recommended HtmlParseFilter or ParseOutputFormat, please correct me.

No, if you are doing complex queries instead of something like "if this page contains words x, y, and z" then I wouldn't do it through HtmlParseFilter. I would probably go with the Lucene after-index approach.


Dennis Kubes


-Brian
