> How about an outlink filter that works during parse? In
> ParseOutputFormat, it will take the parse text, parse data (etc.) of
> the source page and the destination url then will either return
> "filter this outlink" or "let it through".

> Write an HtmlParseFilter that sets an attribute in the ParseData
> MetaData based on whether the page contains what you are looking
> for. Then write another MR job that runs after the crawl/index
> cycle. This job would need to update the CrawlDatum MetaData based
> on your priority calculation (inlinks and contains text, etc.).
> Then hack the Generator class around line 160 to change the sort
> value that it is using based on the CrawlDatum MetaData. I would
> make using this new sort value an option that you can turn on and
> off by using different configuration values.

Hi Doğacan, Dennis:

Thanks for the ideas. I spent some time mentally planning out how to implement both of them by looking at the source. I'm still newish to Nutch, so excuse my naiveté.

Does either of these approaches let me get at the analyzed/indexed contents of the page text so that I can perform Lucene queries for filtering? As far as I can tell, HtmlParseFilter (or parsing in general) gets me the parse tree, which I could run regexp queries on, but I'd rather it all be in Lucene and be influenced by the relative ranking of terms amongst all documents. I am envisioning machine-generated queries from our classifiers that might be hundreds of tokens long, with boost values per term and a score threshold. So I'd need to act on the documents post-index. Unless I'm reading your suggestions incorrectly, neither of them lets me do that?

I am currently looking at PruneIndexTool. Could a modification of this work? I could run it after a crawl/index cycle but before invertlinks and the next generate. The one issue I see is that PruneIndexTool claims not to affect the WebDB. Does this mean that even though the Lucene doc will be gone, the links and outlinks will remain in the WebDB and will be fetched anyway?

If I should instead be looking harder at your recommended HtmlParseFilter or ParseOutputFormat, please correct me.
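
As a rough illustration of the "boost per term plus score threshold" filter described above, here is a minimal plain-Java sketch. It is not Lucene and not Nutch code: the class name BoostedQueryFilter, the methods score and keep, and the boost * sqrt(term frequency) formula are all assumptions made up for this example. In particular it ignores corpus-wide statistics such as idf, which is exactly the part that would need a real post-index Lucene query.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a boosted query with a score threshold.
// This is NOT Lucene scoring: there is no idf or length normalization.
public class BoostedQueryFilter {

    // Sums boost * sqrt(term frequency) over the query terms found in the text.
    // Query terms are expected to be lowercase already.
    static double score(Map<String, Double> boosts, String text) {
        Map<String, Integer> termFreq = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                termFreq.merge(token, 1, Integer::sum);
            }
        }
        double total = 0.0;
        for (Map.Entry<String, Double> e : boosts.entrySet()) {
            Integer freq = termFreq.get(e.getKey());
            if (freq != null) {
                total += e.getValue() * Math.sqrt(freq);
            }
        }
        return total;
    }

    // A page passes the filter when its score reaches the threshold.
    static boolean keep(Map<String, Double> boosts, String text, double threshold) {
        return score(boosts, text) >= threshold;
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = new HashMap<>();
        boosts.put("lucene", 2.0);
        boosts.put("crawl", 1.0);
        String page = "Lucene lucene lucene lucene crawl";
        System.out.println(score(boosts, page));
        System.out.println(keep(boosts, page, 4.5));
    }
}
```

A per-page score like this could be computed at parse time, but it would not reflect the relative ranking of terms across all documents, which is why the question above is about acting on the index itself.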

-Brian

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
