Brian Whitman wrote:
>
>> How about an outlink filter that works during parse? In
>> ParseOutputFormat, it will take the parse text, parse data (etc.) of
>> the source page and the destination url then will either return
>> "filter this outlink" or "let it through".
>
>> Write an HtmlParseFilter that sets an attribute in the ParseData
>> MetaData based on whether the page contains what you are looking for.
>> Then write another MR job that runs after the crawl/index cycle. This
>> job would need to update the CrawlDatum MetaData based on your
>> priority calculation (inlinks and contains text, etc.). Then hack the
>> Generator class around line 160 to change the sort value that it is
>> using based on the CrawlDatum MetaData. I would make using this new
>> sort value an option that you can turn on and off by using different
>> configuration values.
>
> Hi Doğacan, Dennis:
>
> Thanks for the ideas. I spent some time mentally planning out how to
> implement both of these ideas by looking at the source. I'm still newish
> to Nutch so excuse my naiveté.
>
> Do either of these approaches let me get at the analyzed/indexed
> contents of the page text so that I can perform Lucene queries for
> filtering? What I could tell of the HtmlParseFilter or Parse in general
> is that it gets me at the parse tree, which i could do regexp queries on
> -- but I'd rather it be all in Lucene and be influenced by the relative
> ranking of terms amongst all documents. I am envisioning machine
> generated queries from our classifiers that might be hundreds of tokens
> long with boost values per term, and a score threshold. So I'd need to
> act on the documents post-index. Unless I'm reading your suggestions
> incorrectly, neither of them let me at that?

You could drop the HtmlParseFilter part and simply write the post-crawl/index
MR job to update the CrawlDatum based on your Lucene queries. You would still
need to write the second part that does the generation based on a different
sort value.

>
> I am currently looking at PruneIndexTool -- could a modification of this
> work? I could run it after a crawl/index cycle but before invertlinks
> and the next generate. The one issue I see is that PruneIndexTool claims
> not to affect the WebDB. Does this mean that even though the lucene doc
> will be gone, the link and outlinks will remain in the WebDB and will be
> fetched anyway?

That is correct. You will need to alter the CrawlDb to affect what is
generated and hence fetched.

>
> If I should instead be looking harder at your recommended
> HtmlParseFilter or ParseOutputFormat, please correct me.

No. If you are doing complex queries, rather than something like "if this
page contains words x, y, and z", then I wouldn't do it through
HtmlParseFilter. I would probably go with the Lucene after-index approach.

Dennis Kubes

>
> -Brian
>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
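
For illustration, the two-step approach discussed in this thread (a post-index job that writes a priority into the CrawlDatum MetaData, then a generate step that can optionally sort on that value) might look roughly like the plain-Java sketch below. It deliberately uses no Nutch or Lucene classes: `Entry`, `PRIORITY_KEY`, `applyPriorities`, `sortValue`, and `generate` are hypothetical stand-ins, the priorities are assumed to have been computed already by whatever Lucene queries you run, and treating a missing priority as 0 is an arbitrary choice made only for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GenerateSortSketch {
    // Hypothetical metadata key; not a real Nutch constant.
    static final String PRIORITY_KEY = "query.priority";

    // Hypothetical stand-in for a CrawlDb entry (NOT Nutch's CrawlDatum).
    static final class Entry {
        final String url;
        final float score; // stand-in for the default sort value
        final Map<String, String> metaData = new HashMap<>();

        Entry(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    // Stand-in for the post-crawl/index job: copy externally computed
    // priorities (e.g. derived from Lucene query scores) into the metadata.
    static void applyPriorities(List<Entry> crawlDb, Map<String, Float> priorities) {
        for (Entry e : crawlDb) {
            Float p = priorities.get(e.url);
            if (p != null) {
                e.metaData.put(PRIORITY_KEY, Float.toString(p));
            }
        }
    }

    // The sort value is switchable, mirroring the "option that you can turn
    // on and off" idea. Entries without a priority get 0 in this sketch.
    static float sortValue(Entry e, boolean usePriority) {
        if (!usePriority) {
            return e.score;
        }
        String p = e.metaData.get(PRIORITY_KEY);
        return p == null ? 0.0f : Float.parseFloat(p);
    }

    // Stand-in for generation: pick the topN URLs by descending sort value.
    static List<String> generate(List<Entry> crawlDb, boolean usePriority, int topN) {
        List<Entry> sorted = new ArrayList<>(crawlDb);
        sorted.sort(Comparator
                .comparingDouble((Entry e) -> sortValue(e, usePriority))
                .reversed());
        List<String> urls = new ArrayList<>();
        for (Entry e : sorted.subList(0, Math.min(topN, sorted.size()))) {
            urls.add(e.url);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<Entry> crawlDb = new ArrayList<>();
        crawlDb.add(new Entry("http://a/", 1.0f));
        crawlDb.add(new Entry("http://b/", 0.5f));
        Map<String, Float> priorities = new HashMap<>();
        priorities.put("http://b/", 3.0f);
        applyPriorities(crawlDb, priorities);
        System.out.println(generate(crawlDb, false, 2)); // default ordering
        System.out.println(generate(crawlDb, true, 2));  // priority ordering
    }
}
```

In a real setup the flag would presumably come from the Nutch configuration rather than a method argument, and the metadata would live on the actual CrawlDatum; both details are left out here because they depend on the Nutch version in use.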
