Re: [Nutch-general] focused crawls -- where to add parse filter

Brian Whitman Sun, 18 Feb 2007 16:32:52 -0800

> How about an outlink filter that works during parse? In  
> ParseOutputFormat,
> it will take the parse text, parse data (etc.) of the source page and
> the destination url then will either return "filter this outlink" or
> "let it through".


> Write an HtmlParseFilter that sets an attribute in the ParseData  
> MetaData based on whether the page contains what you are looking  
> for. Then write another MR job that runs after the crawl/index  
> cycle.  This job would need to update the CrawlDatum MetaData based  
> on your priority calculation (inlinks and contains text, etc.).   
> Then hack the Generator class around line 160 to change the sort  
> value that it is using based on the CrawlDatum MetaData.  I would  
> make using this new sort value an option that you can turn on and  
> off by using different configuration values.

Hi Doğacan, Dennis:

Thanks for the ideas. I spent some time mentally planning out how to  
implement both of these ideas by looking at the source. I'm still  
newish to Nutch so excuse my naiveté.

Does either of these approaches let me get at the analyzed/indexed
contents of the page text so that I can perform Lucene queries for
filtering? As far as I can tell, HtmlParseFilter (or Parse in
general) gives me the parse tree, which I could run regexp
queries on -- but I'd rather it be all in Lucene and be influenced by
the relative ranking of terms amongst all documents. I am envisioning
machine-generated queries from our classifiers that might be hundreds
of tokens long with boost values per term, and a score threshold. So
I'd need to act on the documents post-index. Unless I'm reading your
suggestions incorrectly, neither of them gives me that?
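
To make the kind of filtering described above concrete, here is a
minimal sketch in plain Java. It deliberately uses no Lucene or Nutch
classes: the class name BoostedQueryFilter, its methods, and the
tf * idf formula are all assumptions made for illustration, and the
formula is only a rough stand-in for Lucene's real scoring. The point
is just that the score depends on term statistics across all
documents, which is why the filter has to run post-index.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch or Lucene.
class BoostedQueryFilter {

    // Count, for each term, how many documents contain it at least once.
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Sum of boost * tf * idf over the query terms present in the document.
    // The smoothed idf is an assumed simplification, not Lucene's formula.
    static double score(List<String> doc, Map<String, Double> boosts,
                        Map<String, Integer> df, int numDocs) {
        double total = 0.0;
        for (Map.Entry<String, Double> e : boosts.entrySet()) {
            int tf = Collections.frequency(doc, e.getKey());
            if (tf == 0) {
                continue;
            }
            int docFreq = df.getOrDefault(e.getKey(), 0);
            double idf = Math.log((numDocs + 1.0) / (docFreq + 1.0)) + 1.0;
            total += e.getValue() * tf * idf;
        }
        return total;
    }

    // A document survives the filter only if it reaches the threshold.
    static boolean keep(List<String> doc, Map<String, Double> boosts,
                        Map<String, Integer> df, int numDocs, double threshold) {
        return score(doc, boosts, df, numDocs) >= threshold;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("music", "guitar", "music"),
            Arrays.asList("stock", "market"),
            Arrays.asList("guitar", "market"));

        // Stand-in for a machine-generated query: term -> boost.
        Map<String, Double> boosts = new HashMap<>();
        boosts.put("music", 2.0);
        boosts.put("guitar", 1.0);

        Map<String, Integer> df = documentFrequencies(docs);
        for (int i = 0; i < docs.size(); i++) {
            boolean kept = keep(docs.get(i), boosts, df, docs.size(), 2.0);
            System.out.println("doc " + i + " kept: " + kept);
        }
    }
}
```

In a real setup the term-to-boost map would come from the classifier
and could hold hundreds of entries, and the scoring itself would be
left to Lucene rather than reimplemented like this.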


I am currently looking at PruneIndexTool -- could a modification of
this work? I could run it after a crawl/index cycle but before
invertlinks and the next generate. The one issue I see is that
PruneIndexTool claims not to affect the WebDB. Does this mean that
even though the Lucene doc will be gone, the link and its outlinks
will remain in the WebDB and will be fetched anyway?

If I should instead be looking harder at your recommended  
HtmlParseFilter or ParseOutputFormat, please correct me.

-Brian


