Chris Schneider wrote:
Nutch Users,
Does anyone have a tool or an easy method for removing URLs matching a
certain pattern from the MapReduce crawldb? For example, let's say
you've been crawling for a while, and then realize that you're
spending a lot of time trying to crawl bogus URLs with fake domains
like http://inherit-the-wind.ingrida.be/, so you add the following
line to your crawl-urlfilter.txt:
-\.ingrida\.be
No such tool exists yet, but your intuition is right. Actually I think
it would require only a minor modification to CrawlDb.update() to add
the ability to re-filter an existing crawldb. Where update() sets up
its input data, add only the existing crawldb (without adding any
segment data). Then in CrawlDbReducer.reduce() pass the
WritableComparable key (which is a URL in disguise) through URLFilters;
if it comes back null, return without collecting the result.
That's all. :-)
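To make the proposed reduce-time change concrete, here is a minimal,
self-contained sketch of the filtering logic. It does not use the real
Nutch or Hadoop APIs; the filter() method is a hypothetical stand-in
for URLFilters.filter(), and the loop stands in for the "return
instead of collect" behavior inside CrawlDbReducer.reduce():

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class RefilterSketch {
    // Hypothetical stand-in for Nutch's URLFilters.filter():
    // returns null when the URL matches a reject pattern,
    // otherwise returns the URL unchanged.
    static String filter(String url, Pattern reject) {
        return reject.matcher(url).find() ? null : url;
    }

    public static void main(String[] args) {
        // Corresponds to the "-\.ingrida\.be" line in crawl-urlfilter.txt.
        Pattern reject = Pattern.compile("\\.ingrida\\.be");
        // Stand-ins for the keys already present in the crawldb.
        List<String> crawldbKeys = Arrays.asList(
            "http://inherit-the-wind.ingrida.be/",
            "http://lucene.apache.org/nutch/");
        // Mimics the proposed reduce() change: keys the filters
        // turn into null are simply never collected into the output.
        for (String url : crawldbKeys) {
            if (filter(url, reject) != null) {
                System.out.println(url); // survives the re-filtered crawldb
            }
        }
    }
}
```

Running the update job with only the crawldb as input would then rewrite
the db through this filter pass, dropping every rejected key.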
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
-------------------------------------------------------
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general