Nutch Users,

Does anyone have a tool or an easy method for removing URLs matching a certain pattern from the MapReduce crawldb? For example, let's say you've been crawling for a while, then realize that you're spending a lot of time trying to crawl bogus URLs with fake domains like http://inherit-the-wind.ingrida.be/, so you add the following line to your crawl-urlfilter.txt:

-\.ingrida\.be

This will certainly prevent new URLs matching this pattern from being added to the crawldb, but it won't do anything about the URLs that are already in there. Because of this, such URLs (particularly if the OPIC algorithm scores them highly) can continue to dominate the fetch list.
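Absent an existing tool, here's a rough sketch of the kind of job I have in mind: a pass over the crawldb's <Text, CrawlDatum> entries that re-applies the configured URLFilters and keeps only the URLs that still pass. The class and method names below are my best guesses from reading the MapReduce branch, so treat this as a sketch rather than working code:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;

public class CrawlDbPruner extends MapReduceBase
    implements Mapper<Text, CrawlDatum, Text, CrawlDatum> {

  private URLFilters filters;

  public void configure(JobConf job) {
    filters = new URLFilters(job);
  }

  // Keep an entry only if the configured URL filters still accept it;
  // URLFilters.filter() returns null when any filter rejects the URL.
  public void map(Text url, CrawlDatum datum,
      OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
    try {
      if (filters.filter(url.toString()) != null) {
        output.collect(url, datum);
      }
    } catch (Exception e) {
      // Treat filter errors as rejections: drop the entry.
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new NutchJob(NutchConfiguration.create());
    job.setJobName("prune crawldb " + args[0]);
    // The crawldb's live data lives under <crawldb>/current.
    FileInputFormat.addInputPath(job, new Path(args[0], "current"));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setOutputFormat(MapFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(CrawlDatum.class);
    // No mapper-only shortcut here: the default IdentityReducer
    // re-sorts the output so MapFileOutputFormat is happy.
    job.setMapperClass(CrawlDbPruner.class);
    JobClient.runJob(job);
  }
}

The output directory would then have to be swapped in for <crawldb>/current, the way I believe CrawlDb.install() does after an updatedb. If something along these lines already exists, I'd much rather use that.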

I imagine that the PruneIndexTool can be used to remove the already-fetched pages from my indexes, but the presence of thousands of these URLs in my crawldb is apparently hurting the performance of my crawl.
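For the index side, this is roughly how I understand PruneIndexTool is driven (one Lucene query per line in the -queries file; I'm going from memory on the flags and the field name, so they may be off):

echo "host:ingrida.be" > prune-queries.txt
bin/nutch org.apache.nutch.tools.PruneIndexTool index -queries prune-queries.txt -dryrun

With -dryrun it should only report what would be deleted; dropping the flag would actually remove the matching documents. The crawldb, though, is the part I don't have an answer for.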

Thanks,

- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------

