Nutch Users,

Does anyone have a tool or an easy method for removing URLs matching a certain pattern from the MapReduce crawldb? For example, let's say you've been crawling for a while, then realize that you're spending a lot of time trying to crawl bogus URLs with fake domains like http://inherit-the-wind.ingrida.be/, so you add the following line to your crawl-urlfilter.txt:

-\.ingrida\.be

This will certainly prevent new URLs matching this pattern from being added to the crawldb, but it won't do anything about the URLs that are already in there. Because of this, such URLs (particularly if the OPIC algorithm scores them highly) can continue to dominate the fetch list.
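Absent an existing tool, here's a rough sketch of the kind of job I have in mind: a pass over the crawldb's <Text, CrawlDatum> entries that re-applies the configured URLFilters and keeps only the URLs that still pass. The class and method names below are my best guesses from reading the MapReduce branch, so treat this as a sketch rather than working code:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;

public class CrawlDbPruner extends MapReduceBase
    implements Mapper<Text, CrawlDatum, Text, CrawlDatum> {

  private URLFilters filters;

  public void configure(JobConf job) {
    filters = new URLFilters(job);
  }

  // Keep an entry only if the configured URL filters still accept it;
  // URLFilters.filter() returns null when any filter rejects the URL.
  public void map(Text url, CrawlDatum datum,
      OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
    try {
      if (filters.filter(url.toString()) != null) {
        output.collect(url, datum);
      }
    } catch (Exception e) {
      // Treat filter errors as rejections: drop the entry.
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new NutchJob(NutchConfiguration.create());
    job.setJobName("prune crawldb " + args[0]);
    // The crawldb's live data lives under <crawldb>/current.
    FileInputFormat.addInputPath(job, new Path(args[0], "current"));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setOutputFormat(MapFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(CrawlDatum.class);
    // No mapper-only shortcut here: the default IdentityReducer
    // re-sorts the output so MapFileOutputFormat is happy.
    job.setMapperClass(CrawlDbPruner.class);
    JobClient.runJob(job);
  }
}

The output directory would then have to be swapped in for <crawldb>/current, the way I believe CrawlDb.install() does after an updatedb. If something along these lines already exists, I'd much rather use that.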

I imagine that the PruneIndexTool can be used to remove the already-fetched pages from my indexes, but the presence of thousands of these URLs in my crawldb is apparently hurting the performance of my crawl.
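For the index side, this is roughly how I understand PruneIndexTool is driven (one Lucene query per line in the -queries file; I'm going from memory on the flags and the field name, so they may be off):

echo "host:ingrida.be" > prune-queries.txt
bin/nutch org.apache.nutch.tools.PruneIndexTool index -queries prune-queries.txt -dryrun

With -dryrun it should only report what would be deleted; dropping the flag would actually remove the matching documents. The crawldb, though, is the part I don't have an answer for.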

Thanks,

- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------

