Nutch Users,
Does anyone have a tool or an easy method for removing URLs matching
a certain pattern from the MapReduce crawldb? For example, let's say
you've been crawling for a while, and then realize that you're
spending a lot of time trying to crawl bogus URLs with fake domains
like http://inherit-the-wind.ingrida.be/, so you add the following
line to your crawl-urlfilter.txt:
-\.ingrida\.be
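(Assuming the regex-based URL filter, an anchored variant would also
catch the bare domain, not just its subdomains:
-^http://([a-z0-9-]+\.)*ingrida\.be/
though the leading-dot form above is enough to illustrate the point.)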
This will certainly prevent new URLs matching this pattern from being
added to the crawldb, but it won't do anything about the URLs that
are already in there. Because of this, such URLs (particularly if the
OPIC algorithm scores them highly) can continue to dominate the fetch
list.
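In case it helps frame the question, here's roughly the kind of
map-only job I have in mind: read the old crawldb, drop entries whose
URL matches the pattern, and write out a fresh one. This is only a
sketch, assuming the 0.8-style crawldb layout of <url, CrawlDatum>
pairs under crawldb/current; the key class may be UTF8 rather than
Text in older checkouts, and PruneCrawlDb and the prune.pattern
property are names I've made up here:

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.nutch.crawl.CrawlDatum;

public class PruneCrawlDb {

  /** Map-only pass: copy every <url, datum> entry except those whose
   *  URL matches the unwanted pattern. */
  public static class Pruner extends MapReduceBase
      implements Mapper<Text, CrawlDatum, Text, CrawlDatum> {

    private Pattern unwanted;

    public void configure(JobConf job) {
      unwanted = Pattern.compile(job.get("prune.pattern"));
    }

    public void map(Text url, CrawlDatum datum,
                    OutputCollector<Text, CrawlDatum> output,
                    Reporter reporter) throws IOException {
      if (!unwanted.matcher(url.toString()).find()) {
        output.collect(url, datum);  // keep only non-matching URLs
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(PruneCrawlDb.class);
    job.setJobName("prune-crawldb");
    job.set("prune.pattern", "\\.ingrida\\.be");  // pattern to drop

    FileInputFormat.addInputPath(job, new Path(args[0]));    // old crawldb/current
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // new crawldb/current

    job.setInputFormat(SequenceFileInputFormat.class);
    job.setOutputFormat(MapFileOutputFormat.class);  // crawldb parts are MapFiles
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(CrawlDatum.class);
    job.setMapperClass(Pruner.class);
    job.setNumReduceTasks(0);  // no shuffle needed; input is already sorted

    JobClient.runJob(job);
  }
}

After a sanity check with readdb, I'd swap the new output directory in
for crawldb/current by hand. But if there's already a tool that does
this, I'd much rather use that.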
I imagine the PruneIndexTool can be used to strip already-fetched
pages from my indexes, but the thousands of these URLs still sitting
in my crawldb are apparently hurting the performance of my crawl.
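For the index side, something like this is what I have in mind, with
a file of ordinary Lucene queries naming what to drop (the class name
and the -dryrun/-queries flags are from memory, so check the tool's
usage output for your version):

echo "site:ingrida.be" > prune-queries.txt
bin/nutch org.apache.nutch.tools.PruneIndexTool crawl/index -dryrun -queries prune-queries.txt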
Thanks,
- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------