Please try "CrawlDbMerger", This tool merges several CrawlDb-s into one, optionally filtering URLs through the current URLFilters, to skip prohibited pages.
-----Original Message-----
From: oddaniel [mailto:[EMAIL PROTECTED]
Sent: May 5, 2008 13:27
To: [email protected]
Subject: Someone Please respond ... Deleting Urls already crawled from the crawlDB

Guys, I have been trying to get this done for weeks now, with no progress. Someone please help me. I am trying to delete a domain that has already been crawled from my crawldb and index. I have a list of domains already crawled in my index. How do I exclude or delete domains from my crawl output folder? I have tried using crawl-urlfilter.txt:

  +^http://([a-z0-9]*\.)*
  -^http://([a-z0-9]*?\.)*remita.net

hoping it would exclude the domain remita.net from the crawldb and index while including all the other URLs. Then I ran LinkDbMerger, SegmentMerger, CrawlDbMerger, and IndexMerger. No change; all domains remain part of my output. Please, how can I get this done?
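On the filter rules themselves: a likely reason the exclusion had no effect, assuming the standard RegexURLFilter semantics where rules are tried top to bottom and the first match wins, is that the catch-all accept rule comes first, so the "-" rule for remita.net is never reached. A sketch with the rules reordered (using the remita.net domain from the question):

  # crawl-urlfilter.txt -- order matters: the first matching rule decides.
  # Reject remita.net and its subdomains before accepting everything else.
  -^http://([a-z0-9]*\.)*remita\.net
  +^http://([a-z0-9]*\.)*

Note also that editing crawl-urlfilter.txt does not retroactively remove entries already in the CrawlDb; the filters only take effect when a tool is run with filtering enabled, such as CrawlDbMerger with -filter as described above.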
