Please try "CrawlDbMerger", This tool merges several CrawlDb-s into one, optionally filtering URLs through the current URLFilters, to skip prohibited pages.
-----Original Message-----
From: oddaniel [mailto:[EMAIL PROTECTED]
Sent: May 5, 2008 13:27
To: [email protected]
Subject: Someone Please respond ... Deleting Urls already crawled from the crawlDB

Guys, I have been trying to get this done for weeks now, with no progress. Someone please help me. I am trying to delete a domain that has already been crawled from my crawldb and index. I have a list of domains already crawled in my index. How do I exclude or delete domains from my crawl output folder? I have tried using crawl-urlfilter.txt:

  +^http://([a-z0-9]*\.)*
  -^http://([a-z0-9]*?\.)*remita.net

hoping it would exclude the domain remita.net from the crawldb and index while including all the other URLs. Then I ran LinkDbMerger, SegmentMerger, CrawlDbMerger, and IndexMerger. No change; all domains remain part of my output. Please, how can I get this done?
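On the filter rules themselves: a likely reason the exclusion had no effect, assuming the standard RegexURLFilter semantics where rules are tried top to bottom and the first match wins, is that the catch-all accept rule comes first, so the "-" rule for remita.net is never reached. A sketch with the rules reordered (using the remita.net domain from the question):

  # crawl-urlfilter.txt -- order matters: the first matching rule decides.
  # Reject remita.net and its subdomains before accepting everything else.
  -^http://([a-z0-9]*\.)*remita\.net
  +^http://([a-z0-9]*\.)*

Note also that editing crawl-urlfilter.txt does not retroactively remove entries already in the CrawlDb; the filters only take effect when a tool is run with filtering enabled, such as CrawlDbMerger with -filter as described above.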
