It's my pleasure.

-----Original Message-----
From: oddaniel [mailto:[EMAIL PROTECTED]
Sent: May 5, 2008 20:39
To: [email protected]
Subject: Re: Re: Someone Please respond ... Deleting Urls already crawled from the crawlDB
Thanks. Problem solved.


wangkai wrote:
>
> Please try "CrawlDbMerger".
>
> This tool merges several CrawlDb-s into one, optionally filtering URLs
> through the current URLFilters to skip prohibited pages.
>
> It is possible to use this tool just for filtering - in that case only
> one CrawlDb should be specified in the arguments.
>
>
> -----Original Message-----
> From: oddaniel [mailto:[EMAIL PROTECTED]
> Sent: May 5, 2008 13:27
> To: [email protected]
> Subject: Someone Please respond ... Deleting Urls already crawled from the crawlDB
>
>
> Guys, I have been trying to get this done for weeks now with no
> progress. Someone please help me. I am trying to delete a domain that
> has already been crawled from my crawldb and index.
>
> I have a list of domains already crawled in my index. How do I exclude
> or delete domains from my crawl output folder? I have tried using
> crawl-urlfilter.txt:
>
> +^http://([a-z0-9]*\.)*
> -^http://([a-z0-9]*?\.)*remita.net
>
> I hoped this would exclude the domain remita.net from the crawldb and
> index while including all the other URLs. I then ran LinkDbMerger,
> SegmentMerger, CrawlDbMerger, and IndexMerger, but nothing changed:
> all domains remained in my output.
>
> Please, how can I get this done?
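For reference, wangkai's suggestion comes down to two pieces: a filter file whose rules are checked top to bottom (in Nutch's regex URL filters the first matching pattern decides a URL's fate, which is likely why the original attempt, with the catch-all "+" rule listed first, never excluded anything), plus a single run of CrawlDbMerger over one CrawlDb with filtering enabled. A minimal sketch, assuming a Nutch 0.9/1.x-era install where CrawlDbMerger is exposed as the "mergedb" command; the crawl directory names below are hypothetical:

  # crawl-urlfilter.txt: the exclude rule must come before the
  # catch-all accept rule, since the first match wins
  -^http://([a-z0-9]*\.)*remita.net
  +^http://([a-z0-9]*\.)*

  # "Merge" the single CrawlDb through the current URLFilters;
  # this writes a filtered copy to crawl/crawldb-filtered
  bin/nutch mergedb crawl/crawldb-filtered crawl/crawldb -filter

Note that this only cleans the CrawlDb; segments already fetched and indexes already built keep the old URLs unless they are rebuilt or run through their own merge tools (LinkDbMerger and SegmentMerger accept a similar -filter switch) with the same filter rules in place.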
