It's my pleasure.

-----Original Message-----
From: oddaniel [mailto:[EMAIL PROTECTED]
Sent: May 5, 2008 20:39
To: [email protected]
Subject: Re: Re: Someone Please respond ... Deleting Urls already crawled from the crawlDB
Thanks. Problem solved.


wangkai wrote:
>
> Please try "CrawlDbMerger".
>
> This tool merges several CrawlDb-s into one, optionally filtering URLs
> through the current URLFilters to skip prohibited pages.
>
> It is possible to use this tool just for filtering - in that case only
> one CrawlDb should be specified in the arguments.
>
>
> -----Original Message-----
> From: oddaniel [mailto:[EMAIL PROTECTED]
> Sent: May 5, 2008 13:27
> To: [email protected]
> Subject: Someone Please respond ... Deleting Urls already crawled from the crawlDB
>
>
> Guys, I have been trying to get this done for weeks now with no
> progress. Someone please help me. I am trying to delete a domain that
> has already been crawled from my crawldb and index.
>
> I have a list of domains already crawled in my index. How do I exclude
> or delete domains from my crawl output folder? I have tried using
> crawl-urlfilter.txt:
>
> +^http://([a-z0-9]*\.)*
> -^http://([a-z0-9]*?\.)*remita.net
>
> I hoped this would exclude the domain remita.net from the crawldb and
> index while including all the other URLs. I then ran LinkDbMerger,
> SegmentMerger, CrawlDbMerger, and IndexMerger, but nothing changed:
> all domains remained in my output.
>
> Please, how can I get this done?
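For reference, wangkai's suggestion comes down to two pieces: a filter file whose rules are checked top to bottom (in Nutch's regex URL filters the first matching pattern decides a URL's fate, which is likely why the original attempt, with the catch-all "+" rule listed first, never excluded anything), plus a single run of CrawlDbMerger over one CrawlDb with filtering enabled. A minimal sketch, assuming a Nutch 0.9/1.x-era install where CrawlDbMerger is exposed as the "mergedb" command; the crawl directory names below are hypothetical:

  # crawl-urlfilter.txt: the exclude rule must come before the
  # catch-all accept rule, since the first match wins
  -^http://([a-z0-9]*\.)*remita.net
  +^http://([a-z0-9]*\.)*

  # "Merge" the single CrawlDb through the current URLFilters;
  # this writes a filtered copy to crawl/crawldb-filtered
  bin/nutch mergedb crawl/crawldb-filtered crawl/crawldb -filter

Note that this only cleans the CrawlDb; segments already fetched and indexes already built keep the old URLs unless they are rebuilt or run through their own merge tools (LinkDbMerger and SegmentMerger accept a similar -filter switch) with the same filter rules in place.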
