Chetan Patel wrote:
Hi Doğacan Güney,

Thanks for the solution.

Is it possible to recrawl without removing files?


Yes and no. Once a segment has been fetched, there is no fetch-again or update mode. You can copy the segments to a new directory, rename the old directory, rename the new directory to the old name, remove everything except crawl_generate in the copied directory as Doğacan suggested, and then fetch again.
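The steps above can be sketched in Python roughly as follows. This is only an illustration and assumes the segments live on the local filesystem (not HDFS); the function name and the ".new" / ".old" suffixes are made up for the example.

```python
import shutil
from pathlib import Path

KEEP = "crawl_generate"  # the fetch list written by the generate step


def swap_in_unfetched_copy(segments_dir, backup_suffix=".old"):
    """Back up segments_dir and leave an unfetched copy under the original name.

    Hypothetical helper: the name and the suffixes are assumptions.
    """
    segments = Path(segments_dir)
    work = segments.with_name(segments.name + ".new")
    backup = segments.with_name(segments.name + backup_suffix)

    shutil.copytree(segments, work)  # 1. copy segments to a new directory
    segments.rename(backup)          # 2. rename the old directory
    work.rename(segments)            # 3. give the copy the old name

    # 4. in the copy, remove everything except crawl_generate
    for segment in segments.iterdir():
        if not segment.is_dir():
            continue
        for entry in segment.iterdir():
            if entry.name == KEEP:
                continue
            if entry.is_dir():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    return backup
```

The previously fetched data stays untouched in the backup directory, so nothing is lost if the new fetch goes wrong.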

Dennis

Thank you again.

Regards,
Chetan Patel


Doğacan Güney-3 wrote:
Hi,

On Mon, Sep 15, 2008 at 2:43 PM, Chetan Patel <[EMAIL PROTECTED]>
wrote:
Hi,

I tried the recrawl script available at
http://wiki.apache.org/nutch/IntranetRecrawl.

I got the following error.

2008-09-15 17:04:32,238 INFO  fetcher.Fetcher - Fetcher: starting
2008-09-15 17:04:32,254 INFO  fetcher.Fetcher - Fetcher: segment: google/segments/20080915170335
2008-09-15 17:04:32,972 FATAL fetcher.Fetcher - Fetcher: java.io.IOException: Segment already fetched!
       at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:46)
       at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:329)
       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
       at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
       at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
       at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
       at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)

2008-09-15 17:04:35,144 INFO  crawl.CrawlDb - CrawlDb update: starting

Please help me solve this error.

The segment you are trying to crawl has already been fetched. Try removing
everything but crawl_generate under that segment.
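For a segment on the local filesystem, that cleanup might look like the sketch below. The helper name is hypothetical, and it deletes data permanently, so it is safer to try it on a copy first.

```python
import shutil
from pathlib import Path


def strip_to_fetch_list(segment_dir, keep="crawl_generate"):
    """Delete every entry in segment_dir except `keep`; return the removed names.

    Hypothetical helper, not part of Nutch.
    """
    removed = []
    for entry in Path(segment_dir).iterdir():
        if entry.name == keep:
            continue  # leave the generated fetch list in place
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return sorted(removed)
```

For the segment in the log above, the call would be strip_to_fetch_list("google/segments/20080915170335"), after which the fetcher should accept the segment again.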

Thanks in advance

Regards,
Chetan Patel






--
Doğacan Güney


