Thanks for the response.
I tried this with nutch-0.9, but the script seems to be passing a
non-existent path to the fetcher:

Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt
exist : /user/nutch/-threads/crawl_generate
        at org.apache.hadoop.mapred.InputFormatBase.validateInput(
InputFormatBase.java:138)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
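Looking at the failing input path, /user/nutch/-threads/crawl_generate, it
reads as if the segment argument to the fetch step expanded empty, so the
"-threads" flag was consumed as the segment directory. A guard like the
sketch below would catch that before fetch runs (the segment path and
thread count here are hypothetical, not from the script):

```shell
# The broken input path "/user/nutch/-threads/crawl_generate" suggests the
# crawl script expanded an empty segment variable, so "-threads" became the
# segment directory. Checking the variable first avoids that.
segment="crawled/segments/20080113000000"   # hypothetical segment dir

if [ -n "$segment" ]; then
  cmd="bin/nutch fetch $segment -threads 10"
else
  echo "generate produced no segment; aborting" >&2
  exit 1
fi

echo "$cmd"
```

If the generate step produced nothing (or the script picked up the wrong
directory listing), the guard fails loudly instead of handing Hadoop a
bogus path.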



On Jan 12, 2008 9:00 PM, Susam Pal <[EMAIL PROTECTED]> wrote:

> You can try the crawl script: http://wiki.apache.org/nutch/Crawl
>
> Regards,
> Susam Pal
>
> On Jan 13, 2008 8:36 AM, Manoj Bist <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > When I run crawl the second time, it always complains that 'crawled'
> already
> > exists. I always need to remove this directory using 'hadoop dfs -rm
> > crawled' to get going.
> > Is there some way to avoid this error and tell nutch that it's a recrawl?
> >
> > bin/nutch crawl urls -dir crawled -depth 1  2>&1 | tee /tmp/foo.log
> >
> >
> > Exception in thread "main" java.lang.RuntimeException: crawled already
> > exists.
> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:85)
> >
> > Thanks,
> >
> > Manoj.
> >
> > --
> > Tired of reading blogs? Listen to  your favorite blogs at
> > http://www.blogbard.com   !!!!
> >
>
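On the quoted "crawled already exists" question: Crawl.java simply refuses
to reuse an existing output directory, so a re-run has to either remove it
or use a fresh -dir. A minimal wrapper along these lines (dir name and
commands taken from the quoted mail; DRY_RUN is my own addition so the
script can be exercised without a cluster) would avoid the manual cleanup:

```shell
# Sketch of a recrawl wrapper: Crawl.java aborts when the -dir target
# already exists, so clear it before re-running. DRY_RUN=1 only echoes
# the commands instead of executing them.
crawl_dir="crawled"
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

run hadoop dfs -rm "$crawl_dir"                      # clear previous crawl output
run bin/nutch crawl urls -dir "$crawl_dir" -depth 1  # re-run the crawl
```

Note this throws away the previous crawl data; for a true recrawl that
keeps the existing crawldb, the wiki script's generate/fetch/updatedb loop
is the better route.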



