The script creates a 'crawl' directory in the present working directory. Where is your Nutch directory, and from where are you running the script? I usually change to the top-level Nutch directory, put the script in the 'bin' directory, run 'chmod a+x bin/crawl', and then invoke it as 'bin/crawl'. With this setup, the crawl_generate directory is created in: crawl/segments/<segment-number>/crawl_generate (a typical segment-number: 20080102215525).
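To make the layout concrete, here is a small illustration-only sketch that recreates the directory structure described above so you can see where crawl_generate lands. In a real run these directories are created by 'bin/nutch generate', not by mkdir; the segment timestamp is the example from the thread.

```shell
#!/bin/sh
# Illustration only: mimic the layout bin/crawl produces under the current
# working directory. The real directories come from 'bin/nutch generate'.
segment=20080102215525            # example segment timestamp from the thread
mkdir -p "crawl/segments/$segment/crawl_generate"
find crawl -type d | sort
```

Running this from the top-level Nutch directory shows the same relative paths the fetcher later reads.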
Your error seems to come from this statement in the script:

    $NUTCH_HOME/bin/nutch fetch $segment -threads $threads

The fetcher tries to access $segment/crawl_generate at the beginning. In your case the fetcher is trying to open:

    /user/nutch/-threads/crawl_generate

So it seems the above statement resolved to:

    $NUTCH_HOME/bin/nutch fetch /user/nutch/-threads $threads

This means your $segment is /user/nutch, and a space is missing between $segment and -threads. Have you modified the script and altered the paths, but accidentally missed the space?

I hope this information and the script help you resolve the problem. Whatever the result is, please let us know; that would help us improve the script if needed.

Regards,
Susam Pal

On Jan 13, 2008 11:19 AM, Manoj Bist <[EMAIL PROTECTED]> wrote:
> Thanks for the response.
> I tried this with nutch-0.9. The script seems to be accessing non-existent
> file/dirs.
>
> Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt
> exist : /user/nutch/-threads/crawl_generate
>         at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
>
> On Jan 12, 2008 9:00 PM, Susam Pal <[EMAIL PROTECTED]> wrote:
>
> > You can try the crawl script: http://wiki.apache.org/nutch/Crawl
> >
> > Regards,
> > Susam Pal
> >
> > On Jan 13, 2008 8:36 AM, Manoj Bist <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > When I run crawl the second time, it always complains that 'crawled'
> > > already exists. I always need to remove this directory using
> > > 'hadoop dfs -rm crawled' to get going.
> > > Is there some way to avoid this error and tell Nutch that it's a
> > > recrawl?
> > >
> > > bin/nutch crawl urls -dir crawled -depth 1 2>&1 | tee /tmp/foo.log
> > >
> > > Exception in thread "main" java.lang.RuntimeException: crawled already
> > > exists.
> > >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:85)
> > >
> > > Thanks,
> > >
> > > Manoj.
> > >
> > > --
> > > Tired of reading blogs? Listen to your favorite blogs at
> > > http://www.blogbard.com !!!!
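The missing-space failure mode described in the reply above can be reproduced in isolation. This sketch (NUTCH_HOME and the segment path are hypothetical) only builds and echoes the two command lines rather than running Nutch, so you can compare what the fetcher would receive in each case:

```shell
#!/bin/sh
# Hypothetical values; nothing is actually fetched, we only print the
# command strings to compare them.
NUTCH_HOME=/opt/nutch
segment=crawl/segments/20080102215525
threads=10

# As the script writes it: the segment path and -threads stay separate
# arguments.
good="$NUTCH_HOME/bin/nutch fetch $segment -threads $threads"
echo "$good"

# With the space dropped, -threads fuses onto the segment path, so the
# fetcher is handed a single mangled segment argument.
bad="$NUTCH_HOME/bin/nutch fetch $segment-threads $threads"
echo "$bad"
```

Printing the resolved command line like this (or running the script with `sh -x`) is a quick way to catch such quoting and spacing slips before the fetcher fails.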

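On the original "crawled already exists" question quoted in the thread: 'bin/nutch crawl' refuses to write into an existing output directory, and the workaround Manoj already uses is to delete it first. A minimal local sketch of that pre-run cleanup follows; on HDFS the equivalent is the 'hadoop dfs -rm crawled' command from the thread, and the directory name 'crawled' comes from the -dir option shown above:

```shell
#!/bin/sh
# Sketch of the pre-run cleanup the thread describes: remove the leftover
# output directory before re-running the crawl. Local rm -rf stands in for
# the HDFS removal used in the thread.
dir=crawled
mkdir -p "$dir"                 # simulate a leftover directory from a prior run
if [ -d "$dir" ]; then
    rm -rf "$dir"
fi
if [ -d "$dir" ]; then
    echo "still exists"
else
    echo "removed: safe to re-run crawl"
fi
```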