Additional comments and a test case below.
On 5/25/06, Zaheed Haque <[EMAIL PROTECTED]> wrote:
> On 5/25/06, Jacob Brunson <[EMAIL PROTECTED]> wrote:
> > I looked at the referenced message at
> > http://www.mail-archive.com/[email protected]/msg03990.html
> > but I am still having problems.
> >
> > I am running the latest checkout from subversion.
> >
> > These are the commands which I've run:
> > bin/nutch crawl myurls/ -dir crawl -threads 4 -depth 3 -topN 10000
> bin/nutch crawl - is a one-shot command to fetch/generate/index a
> Nutch index. I would NOT recommend using this one-shot command.
That's funny, because when I look at the source code for the crawl
command, it does pretty much the same thing as the "whole web crawling" method.
> Please take the long route, which will give you more control over your
> tasks. The long route meaning: inject, generate, fetch, updatedb,
> index, dedup, merge. Please see the following -
> Whole web crawling...
> http://lucene.apache.org/nutch/tutorial8.html#Whole-web+Crawling
Yes, I've gone through that tutorial as well and followed it, and I'm
having the same problem. The tutorial does not describe how to add to
the original index. If you can help me figure this out, I would be
glad to add to the tutorial and make it more complete.
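
For what it's worth, my reading of Crawl.java is that the one-shot crawl
command is roughly the long route run in a loop, something like the sketch
below (the paths, depth and topN values are just placeholders, and I may
well be missing details):

# rough sketch of what "bin/nutch crawl urls -dir crawl -depth 3 -topN 10000"
# appears to do internally -- not a literal transcript
bin/nutch inject crawl/crawldb urls/
for i in 1 2 3; do                              # one round per -depth
  bin/nutch generate crawl/crawldb crawl/segments -topN 10000
  segment=`ls -d crawl/segments/2* | tail -1`   # newest segment
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
done
bin/nutch invertlinks crawl/linkdb crawl/segments/*
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
bin/nutch dedup crawl/indexes
bin/nutch merge crawl/index crawl/indexes

That's why I expected the step-by-step route to behave the same way.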
Just to be perfectly clear, this is the complete set of steps I take
to get the error (I'm running Java 1.5, and only the urls/ directory
exists at the beginning):
$ svn update
$ ant
$ bin/nutch inject crawl.test/crawldb urls/
$ bin/nutch generate crawl.test/crawldb crawl.test/segments -topN 20
$ lastsegment=`ls -d crawl.test/segments/2* | tail -1`
$ bin/nutch fetch $lastsegment
$ bin/nutch updatedb crawl.test/crawldb $lastsegment
$ bin/nutch invertlinks crawl.test/linkdb $lastsegment
$ bin/nutch index crawl.test/indexes crawl.test/crawldb crawl.test/linkdb $lastsegment
$ bin/nutch merge crawl.test/index crawl.test/indexes
$ bin/nutch generate crawl.test/crawldb crawl.test/segments -topN 20
$ lastsegment=`ls -d crawl.test/segments/2* | tail -1`
$ bin/nutch fetch $lastsegment
$ bin/nutch updatedb crawl.test/crawldb $lastsegment
$ bin/nutch invertlinks crawl.test/linkdb $lastsegment
$ bin/nutch index crawl.test/indexes crawl.test/crawldb crawl.test/linkdb $lastsegment
And at this point, I have my problem. I get the following output:
060525 171327 Indexer: adding segment: crawl.test/segments/20060525165518
Exception in thread "main" java.io.IOException: Output directory /home/nutch/nutch/crawl.test/indexes already exists.
at org.apache.hadoop.mapred.OutputFormatBase.checkOutputSpecs(OutputFormatBase.java:37)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:263)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:311)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)
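
From the stack trace it looks like the Indexer's map/reduce job simply
refuses to write into an output directory that already exists, so I'm
guessing that on the second round I either have to index into a fresh
directory and merge the indexes afterwards, or throw away the old
indexes and re-index every segment. Something like this, maybe
(untested, and I'm not sure it's the intended approach):

$ rm -rf crawl.test/indexes crawl.test/index   # clear the old index output first
$ bin/nutch index crawl.test/indexes crawl.test/crawldb crawl.test/linkdb crawl.test/segments/*
$ bin/nutch dedup crawl.test/indexes
$ bin/nutch merge crawl.test/index crawl.test/indexes

But that rebuilds everything from scratch rather than adding to the
original index, which is what I was hoping to avoid.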
So if you could help me figure out what I need to do differently, I
will be sure to update the tutorial on the wiki to help others who
might have the same problem.
Thanks,
Jacob