Hi,
I am using nutch-1.0-dev to crawl an Intranet.

My problem is recrawling. I found interesting pointers in a 
weblog post [1], but no solution on how to do a recrawl properly. 
There is a script [2] on the Nutch wiki, but it does not work 
with the Hadoop Distributed File System.

                               - o -

The script can be fixed by changing all the commands that use the 
local file system, for example:

  cp -R $mergesegs_dir/* $segments_dir
  rm -rf $mergesegs_dir

into the corresponding distributed file system versions:

  $nutch_dir/hadoop dfs -cp $mergesegs_dir/* $segments_dir
  $nutch_dir/hadoop dfs -rmr $mergesegs_dir

This might be the more reasonable thing to do.
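
For reference, the remaining local file system commands in the script
can be translated the same way. Below is a rough, untested sketch of
what the DFS versions might look like; the $crawl_dir and $index_dir
names and the parsing of the dfs -ls output are my assumptions (the
wiki script [2] may use different names), so adjust before using:

  # pick up the segment created by the latest 'nutch generate' run;
  # replaces `ls -d $segments_dir/* | tail -1` from the local version
  # (assumes the path is the last column of `hadoop dfs -ls` output)
  segment=`$nutch_dir/hadoop dfs -ls $segments_dir | tail -1 | awk '{print $NF}'`
  $nutch_dir/nutch fetch $segment
  $nutch_dir/nutch updatedb $crawl_dir/crawldb $segment

  # after 'nutch mergesegs', swap in the merged segments as shown above
  $nutch_dir/hadoop dfs -cp $mergesegs_dir/* $segments_dir
  $nutch_dir/hadoop dfs -rmr $mergesegs_dir

  # remove the old index before 'nutch index' recreates it;
  # replaces `rm -rf $index_dir` from the local version
  $nutch_dir/hadoop dfs -rmr $index_dir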

                               - o -

I also tried to relaunch the crawl command, but it does not work 
because the crawl output directory already exists. The exception 
comes from Crawl.java:

line 84    if (fs.exists(dir)) {
line 85      throw new RuntimeException(dir + " already exists.");
line 86    }

I tried removing these lines. You also need to delete the
crawled/index and crawled/indexes directories before running the
crawl command again.
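
For what it is worth, the DFS-side cleanup before re-running the
crawl looks something like this (assuming the crawl output directory
is called "crawled" as above and the seed list lives in "urls"; the
depth and topN values are just placeholders):

  # remove the old index so the indexing step can recreate it
  $nutch_dir/hadoop dfs -rmr crawled/index
  $nutch_dir/hadoop dfs -rmr crawled/indexes

  # re-run the crawl into the existing directory; this only works
  # once the existence check in Crawl.java has been removed
  $nutch_dir/nutch crawl urls -dir crawled -depth 3 -topN 1000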

What are the side effects of removing those lines from Crawl.java?

                               - o -



 [1] http://nutch.wordpress.com/2007/07/13/recrawling-and-merging/
 [2] http://wiki.apache.org/nutch/IntranetRecrawl
