Hi,
I am using nutch-1.0-dev to crawl an Intranet.
My problem is recrawling. I found some interesting pointers in a
weblog post [1], but no real solution for doing a recrawl properly.
There is a script [2] on the Nutch wiki, but it does not work
with the Hadoop Distributed File System (HDFS).
- o -
The script can be fixed by changing all the commands that use the
local file system, for example:
cp -R $mergesegs_dir/* $segments_dir
rm -rf $mergesegs_dir
into the corresponding distributed file system versions:
$nutch_dir/hadoop dfs -cp $mergesegs_dir/* $segments_dir
$nutch_dir/hadoop dfs -rmr $mergesegs_dir
This is probably the more reasonable approach.
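For reference, the segment-merging part of the script would end up
looking roughly like this (just a sketch; I am assuming $nutch_dir
points to Nutch's bin directory and that $segments_dir and
$mergesegs_dir are set as in the wiki script):
# merge the segments produced by the generate/fetch/updatedb loop
$nutch_dir/nutch mergesegs $mergesegs_dir -dir $segments_dir
# replace the old segments with the merged one, on HDFS
$nutch_dir/hadoop dfs -rmr $segments_dir
$nutch_dir/hadoop dfs -mkdir $segments_dir
$nutch_dir/hadoop dfs -cp $mergesegs_dir/* $segments_dir
$nutch_dir/hadoop dfs -rmr $mergesegs_dir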
- o -
I also tried to simply re-run the crawl command, but it does not work
because the crawl directory already exists. The exception comes from
Crawl.java:
line 84 if (fs.exists(dir)) {
line 85 throw new RuntimeException(dir + " already exists.");
line 86 }
I tried removing these lines. You also need to delete the
crawled/index and crawled/indexes directories before running the
crawl command again.
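With the patched Crawl.java, something like this should then work
(just a sketch; I am assuming the crawl directory is called "crawled",
the seed list is in "urls", and the depth/topN values are only
examples):
# remove the old indexes on HDFS, then re-run the crawl
$nutch_dir/hadoop dfs -rmr crawled/index
$nutch_dir/hadoop dfs -rmr crawled/indexes
$nutch_dir/nutch crawl urls -dir crawled -depth 3 -topN 1000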
What are the side effects of removing those lines from Crawl.java?
- o -
[1] http://nutch.wordpress.com/2007/07/13/recrawling-and-merging/
[2] http://wiki.apache.org/nutch/IntranetRecrawl