Stefan,
The nutch-user mailing list seems to be down, or at least unavailable to my personal account. I have spent several hours trying to create or adapt an intranet recrawl script for 0.8.0. I have it to the point where it no longer errors out; however, when I search using the recrawled database, no pages are returned (and no error is reported either). Could you look at the script (included in the attached email) and see if you notice any steps I'm missing or running in the wrong order? Thanks.
 Matt
--- Begin Message ---
Hi all,
I reworked the 0.7.2 recrawl script (http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html) to work with nutch-0.8.0-dev.

I thought I had refactored it completely, and it doesn't error out, but I must be calling some of the commands in the wrong order. Can you please take a look at it and see if you can spot what is wrong? Thanks.

       Matt

#!/bin/bash

# A simple script to run a Nutch re-crawl

if [ -n "$1" ]
then
 crawl_dir=$1
else
 echo "Usage: recrawl crawl_dir [depth] [adddays]"
 exit 1
fi

if [ -n "$2" ]
then
 depth=$2
else
 depth=5
fi

if [ -n "$3" ]
then
 adddays=$3
else
 adddays=0
fi

webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/newsegs
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

# -p avoids an error if the directory is left over from an earlier run
mkdir -p "$segments_dir"

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
 bin/nutch generate "$webdb_dir" "$segments_dir" -adddays "$adddays"
 # Pick the most recently generated segment (names sort chronologically)
 segment=$(ls -d "$segments_dir"/* | tail -n 1)
 bin/nutch fetch "$segment"
 bin/nutch updatedb "$webdb_dir" "$segment"
done

# Update segments
bin/nutch invertlinks $linkdb_dir -dir $segments_dir

# Index segments
new_indexes=$crawl_dir/newindexes
ls -d "$segments_dir"/* | tail -n "$depth" | xargs bin/nutch index "$new_indexes" "$webdb_dir" "$linkdb_dir"

# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup $new_indexes

# Merge indexes
bin/nutch merge $index_dir $new_indexes


--- End Message ---
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
