Stefan,
The nutch-user mailing list seems to be down, or at least unavailable
to my personal account. I have spent several hours creating/modifying
an intranet recrawl script for 0.8.0. It now runs without errors, but
when I search against the recrawled database, no pages are returned
(and no error is reported either). Can you look at the script (included
in the attached email) and see if you notice any steps I'm missing or
ordering incorrectly? Thanks.
Matt
--- Begin Message ---
Hi all,
I reworked the recrawl script for 0.7.2
(http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html)
for nutch-0.8.0-dev.
I thought I had it refactored completely, and it doesn't error out, but
I must be calling some of the commands in the wrong order. Can you
please take a look at it and see if you can spot what is wrong? Thanks.
Matt
#!/bin/bash
# A simple script to run a Nutch re-crawl
if [ -n "$1" ]
then
  crawl_dir=$1
else
  echo "Usage: recrawl crawl_dir [depth] [adddays]"
  exit 1
fi
if [ -n "$2" ]
then
  depth=$2
else
  depth=5
fi
if [ -n "$3" ]
then
  adddays=$3
else
  adddays=0
fi
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/newsegs
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index
mkdir $segments_dir
# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
done
# Update segments
bin/nutch invertlinks $linkdb_dir -dir $segments_dir
# Index segments
new_indexes=$crawl_dir/newindexes
ls -d $segments_dir/* | tail -$depth | xargs bin/nutch index \
  $new_indexes $webdb_dir $linkdb_dir
# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup $new_indexes
# Merge indexes
bin/nutch merge $index_dir $new_indexes
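Since nothing errors out, one way to narrow this down is to make each step
fail loudly when its output is missing. The following is a minimal sketch,
not part of Nutch: latest_segment and require_nonempty_dir are hypothetical
helper names, and the sketch assumes segment directory names are timestamps,
so that a lexical sort puts the newest one last.

```shell
#!/bin/bash
# Hypothetical helper (not part of Nutch): print the newest entry under a
# segments directory, or fail loudly if there is none.
latest_segment() {
  local dir=$1 newest
  # Assumption: segment names are timestamps, so lexical order is age order.
  newest=$(ls -d "$dir"/* 2>/dev/null | sort | tail -n 1)
  if [ -z "$newest" ]; then
    echo "no segments found in $dir" >&2
    return 1
  fi
  echo "$newest"
}

# Hypothetical helper: succeed only if the directory exists and is non-empty.
require_nonempty_dir() {
  if [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]; then
    return 0
  fi
  echo "expected a non-empty directory: $1" >&2
  return 1
}
```

In the loop, segment could then be set from latest_segment "$segments_dir"
(exiting if it fails), and require_nonempty_dir could be called on
$new_indexes before the dedup and merge steps, so an empty result shows up
at the step that produced it rather than at search time.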
--- End Message ---
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general