Hi all,
I reworked the recrawl script written for 0.7.2
(http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html)
to run on nutch-0.8.0-dev.
I thought I had refactored it completely, and it doesn't error out, but
I must be calling some of the commands in the wrong order. Can you
please take a look at it and see if you can spot what is wrong? Thanks.
Matt
#!/bin/bash
# A simple script to run a Nutch re-crawl
if [ -n "$1" ]
then
    crawl_dir=$1
else
    echo "Usage: recrawl crawl_dir [depth] [adddays]"
    exit 1
fi
if [ -n "$2" ]
then
    depth=$2
else
    depth=5
fi
if [ -n "$3" ]
then
    adddays=$3
else
    adddays=0
fi
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/newsegs
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index
# -p avoids an error if the directory is left over from an earlier run
mkdir -p "$segments_dir"
# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
    bin/nutch generate "$webdb_dir" "$segments_dir" -adddays "$adddays"
    # Take the last-listed segment, assumed to be the one just generated
    segment=$(ls -d "$segments_dir"/* | tail -n 1)
    bin/nutch fetch "$segment"
    bin/nutch updatedb "$webdb_dir" "$segment"
done
# Update the link database from the segments
bin/nutch invertlinks "$linkdb_dir" -dir "$segments_dir"
# Index segments
new_indexes=$crawl_dir/newindexes
ls -d "$segments_dir"/* | tail -n "$depth" | xargs bin/nutch index \
    "$new_indexes" "$webdb_dir" "$linkdb_dir"
# De-duplicate indexes
# (no "bogus" placeholder argument is passed in this version)
bin/nutch dedup "$new_indexes"
# Merge indexes
bin/nutch merge "$index_dir" "$new_indexes"
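
Since the script doesn't error out, one way to check the order the commands actually run in is to wrap each call in a small helper that prints the command before executing it. Below is a minimal sketch of that idea. The run function and the DRY_RUN variable are hypothetical names made up for illustration; neither is part of Nutch.

```shell
#!/bin/bash
# Hypothetical helper (not part of Nutch): print each command to stderr,
# then run it and abort the script at the first non-zero exit status.
# Setting DRY_RUN=1 (an assumed variable name) prints without executing.
run() {
    echo "+ $*" >&2
    if [ "${DRY_RUN:-0}" = "1" ]; then
        return 0
    fi
    "$@" || {
        status=$?
        echo "FAILED (exit $status): $*" >&2
        exit "$status"
    }
}
```

With this in place, a step such as the updatedb call could be written as run bin/nutch updatedb "$webdb_dir" "$segment", so the trace on stderr shows the exact sequence and arguments. Note that in dry-run mode the segment lookup would find nothing, because generate never runs, so the trace is mainly useful on a real run.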