Hi,
I believe this script will only work with Nutch running on the local
filesystem, not as a distributed system?
Attached are the two scripts I use for DFS.
BTW, my scripting skills are not that great, so remarks and improvements are
welcome.
############################### crawler.sh
#!/bin/bash
# limit the number of crawls
LIMIT=200000
# counter which counts the number of crawls
counter=1
# start the loop
while (( counter <= LIMIT ))
do
# get a new name for a new log
log="logs/crawl-$(date +%m-%d-%Y-%H%M%S).log"
# call the one_crawl.sh script
./one_crawl.sh > "$log"
# was there an error? exit
if [ "$?" != "0" ]; then
{
echo "ERROR!!! EXITING!!!"
exit 1;
}
fi
# check if a file stop.stop exists; if it does, exit
if [ -f "stop.stop" ]; then
{
echo "$(date +%m-%d-%Y-%H%M%S)> stop.stop file was found. DONE..." > $log
exit 0
}
fi
echo "$(date +%m-%d-%Y-%H%M%S)> LOOP $counter IS DONE..." >> $log
# increase counter by one
((counter += 1))
done
echo DONE......................
exit 0;
############################### DONE crawler.sh
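Both scripts poll for a file called stop.stop between steps, so you can stop
the loop cleanly from another shell. A minimal sketch, assuming crawler.sh is
started from ~/trunk (the directory one_crawl.sh also cd's into):
# request a clean stop after the current step finishes
cd ~/trunk
touch stop.stop
# remove the marker again before starting the next run
# rm stop.stop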
############################### one_crawl.sh
#!/bin/bash
CRAWLDB='crawldb'
LINKDB='linkdb'
SEGMENTS='segments'
INDEX='index'
INDEXES='indexes'
cd ~/trunk
pwd
# generate a segment
bin/nutch generate $CRAWLDB $SEGMENTS
# check if an error occurred
if [ "$?" != "0" ]; then
{
echo "ERROR!!! EXITING!!!"
exit 1;
}
fi
# if the file stop.stop exists, exit
if [ -f "stop.stop" ]; then
{
echo "$(date +%m-%d-%Y-%H%M%S)> stop.stop file was found. DONE..."
exit 0
}
fi
# get the name of the segment that was just generated (last entry in the DFS listing)
seg=$(bin/hadoop dfs -ls $SEGMENTS |tail -1| gawk '{ print $1 }')
# fetch the segment
bin/nutch fetch $seg
if [ "$?" != "0" ]; then
{
echo "ERROR!!! EXITING!!!"
exit 1;
}
fi
if [ -f "stop.stop" ]; then
{
echo "$(date +%m-%d-%Y-%H%M%S)> stop.stop file was found. DONE..."
exit 0
}
fi
# fetch is done, update crawldb
bin/nutch updatedb $CRAWLDB $seg -filter -normalize
if [ "$?" != "0" ]; then
{
echo "ERROR!!! EXITING!!!"
exit 1;
}
fi
if [ -f "stop.stop" ]; then
{
echo "$(date +%m-%d-%Y-%H%M%S)> stop.stop file was found. DONE..."
exit 0
}
fi
# update linkdb
bin/nutch invertlinks $LINKDB $seg
if [ "$?" != "0" ]; then
{
echo "ERROR!!! EXITING!!!"
exit 1;
}
fi
if [ -f "stop.stop" ]; then
{
echo "$(date +%m-%d-%Y-%H%M%S)> stop.stop file was found. DONE..."
exit 0
}
fi
############################### DONE one_crawl.sh
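Note that one_crawl.sh stops after invertlinks. If you also want to index each
cycle (as in the script further down this thread), a sketch of the extra step,
reusing the same error-check pattern and the INDEXES variable already defined
at the top of the script:
# index the new segment into $INDEXES
bin/nutch index $INDEXES $CRAWLDB $LINKDB $seg
if [ "$?" != "0" ]; then
{
echo "ERROR!!! EXITING!!!"
exit 1;
}
fi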
-----Original Message-----
From: Sean Dean [mailto:[EMAIL PROTECTED]
Sent: Sunday, January 28, 2007 2:44 PM
To: [email protected]
Subject: Re: Fetcher threads & automation
I actually keep my fetcher.threads.per.host value at the default 1, but have
tried up to 3 without any noticeable errors strictly based on this setting.
I guess this comes into play more if you're fetching from many of the same
hosts, in which case you might want to cheat and raise the setting a notch,
but in doing so you might see more HTTP-related errors, as you have witnessed.
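If you want to double-check what your installation is actually using, a quick
sketch (assuming the standard conf/ layout, where nutch-site.xml overrides
nutch-default.xml):
# show the effective fetcher.threads.per.host setting
grep -A 2 'fetcher.threads.per.host' conf/nutch-default.xml conf/nutch-site.xml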
Yes, it creates one segment and does all the work on it, then moves it to
another directory. When you run the script again, it deletes the old segment
data (to free up space, since it's copied anyway) and repeats the cycle on a
brand new segment.
Now that I look at it (I honestly just wrote that in the email to you
without testing), you should build on this instead:
--
#!/usr/local/bin/bash
# remove the previous cycle's segment and index data (it was already copied out)
rm -fdr crawl/segments crawl/indexes
# generate a new segment and pick up its directory name
bin/nutch generate crawl/crawldb crawl/segments
nseg=`ls -d crawl/segments/*`
# fetch the segment, update the crawldb, invert links, and index
bin/nutch fetch $nseg
bin/nutch updatedb crawl/crawldb $nseg
bin/nutch invertlinks crawl/linkdb $nseg
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $nseg
# copy the results out of the working area
cp -R crawl/indexes crawl/crawldb crawl/linkdb /tmp/nutch/crawl/
cp -R $nseg /tmp/nutch/crawl/segments/
--
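To automate it, one option is a cron entry. A hypothetical example, assuming
the script above is saved as recrawl.sh inside your Nutch directory (both the
script name and the path are placeholders):
# run one full cycle every 6 hours and append the output to a log
0 */6 * * * cd /usr/local/nutch && ./recrawl.sh >> logs/recrawl.log 2>&1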
You can use this tool to delete "any" document from your Nutch (Lucene)
index: http://www.getopt.org/luke/
----- Original Message ----
From: Justin Hartman <[EMAIL PROTECTED]>
To: [email protected]
Sent: Sunday, January 28, 2007 7:07:23 AM
Subject: Re: Fetcher threads & automation
Hi Sean
Firstly thanks for the input - it is much appreciated!
> 1. I would try anything between 100 and 300 threads when using the latest
trunk sources (I currently use 150). You don't really need that many
threads, and with too many you might run out of stack memory.
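As an aside, the thread count can also be passed per fetch run instead of via
the config; a sketch, assuming the -threads option of bin/nutch fetch, where
$segment stands for the segment directory produced by generate:
bin/nutch fetch $segment -threads 150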
What is your recommendation for threads per host? I was running 10,
but then I noticed one site that I was indexing returned a 500 server error
stating that "there were too many connections to localhost".
The last thing I want to do is create a DoS attack on web servers, so I
reduced this to 5, but I'm not sure what the recommended value is.
> 2. This isn't exactly what you wanted, but you can build upon it. It
should save you at least some time as it will complete one full cycle
(generate, fetch, updatedb, invertlinks, and index). Most of this is
basically what's listed in the tutorial; remember to edit it so that it
matches your paths and config.
When you say it will complete one full cycle, do you mean that only one
segment will be created, and that fetch, updatedb, invertlinks, and
index will then all run against that one segment?
One last question: can a URL be deleted from a segment and/or index once
it has been fetched, or will the whole index need to be re-created?
--
Regards
Justin Hartman
PGP Key ID: 102CC123