1. I would try anything between 100 and 300 threads when using the latest trunk sources (I currently use 150). You don't really need that many threads, and with too many you might run out of stack memory.

2. This isn't exactly what you wanted, but you can build upon it. It should save you at least some time, as it completes one full cycle (generate, fetch, updatedb, invertlinks, and index). Most of this is basically what's listed in the tutorial; remember to edit it so that it matches your paths and config. A rough sketch for looping it in the background follows after the script.

--
#!/usr/local/bin/bash

# Start from a clean set of segments, then run one full crawl cycle.
rm -fdr crawl/segments
bin/nutch generate crawl/crawldb crawl/segments

# Pick up the segment that generate just created (segment dirs are timestamped).
nseg=`ls -d crawl/segments/2* | tail -1`

bin/nutch fetch $nseg
bin/nutch updatedb crawl/crawldb $nseg
bin/nutch invertlinks crawl/linkdb $nseg

# Rebuild the index from scratch and copy the results out of the working area.
rm -fdr crawl/indexes
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $nseg
cp -R crawl/indexes crawl/crawldb crawl/linkdb $nseg /tmp/nutch/crawl/
--
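To run the whole thing unattended (your question 2), one option is to save the cycle above as a script and wrap it in a loop started with nohup, so it keeps going after you log out. This is just a sketch, not something I run myself; the names below (crawl_cycle.sh, the install path, the log file, the iteration count) are placeholders, so adjust them to your setup.

--
#!/usr/local/bin/bash
# crawl_loop.sh: repeat the full generate/fetch/updatedb/invertlinks/index
# cycle a fixed number of times, logging everything to one file.
# Assumes the cycle above was saved as crawl_cycle.sh next to this script.

cd /usr/local/nutch                     # placeholder: your Nutch directory
for i in 1 2 3 4 5; do                  # placeholder: number of cycles to run
    ./crawl_cycle.sh >> /tmp/nutch/crawl.log 2>&1
done
--

Start it with "nohup ./crawl_loop.sh &" (or from cron) and it will carry on in the background like a daemon while you worry about other things. If you would rather set the thread count per run instead of in the config, I believe the fetch step also takes a -threads option, e.g. "bin/nutch fetch $nseg -threads 150".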
----- Original Message -----
From: Justin Hartman <[EMAIL PROTECTED]>
To: [email protected]
Sent: Sunday, January 28, 2007 4:17:49 AM
Subject: Fetcher threads & automation

Hi all

Just have a couple more questions which remain unclear to me at this stage.

1. I'm fetching URLs on a P4 2.8GHz machine with 1GB RAM and a 100Mbps connection. Based on this config, what would you recommend the maximum number of fetcher threads should be?

2. Does anyone know of a script or plugin that can automate the segment/fetch/indexing process? Basically I'm fetching about 20 million pages and I have to run the segment, fetch and index steps myself in a shell (which takes some time). I would really like a shell script I can run so that the whole process runs as a daemon in the background while I worry about other issues.

Thank you in advance!

--
Regards
Justin Hartman
PGP Key ID: 102CC123
