1. I would try anything between 100 and 300 threads when using the latest trunk sources (I currently use 150). You don't really need that many threads, and with too many you might run out of stack memory.

2. This isn't exactly what you wanted, but you can build upon it. It should save you at least some time, as it completes one full cycle (generate, fetch, updatedb, invertlinks, and index). Most of this is basically what's listed in the tutorial; remember to edit it so that it matches your paths and config. A rough sketch for looping it in the background follows after the script.

--
#!/usr/local/bin/bash

# Start from a clean set of segments, then run one full crawl cycle.
rm -fdr crawl/segments
bin/nutch generate crawl/crawldb crawl/segments

# Pick up the segment that generate just created (segment dirs are timestamped).
nseg=`ls -d crawl/segments/2* | tail -1`

bin/nutch fetch $nseg
bin/nutch updatedb crawl/crawldb $nseg
bin/nutch invertlinks crawl/linkdb $nseg

# Rebuild the index from scratch and copy the results out of the working area.
rm -fdr crawl/indexes
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $nseg
cp -R crawl/indexes crawl/crawldb crawl/linkdb $nseg /tmp/nutch/crawl/
--
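To run the whole thing unattended (your question 2), one option is to save the cycle above as a script and wrap it in a loop started with nohup, so it keeps going after you log out. This is just a sketch, not something I run myself; the names below (crawl_cycle.sh, the install path, the log file, the iteration count) are placeholders, so adjust them to your setup.

--
#!/usr/local/bin/bash
# crawl_loop.sh: repeat the full generate/fetch/updatedb/invertlinks/index
# cycle a fixed number of times, logging everything to one file.
# Assumes the cycle above was saved as crawl_cycle.sh next to this script.

cd /usr/local/nutch                     # placeholder: your Nutch directory
for i in 1 2 3 4 5; do                  # placeholder: number of cycles to run
    ./crawl_cycle.sh >> /tmp/nutch/crawl.log 2>&1
done
--

Start it with "nohup ./crawl_loop.sh &" (or from cron) and it will carry on in the background like a daemon while you worry about other things. If you would rather set the thread count per run instead of in the config, I believe the fetch step also takes a -threads option, e.g. "bin/nutch fetch $nseg -threads 150".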
----- Original Message -----
From: Justin Hartman <[EMAIL PROTECTED]>
To: [email protected]
Sent: Sunday, January 28, 2007 4:17:49 AM
Subject: Fetcher threads & automation

Hi all

Just have a couple more questions which remain unclear to me at this stage.

1. I'm fetching URLs on a P4 2.8GHz machine with 1GB RAM and a 100Mbps connection. Based on this config, what would you recommend the maximum number of fetcher threads should be?

2. Does anyone know of a script or plugin that can automate the segment/fetch/indexing process? Basically I'm fetching about 20 million pages and I have to run the segment, fetch and index steps myself in a shell (which takes some time). I would really like a shell script I can run so that the whole process runs as a daemon in the background while I worry about other issues.

Thank you in advance!

--
Regards
Justin Hartman
PGP Key ID: 102CC123
