I actually keep my fetcher.threads.per.host value at the default of 1, but I 
have tried up to 3 without any noticeable errors attributable strictly to this 
setting. I guess it comes into play more if you're fetching many pages from 
the same hosts, in which case you might want to cheat and raise the setting a 
notch, but in doing so you might see more HTTP-related errors, as you have 
witnessed.
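
For reference, you override it in conf/nutch-site.xml with a property block 
along these lines (the value of 2 here is just an example, not a 
recommendation):

--

<property>
  <name>fetcher.threads.per.host</name>
  <value>2</value>
</property>

--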
 
Yes, it creates one segment and does all the work on it, then copies it to 
another directory. When you run the script again, it deletes the old segment 
data (to free up space, since it has already been copied) and repeats the 
cycle on a brand-new segment.
 
Now that I look at it (I honestly just wrote that in the email to you without 
testing), you should build on this instead:
 
--

#!/usr/local/bin/bash
# Remove the previous cycle's segment and index data
# (it was copied to /tmp/nutch/crawl/ at the end of the last run).
rm -rf crawl/segments crawl/indexes
# Generate a fresh segment from the crawldb...
bin/nutch generate crawl/crawldb crawl/segments
# ...and grab its path (it is the only segment at this point).
nseg=`ls -d crawl/segments/*`
bin/nutch fetch $nseg
bin/nutch updatedb crawl/crawldb $nseg
bin/nutch invertlinks crawl/linkdb $nseg
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $nseg
# Copy everything out of the working directory for safekeeping.
mkdir -p /tmp/nutch/crawl/segments
cp -R crawl/indexes crawl/crawldb crawl/linkdb /tmp/nutch/crawl/
cp -R $nseg /tmp/nutch/crawl/segments/

--
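
If you want this fully automated, you can run it from cron. As a sketch, 
assuming you saved the script as recrawl.sh in your Nutch directory (both 
names below are placeholders, adjust them to your setup), add a line like 
this via crontab -e to recrawl nightly at 2am:

--

0 2 * * * cd /path/to/nutch && ./recrawl.sh >> /tmp/recrawl.log 2>&1

--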
 
You can use Luke to delete any individual document from your Nutch (Lucene) 
index without rebuilding the whole thing: http://www.getopt.org/luke/
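
Luke ships as a runnable jar, so you can launch it with something along the 
lines below (the exact jar name depends on the version you download, luke.jar 
is just a stand-in), then open your crawl/indexes directory from the GUI and 
delete documents one by one:

--

java -jar luke.jar

--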


----- Original Message ----
From: Justin Hartman <[EMAIL PROTECTED]>
To: [email protected]
Sent: Sunday, January 28, 2007 7:07:23 AM
Subject: Re: Fetcher threads & automation


Hi Sean

Firstly thanks for the input - it is much appreciated!

> 1. I would try anything between 100 and 300 threads when using the latest 
> trunk sources (I currently use 150). You don't really need that many threads, 
> and with too many you might run out of stack memory.

What is your recommendation for threads per host? I was running 10,
but then I noticed that one site I was indexing returned a 500 server
error stating that "there were too many connections to localhost".

The last thing I want to do is mount a DoS attack on webservers, so I
reduced this to 5, but I'm not sure what the recommended value is.

> 2. This isn't exactly what you wanted, but you can build upon it. It should 
> save you at least some time as it will complete one full cycle (generate, 
> fetch, updatedb, invertlinks, and index). Most of this is basically what's 
> listed in the tutorial, and remember to edit it so that it matches your paths 
> and config.

When you say it will complete one full cycle, do you mean that only one
segment will be created, and then the fetch, updatedb, invertlinks, and
index steps all run against that one segment?

One last question: can a URL be deleted from a segment and/or the
index once it has been fetched, or will the whole index need to be
re-created?
-- 
Regards
Justin Hartman
PGP Key ID: 102CC123