Re: Speeding things up!

Sami Siren Sat, 28 Oct 2006 23:51:35 -0700

forgot one important one:

set "generate.max.per.host" to something reasonable so you won't end up fetching URLs from only a small number of hosts, which is very slow with the default settings.
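
As a rough sketch of what that override might look like in conf/nutch-site.xml (assuming the 0.8.x file layout with a <configuration> root; the value 100 is only an illustrative placeholder, not a recommendation from this thread):

```xml
<?xml version="1.0"?>
<!-- Sketch only: overrides placed in conf/nutch-site.xml take precedence
     over the defaults in nutch-default.xml. -->
<configuration>
  <property>
    <name>generate.max.per.host</name>
    <!-- 100 is an assumed example value; pick a cap that suits your crawl -->
    <value>100</value>
    <description>Maximum number of URLs per host in a single fetchlist.</description>
  </property>
</configuration>
```

With a cap like this, each generated fetchlist is spread over more hosts, so the per-host politeness delay is less likely to become the limiting factor.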


--
 Sami Siren

Sami Siren wrote:

Some simple rules for generally speeding things up
1. Crawl only the content you are going to handle (do not fetch, for example, PDF files if you don't need them; also disable all unneeded parsers).
2. If using regex-urlfilter: if you don't need the rule "-.*(/.+?)/.*?\1/.*?\1/", remove it (also keep the number of rules as small as possible, still remembering #1 and #3).
3. Check your parser configuration (see NUTCH-362) so your CPU won't end up parsing all kinds of binary content with the text parser.
You might also check variables like "fetcher.server.delay" and "fetcher.threads.per.host" (and remember to keep your fetcher polite!).
I am using something like 300 for "fetcher.threads" for fetching with 0.8.1 on a single Athlon 64 with 1 GB of memory.
I am also in the process of fixing some IO-related bottlenecks and will get back to that, hopefully sooner rather than later.
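
For illustration, the fetcher settings mentioned above might be overridden in conf/nutch-site.xml roughly as follows. This is a sketch under assumptions: the thread-count property is written here as "fetcher.threads.fetch", which is how it is believed to be named in nutch-default.xml (check your own copy); 300 is the figure quoted above, and the delay and per-host values are conservative examples chosen to stay polite, not tuned numbers.

```xml
<?xml version="1.0"?>
<!-- Sketch only: example overrides for conf/nutch-site.xml (0.8.x layout assumed). -->
<configuration>
  <property>
    <!-- Assumed full property name; verify against your nutch-default.xml -->
    <name>fetcher.threads.fetch</name>
    <!-- 300 is the figure quoted in this message -->
    <value>300</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <!-- 1 keeps at most one concurrent connection per host (assumed example value) -->
    <value>1</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <!-- Seconds to wait between requests to the same host (assumed example value) -->
    <value>5.0</value>
  </property>
</configuration>
```

A high total thread count only helps when the fetchlist covers many different hosts, since the per-host limit and delay still apply to each one.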
--
 Sami Siren




Marco Vanossi wrote:
Hi,

Do you have some hints that would improve speed for the following nutch
commands?

./nutch generate db segments -topN 10000000
s=`ls -d segments/2* | tail -1`
./nutch fetch $s
./nutch updatedb db $s
./nutch index $s
./nutch dedup segments tmpfile

I mean, do you have some hints for the numbers set in
nutch-default.xml for, for example:
fetcher.threads (I'm using 10.000), etc....
Let's say it is running on a machine with 12GB RAM, and 2.000GB HD.

Thank you very much for any help.

Marco
