I forgot one important one:

Set "generate.max.per.host" to something reasonable so you don't end up fetching URLs from only a small number of hosts, which is very slow with the default settings.
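
As a rough sketch, the override could go in conf/nutch-site.xml, inside the file's root element. The value 100 is only an illustrative assumption, not a recommendation; pick a limit that suits the number of hosts in your crawl.

```xml
<!-- Hypothetical override: cap how many URLs from a single host
     may end up in one generated fetchlist. The value 100 is an
     assumption chosen purely for illustration. -->
<property>
  <name>generate.max.per.host</name>
  <value>100</value>
</property>
```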

--
 Sami Siren

Sami Siren wrote:
Some simple rules for generally speeding things up

1. Crawl only the content you are going to handle (for example, do not fetch PDF files if you don't need them; also disable all unneeded parsers)

2. If using regex-urlfilter: if you don't need the rule "-.*(/.+?)/.*?\1/.*?\1/", remove it (also keep the number of rules as small as possible, while still remembering #1 and #3)

3. Check your parser configuration (see NUTCH-362) so your CPU won't end up parsing all kinds of binary content with the text parser.
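
For rule 1, parsers are switched on and off through the "plugin.includes" property (not named in this thread). A minimal sketch, assuming only plain-text and HTML parsing is needed: the plugin list below is an assumption, so start from the value in your own nutch-default.xml and remove the parse plugins you don't need.

```xml
<!-- Hypothetical override for conf/nutch-site.xml: enable only the
     text and HTML parsers. The rest of the list is an assumed example
     and must be adapted to the plugins your installation ships with. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```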

You might also check variables like "fetcher.server.delay" and "fetcher.threads.per.host". (And remember to keep your fetcher polite!)
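
For illustration, these two settings could be overridden in conf/nutch-site.xml roughly as follows. The values shown are conservative placeholders (I believe they match the shipped defaults, but check your own nutch-default.xml); they are assumptions, not tuning advice.

```xml
<!-- Seconds to wait between successive requests to the same server.
     5.0 is a placeholder; lowering it speeds up the fetch but makes
     the crawler less polite. -->
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
</property>

<!-- Maximum number of threads allowed to access one host at a time.
     1 is a placeholder that keeps the fetcher polite. -->
<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
</property>
```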

I am using something like 300 for "fetcher.threads" when fetching with 0.8.1 on a single Athlon 64 with 1 GB of memory.

I am also in the process of fixing some IO-related bottlenecks and will get back to that, hopefully sooner rather than later.

--
 Sami Siren




Marco Vanossi wrote:
Hi,

Do you have some hints that would improve speed for the following nutch
commands?

./nutch generate db segments -topN 10000000
s=`ls -d segments/2* | tail -1`
./nutch fetch $s
./nutch updatedb db $s
./nutch index $s
./nutch dedup segments tmpfile

I mean, do you have some hints for the numbers set in
nutch-default.xml, for example:
fetcher.threads (I'm using 10,000), etc.
Let's say it is running on a machine with 12 GB RAM and a 2,000 GB HD.

Thank you very much for any help.

Marco



