Some simple rules for generally speeding things up:

1. Crawl only the content you are going to handle (for example, do not 
fetch PDF files if you don't need them, and disable all unneeded 
parsers).

2. If you use regex-urlfilter and don't need the rule
"-.*(/.+?)/.*?\1/.*?\1/", remove it. Also keep the number of rules as 
small as possible, while still remembering #1 and #3.

3. Check your parser configuration (see NUTCH-362) so your CPU won't end 
up parsing all kinds of binary content with the text parser.
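
To illustrate rule #2, here is a minimal Python sketch of what that 
expression rejects. It assumes the filter applies the expression with 
"find" semantics (a match anywhere in the URL) and uses Python's re 
module as a stand-in for the regex engine Nutch actually uses. The 
example.com URLs, the helper name is_excluded and the suffix rule used 
for comparison are made up for illustration. The backreferences (\1) 
make the engine backtrack on every URL that does not match, which is 
presumably where the cost comes from.

```python
import re
import timeit

# The rule from the list above, without the leading "-" that marks an exclusion.
LOOP_RULE = re.compile(r".*(/.+?)/.*?\1/.*?\1/")

# Hypothetical cheap rule, used only as a timing baseline.
SUFFIX_RULE = re.compile(r"\.(gif|jpg|png|pdf)$")


def is_excluded(url: str) -> bool:
    """Return True if the loop rule would reject this URL.

    Assumption: the rule is applied as a search anywhere in the URL.
    """
    return LOOP_RULE.search(url) is not None


# A slash-delimited segment ("/foo") that shows up three times: rejected.
looping = "http://example.com/foo/bar/foo/baz/foo/index.html"
# No segment repeats three times: accepted, but only after backtracking.
normal = "http://example.com/foo/bar/baz/index.html"

if __name__ == "__main__":
    print("looping excluded:", is_excluded(looping))
    print("normal excluded: ", is_excluded(normal))

    # Rough cost comparison on the URL that does not match either rule.
    runs = 1000
    loop_s = timeit.timeit(lambda: LOOP_RULE.search(normal), number=runs)
    suffix_s = timeit.timeit(lambda: SUFFIX_RULE.search(normal), number=runs)
    print(f"loop rule:   {loop_s:.4f} s for {runs} URLs")
    print(f"suffix rule: {suffix_s:.4f} s for {runs} URLs")
```

Timings depend on the machine and the URL lengths, so the script prints 
them instead of assuming a number.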

You might also check variables like "fetcher.server.delay" and 
"fetcher.threads.per.host" (and remember to keep your fetcher polite!).

I am using something like 300 for "fetcher.threads" when fetching with 
0.8.1 on a single Athlon 64 with 1 GB of memory.
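
As a back-of-the-envelope check on how the delay setting limits what 
extra threads can buy you, here is a small Python sketch. It is a 
simplified model, not how the fetcher is implemented: it assumes one 
thread per host, a fixed delay between requests to the same host, and 
pages spread evenly over the hosts. The function names, the 5 second 
delay and the 1000 hosts are illustrative assumptions; the 10,000,000 
pages come from the -topN value in the question below.

```python
import math


def max_pages_per_hour_per_host(server_delay_s: float,
                                avg_response_s: float = 0.0) -> float:
    """Upper bound on pages fetched from ONE host per hour.

    Assumption: one thread per host, and each request costs the politeness
    delay plus the average response time.
    """
    if server_delay_s < 0 or avg_response_s < 0:
        raise ValueError("delays must be non-negative")
    cycle_s = server_delay_s + avg_response_s
    if cycle_s == 0:
        return math.inf
    return 3600.0 / cycle_s


def min_fetch_hours(total_pages: int, distinct_hosts: int,
                    server_delay_s: float,
                    avg_response_s: float = 0.0) -> float:
    """Lower bound on fetch time if pages are spread evenly over the hosts."""
    if distinct_hosts <= 0:
        raise ValueError("need at least one host")
    per_host = max_pages_per_hour_per_host(server_delay_s, avg_response_s)
    return total_pages / (per_host * distinct_hosts)


if __name__ == "__main__":
    # Illustrative values: 5 s delay, 1000 distinct hosts.
    print(max_pages_per_hour_per_host(5.0))                 # pages/hour/host
    print(round(min_fetch_hours(10_000_000, 1000, 5.0), 1))  # hours
```

Under this model, threads beyond the number of distinct hosts in the 
segment mostly sit waiting for the delay to expire.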

I am also in the process of fixing some IO-related bottlenecks and will 
get back to that, hopefully sooner rather than later.

--
  Sami Siren




Marco Vanossi wrote:
> Hi,
> 
> Do you have some hints that would improve speed for the following nutch
> commands?
> 
> ./nutch generate db segments -topN 10000000
> s=`ls -d segments/2* | tail -1`
> ./nutch fetch $s
> ./nutch updatedb db $s
> ./nutch index $s
> ./nutch dedup segments tmpfile
> 
> I mean, do you have some hints for the numbers set in
> nutch-default.xml for, for example:
> fetcher.threads (I'm using 10.000), etc....
> Let's say it is running on a machine with 12GB RAM, and 2.000GB HD.
> 
> Thank you very much for any help.
> 
> Marco
> 

