I forgot one important setting:

Set "generate.max.per.host" to something reasonable so you won't end up 
fetching URLs from only a small number of hosts, which is very slow with 
the default settings.
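
As a rough sketch of what that could look like, assuming you keep local 
overrides in conf/nutch-site.xml rather than editing nutch-default.xml 
directly (the value 100 is only a placeholder for illustration, not a 
tested recommendation; tune it to your own crawl):

```xml
<!-- conf/nutch-site.xml: local overrides for values in nutch-default.xml -->
<configuration>
  <property>
    <name>generate.max.per.host</name>
    <!-- Illustrative value only: caps how many URLs from one host
         can end up in a single generated fetch list. -->
    <value>100</value>
  </property>
</configuration>
```

With a cap like this, each generated segment should be spread over more 
hosts, so the per-host delay is less likely to dominate the fetch time.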

--
  Sami Siren

Sami Siren wrote:
> Some simple rules for generally speeding things up
> 
> 1. Crawl only the content you are going to handle (for example, do not 
> fetch pdf-files if you don't need them; also disable all unneeded 
> parsers)
> 
> 2. If using regex-urlfilter: If you don't need the rule
> "-.*(/.+?)/.*?\1/.*?\1/" remove it (also keep the number of rules as 
> small as possible still remembering #1 and #3)
> 
> 3. Check your parser configuration (SEE NUTCH-362) so your CPU won't end 
> up parsing all kinds of binary content with text parser.
> 
> You might also check the variables like "fetcher.server.delay" and 
> "fetcher.threads.per.host". (and remember to keep your fetcher polite!)
> 
> I am using something like 300 for "fetcher.threads" for fetching with 
> 0.8.1 on a single Athlon 64 with 1 GB of memory.
> 
> I am also in the process of fixing some IO related bottlenecks and will 
> get back to that hopefully sooner rather than later.
> 
> -- 
>  Sami Siren
> 
> 
> 
> 
> Marco Vanossi wrote:
>> Hi,
>>
>> Do you have some hints that would improve speed for the following nutch
>> commands?
>>
>> ./nutch generate db segments -topN 10000000
>> s=`ls -d segments/2* | tail -1`
>> ./nutch fetch $s
>> ./nutch updatedb db $s
>> ./nutch index $s
>> ./nutch dedup segments tmpfile
>>
>> I mean, do you have some hints for the numbers set in
>> nutch-default.xml, for example:
>> fetcher.threads (I'm using 10.000), etc....
>> Let's say it is running on a machine with 12GB RAM, and 2.000GB HD.
>>
>> Thank you very much for any help.
>>
>> Marco
>>
> 
> 


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general