Some simple rules for generally speeding things up:

1. Crawl only the content you are going to handle (for example, do not fetch PDF files if you don't need them; also disable all unneeded parsers).
2. If you are using regex-urlfilter and don't need the rule "-.*(/.+?)/.*?\1/.*?\1/", remove it. Also keep the number of rules as small as possible, while still respecting #1 and #3.
3. Check your parser configuration (see NUTCH-362) so your CPU won't end up parsing all kinds of binary content with the text parser.

You might also check variables like "fetcher.server.delay" and "fetcher.threads.per.host" (and remember to keep your fetcher polite!). I am using something like 300 for "fetcher.threads" when fetching with 0.8.1 on a single Athlon 64 with 1 GB of memory.

I am also in the process of fixing some IO-related bottlenecks and will get back to that, hopefully sooner rather than later.

--
Sami Siren

Marco Vanossi wrote:
> Hi,
>
> Do you have some hints that would improve speed for the following nutch
> commands?
>
> ./nutch generate db segments -topN 10000000
> s=`ls -d segments/2* | tail -1`
> ./nutch fetch $s
> ./nutch updatedb db $s
> ./nutch index $s
> ./nutch dedup segments tmpfile
>
> I mean, do you have some hints for the numbers set in
> nutch-default.xml, for example:
> fetcher.threads (I'm using 10.000), etc.
> Let's say it is running on a machine with 12GB RAM and 2.000GB HD.
>
> Thank you very much for any help.
>
> Marco

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
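
To make rule 2 in the reply above a bit more concrete, here is a rough sketch of what the "-.*(/.+?)/.*?\1/.*?\1/" rule appears to catch: URLs in which the same path segment shows up three times, which usually indicates a crawler loop. The sketch uses Python's re module purely for illustration (Nutch itself is written in Java). The helper name is_excluded and the example.com URLs are hypothetical, and the use of "match anywhere in the URL" semantics is an assumption, not something stated in the thread.

```python
import re

# The rule from regex-urlfilter without its leading "-"
# (the "-" only marks the rule as an exclude rule).
LOOP_RULE = re.compile(r".*(/.+?)/.*?\1/.*?\1/")


def is_excluded(url: str) -> bool:
    """Return True if the loop rule matches somewhere in the URL.

    Assumption: the pattern is applied with search semantics
    (a match anywhere in the URL), not as a full-string match.
    """
    return LOOP_RULE.search(url) is not None


# Hypothetical URLs, for illustration only.
looping = "http://example.com/a/b/a/c/a/"  # segment "/a" occurs three times
normal = "http://example.com/a/b/c/"       # no segment repeats

print(is_excluded(looping))
print(is_excluded(normal))
```

Because the pattern relies on backreferences (\1) and lazy quantifiers, the regex engine may have to try many candidate segments on every URL, which is presumably why dropping the rule helps when loop detection is not needed.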
