I've installed Nutch 0.8.1 on a single node (a P4 3.0 GHz dual-processor machine with 1.5 GB of RAM), and my configuration is

threads = 20
depth = 1000
topN = 1000

on Linux 2.6.9.
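
For reference, those settings correspond to a crawl command along these lines (the urls seed directory and output directory names here are placeholders, not my actual paths):

  bin/nutch crawl urls -dir crawl -threads 20 -depth 1000 -topN 1000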

I'm attempting an intranet crawl of a site with more than 50,000 pages.

I've launched the JVM with the usual heap option, JAVA_HEAP_MAX=-Xmx1600m.
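
To be concrete, I set it by editing the heap line in bin/nutch (I'm assuming the stock 0.8.1 script layout here; this is just the fragment I changed, with my value in place of the default):

  # bin/nutch -- maximum heap passed to the JVM
  JAVA_HEAP_MAX=-Xmx1600m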

After 10 hours of crawling, my logs contain a lot of OutOfMemoryError entries, such as:

2007-07-18 10:14:32,535 INFO fetcher.Fetcher - fetch of http://www.anci.it/stampa.cfm?layout=dettaglio&IdSez=2446&IdDett=5936 failed with: java.lang.OutOfMemoryError

and then the Nutch process died.

Has anyone run into this? I could increase JAVA_HEAP_MAX, but the problem would probably just reappear later. Is it related to my configuration, or is it a memory leak in the Fetcher class? Do I need to set up a distributed configuration with Hadoop?

Pierluigi D'Amadio





