Hi all,

I'm having some problems with the crawling component of Nutch.  I get an Out of 
Memory error when Nutch finishes updating the database during a crawl.  
This only happens when I start with a larger number of hosts and a large crawl 
depth.  The overall number of pages is still quite small (fewer than 100,000) 
because I'm crawling mobile content.  I'm starting the crawl with 
approx. 3,000 hosts and a crawl depth of 8.  The crawler runs successfully for 
about 5 to 6 hours and then throws an Out of Memory error when it finishes 
updating the database.  When this happens, there are four different segment 
directories in the segments folder.  I've included some of the logged output 
from just before the Out of Memory error is thrown.

Has anyone else ever had this problem? Any ideas on how to overcome this issue?

I've tried increasing the heap size and I'm writing all log output to a file 
rather than the console.
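
For anyone checking the same thing, here is a minimal sketch of how to confirm 
the heap limit the JVM actually sees.  It is not part of Nutch: the class name 
HeapCheck is made up for illustration, and it assumes the heap was raised with 
the standard -Xmx JVM option.

```java
// HeapCheck is a hypothetical helper for illustration; it is not part of Nutch.
class HeapCheck {

    // Returns the maximum amount of memory the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // Memory currently in use = memory allocated by the JVM minus the free part of it.
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L);
        System.out.println("max heap (MB):  " + maxHeapMegabytes());
        System.out.println("used heap (MB): " + usedMb);
    }
}
```

Running it with, for example, "java -Xmx512m HeapCheck" should report a maximum 
of roughly 512 MB (the exact figure varies by JVM).  If the number does not 
change when the option changes, the setting is probably not reaching the JVM.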

Any help would be greatly appreciated...

Thanks

Karen

LOG OUTPUT:

050721 175024 status: segment 20050721153152.486, 49558 pages, 4344 errors, 
155091562 bytes, 8269375 ms
050721 175024 status: 5.992956 pages/s, 146.5229 kb/s, 3129.496 bytes/page
050721 175025 Updating C:\nutch-0.6\crawl.test\db
050721 175027 Updating for C:\nutch-0.6\crawl.test\segments\20050721153152.486
050721 175027 Processing document 0
050721 175028 Processing document 1000
050721 175029 Processing document 2000
050721 175030 Processing document 3000
050721 175031 Processing document 4000
050721 175032 Processing document 5000
050721 175032 Processing document 6000
050721 175033 Processing document 7000
050721 175034 Processing document 8000
050721 175035 Processing document 9000
050721 175036 Processing document 10000
050721 175037 Processing document 11000
050721 175038 Processing document 12000
050721 175039 Processing document 13000
050721 175040 Processing document 14000
050721 175041 Processing document 15000
050721 175042 Processing document 16000
050721 175042 Processing document 17000
050721 175043 Processing document 18000
050721 175044 Processing document 19000
050721 175045 Processing document 20000
050721 175046 Processing document 21000
050721 175047 Processing document 22000
050721 175048 Processing document 23000
050721 175049 Processing document 24000
050721 175049 Processing document 25000
050721 175050 Processing document 26000
050721 175051 Processing document 27000
050721 175052 Processing document 28000
050721 175054 Processing document 29000
050721 175055 Processing document 30000
050721 175056 Processing document 31000
050721 175057 Processing document 32000
050721 175058 Processing document 33000
050721 175059 Processing document 34000
050721 175100 Processing document 35000
050721 175100 Processing document 36000
050721 175101 Processing document 37000
050721 175102 Processing document 38000
050721 175103 Processing document 39000
050721 175104 Processing document 40000
050721 175105 Processing document 41000
050721 175106 Processing document 42000
050721 175107 Processing document 43000
050721 175108 Processing document 44000
050721 175109 Processing document 45000
050721 175110 Processing document 46000
050721 175111 Processing document 47000
050721 175113 Processing document 48000
050721 175115 Processing document 49000
050721 175119 Processing document 50000
050721 175122 Processing document 51000
050721 175123 Processing document 52000
050721 175124 Processing document 53000
050721 175126 Finishing update
