Hi all, I'm having some problems with the crawling component of Nutch. I get an Out of Memory error when Nutch has finished updating the database during crawling. This only happens when I start with a larger number of hosts and a large crawl depth. However, the overall number of pages is still quite small (fewer than 100,000) because I'm crawling mobile content. I'm starting the crawl with approx. 3,000 hosts and a crawl depth of 8. The crawler runs successfully for about 5 to 6 hours and then throws an Out of Memory error when it finishes updating the database. When this happens, there are four different segment directories in the segments folder. I've included some of the logged output from just before the Out of Memory error is thrown.
Has anyone else ever had this problem? Any ideas on how to overcome this issue? I've tried increasing the heap size, and I'm writing all log output to a file rather than the console. Any help would be greatly appreciated...

Thanks,
Karen

LOG OUTPUT:

050721 175024 status: segment 20050721153152.486, 49558 pages, 4344 errors, 155091562 bytes, 8269375 ms
050721 175024 status: 5.992956 pages/s, 146.5229 kb/s, 3129.496 bytes/page
050721 175025 Updating C:\nutch-0.6\crawl.test\db
050721 175027 Updating for C:\nutch-0.6\crawl.test\segments\20050721153152.486
050721 175027 Processing document 0
050721 175028 Processing document 1000
050721 175029 Processing document 2000
050721 175030 Processing document 3000
050721 175031 Processing document 4000
050721 175032 Processing document 5000
050721 175032 Processing document 6000
050721 175033 Processing document 7000
050721 175034 Processing document 8000
050721 175035 Processing document 9000
050721 175036 Processing document 10000
050721 175037 Processing document 11000
050721 175038 Processing document 12000
050721 175039 Processing document 13000
050721 175040 Processing document 14000
050721 175041 Processing document 15000
050721 175042 Processing document 16000
050721 175042 Processing document 17000
050721 175043 Processing document 18000
050721 175044 Processing document 19000
050721 175045 Processing document 20000
050721 175046 Processing document 21000
050721 175047 Processing document 22000
050721 175048 Processing document 23000
050721 175049 Processing document 24000
050721 175049 Processing document 25000
050721 175050 Processing document 26000
050721 175051 Processing document 27000
050721 175052 Processing document 28000
050721 175054 Processing document 29000
050721 175055 Processing document 30000
050721 175056 Processing document 31000
050721 175057 Processing document 32000
050721 175058 Processing document 33000
050721 175059 Processing document 34000
050721 175100 Processing document 35000
050721 175100 Processing document 36000
050721 175101 Processing document 37000
050721 175102 Processing document 38000
050721 175103 Processing document 39000
050721 175104 Processing document 40000
050721 175105 Processing document 41000
050721 175106 Processing document 42000
050721 175107 Processing document 43000
050721 175108 Processing document 44000
050721 175109 Processing document 45000
050721 175110 Processing document 46000
050721 175111 Processing document 47000
050721 175113 Processing document 48000
050721 175115 Processing document 49000
050721 175119 Processing document 50000
050721 175122 Processing document 51000
050721 175123 Processing document 52000
050721 175124 Processing document 53000
050721 175126 Finishing update
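
As a sanity check on the two status lines in the log output above, here is a minimal Python sketch that recomputes the throughput figures from the raw counts. The function name `summarize` and the regular expression are my own and hypothetical, not part of Nutch. The reading of "kb/s" as kilobits per second with 1 kb = 1024 bits is an assumption inferred from the numbers, not taken from the Nutch source.

```python
import re

# First status line, copied verbatim from the log output.
STATUS_LINE = (
    "050721 175024 status: segment 20050721153152.486, "
    "49558 pages, 4344 errors, 155091562 bytes, 8269375 ms"
)

# Hypothetical pattern for the counts in a status line of this shape.
PATTERN = re.compile(r"(\d+) pages, (\d+) errors, (\d+) bytes, (\d+) ms")


def summarize(line):
    """Recompute the per-segment rates from the raw counts in a status line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a status line: %r" % line)
    pages, errors, nbytes, millis = (int(g) for g in match.groups())
    seconds = millis / 1000.0
    return {
        "pages": pages,
        "errors": errors,
        "hours": seconds / 3600.0,
        "pages_per_s": pages / seconds,
        # Assumption: "kb" means kilobits, with 1 kb = 1024 bits.
        "kbit_per_s": nbytes * 8 / 1024 / seconds,
        "bytes_per_page": nbytes / pages,
    }


if __name__ == "__main__":
    for name, value in summarize(STATUS_LINE).items():
        print(name, value)
```

Under those assumptions the recomputed rates agree with the second status line, and the 8269375 ms works out to roughly 2.3 hours of fetching for this one segment. At about 3 KB per page the fetched content itself is modest in size.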
