Hi all,

we are using the current trunk (as of 04.02.09) with the patch for CrawlDbMerger (NUTCH-683) applied manually. We're doing an inject - generate - fetch - parse - updatedb - invertlinks cycle at depth 1. With Fetcher2 we can run this cycle four times in a row without any problems.
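For context, this is roughly how we drive one cycle from the shell. The crawl/crawldb, crawl/segments, crawl/linkdb and urls paths are placeholders rather than our exact layout, and <segment> stands for the segment directory that the generate step creates, so treat it as a sketch, not our exact script:

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  # fetch step: Fetcher run shown; for the Fetcher2 runs we invoke org.apache.nutch.fetcher.Fetcher2 on the same segment
  bin/nutch fetch crawl/segments/<segment>
  bin/nutch parse crawl/segments/<segment>
  bin/nutch updatedb crawl/crawldb crawl/segments/<segment>
  bin/nutch invertlinks crawl/linkdb crawl/segments/<segment>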
On the fifth cycle, however, the Injector crashes with the following error log:

2009-02-12 00:00:05,015 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2009-02-12 00:00:05,023 INFO jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-02-12 00:00:05,358 INFO mapred.FileInputFormat - Total input paths to process : 2
2009-02-12 00:00:05,524 INFO mapred.JobClient - Running job: job_local_0002
2009-02-12 00:00:05,528 INFO mapred.FileInputFormat - Total input paths to process : 2
2009-02-12 00:00:05,553 INFO mapred.MapTask - numReduceTasks: 1
2009-02-12 00:00:05,554 INFO mapred.MapTask - io.sort.mb = 100
2009-02-12 00:00:05,828 INFO mapred.MapTask - data buffer = 79691776/99614720
2009-02-12 00:00:05,828 INFO mapred.MapTask - record buffer = 262144/327680
2009-02-12 00:00:06,538 INFO mapred.JobClient - map 0% reduce 0%
2009-02-12 00:00:07,262 WARN mapred.LocalJobRunner - job_local_0002
java.lang.RuntimeException: java.lang.NullPointerException
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
    at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
    at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
Caused by: java.lang.NullPointerException
    at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
    ... 13 more
2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
    at org.apache.nutch.crawl.Injector.run(Injector.java:190)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:180)

After that the crawldb is broken and can no longer be accessed, e.g. with the readdb <crawldb> -stats command. When we use Fetcher instead of Fetcher2 for exactly the same task, we can run as many cycles as we like without any problems or crashes.

Besides this error, we've observed that the fetch cycle with Fetcher is about twice as fast as with Fetcher2, although we use exactly the same settings in nutch-site.xml:

  generate.max.per.host    - 100
  fetcher.threads.per.host - 1
  fetcher.server.delay     - 0

for an initial URL list of 30 URLs on different hosts (see the nutch-site.xml excerpt in the P.S. below).

Has anybody observed similar errors or performance issues?

Kind regards,
Martina
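P.S. In case it helps with reproducing: below is a sketch of the relevant part of our nutch-site.xml. The three values are exactly the ones listed above; the surrounding <configuration>/<property> layout is just the usual Hadoop configuration format, not a verbatim copy of our file.

  <configuration>
    <!-- limit of URLs per host taken in each generate run -->
    <property>
      <name>generate.max.per.host</name>
      <value>100</value>
    </property>
    <!-- one fetcher thread per host -->
    <property>
      <name>fetcher.threads.per.host</name>
      <value>1</value>
    </property>
    <!-- no politeness delay between requests to the same server -->
    <property>
      <name>fetcher.server.delay</name>
      <value>0</value>
    </property>
  </configuration>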