On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina <k...@huberverlag.de> wrote:
> Hi all,
>
> we use the current trunk as of 2009-02-04, with the patch for CrawlDbMerger
> (NUTCH-683) applied manually.
> We're running an inject - generate - fetch - parse - updatedb - invertlinks
> cycle at depth 1.
> When we use Fetcher2, we can run this cycle four times in a row without any
> problems. When we start the fifth cycle, the Injector crashes with the
> following error log:
>
> 2009-02-12 00:00:05,015 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
> 2009-02-12 00:00:05,023 INFO jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> 2009-02-12 00:00:05,358 INFO mapred.FileInputFormat - Total input paths to process : 2
> 2009-02-12 00:00:05,524 INFO mapred.JobClient - Running job: job_local_0002
> 2009-02-12 00:00:05,528 INFO mapred.FileInputFormat - Total input paths to process : 2
> 2009-02-12 00:00:05,553 INFO mapred.MapTask - numReduceTasks: 1
> 2009-02-12 00:00:05,554 INFO mapred.MapTask - io.sort.mb = 100
> 2009-02-12 00:00:05,828 INFO mapred.MapTask - data buffer = 79691776/99614720
> 2009-02-12 00:00:05,828 INFO mapred.MapTask - record buffer = 262144/327680
> 2009-02-12 00:00:06,538 INFO mapred.JobClient - map 0% reduce 0%
> 2009-02-12 00:00:07,262 WARN mapred.LocalJobRunner - job_local_0002
> java.lang.RuntimeException: java.lang.NullPointerException
>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
>     at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
>     at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>     at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>     at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>     at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
> Caused by: java.lang.NullPointerException
>     at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
>     ... 13 more
> 2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>     at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
>     at org.apache.nutch.crawl.Injector.run(Injector.java:190)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.crawl.Injector.main(Injector.java:180)
>
> After that the crawldb is broken and can't be accessed, e.g. with the
> readdb <crawldb> -stats command.
> When we use Fetcher instead of Fetcher2 for exactly the same task, we can
> run as many cycles as we like without any problems or crashes.
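For reference, one pass of the depth-1 cycle described above corresponds
roughly to the following shell sketch. The crawl/ layout and the urls seed
directory are illustrative placeholders rather than paths from the report,
and it assumes bin/nutch maps a fetch2 command to
org.apache.nutch.fetcher.Fetcher2, as the trunk script of that era does:

#!/bin/sh
# One depth-1 pass: inject - generate - fetch - parse - updatedb - invertlinks.
# crawl/ and urls/ are illustrative placeholders.
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
segment=`ls -d crawl/segments/* | tail -1`     # pick the newest segment
bin/nutch fetch2 "$segment"                    # Fetcher2; use "fetch" for the old Fetcher
bin/nutch parse "$segment"                     # assumes fetcher.parse=false, so parsing happens here
bin/nutch updatedb crawl/crawldb "$segment"
bin/nutch invertlinks crawl/linkdb "$segment"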
>
> Besides this error, we've observed that the fetch cycle with Fetcher is
> about twice as fast as with Fetcher2, although we use exactly the same
> settings in nutch-site.xml:
>
> generate.max.per.host - 100
> fetcher.threads.per.host - 1
> fetcher.server.delay - 0
>
> for an initial URL list of 30 URLs on different hosts.
>
> Has anybody observed similar errors or performance issues?
>
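For completeness, here are those three overrides in conf/nutch-site.xml
form; a minimal sketch that writes a fresh file (merge by hand instead if
you carry other overrides):

# The three reported overrides, written as a fresh conf/nutch-site.xml.
# Note: this replaces the whole file; merge manually if you have other settings.
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>0</value>
  </property>
</configuration>
EOF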
Fetcher vs. Fetcher2 performance is a confusing issue. There have been
reports of each being faster than the other. Fetcher2 has a much more
flexible and smarter architecture than Fetcher, so I can only think that
some sort of bug in Fetcher2 degrades its performance.

However, your other problem (the Fetcher2 crash) is very weird. I went
through the Fetcher and Fetcher2 code, and there is nothing different
between them that would make one work and the other fail. Does this error
consistently happen if you try it again with Fetcher2 from scratch?

> Kind regards,
> Martina

--
Doğacan Güney
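To make the "from scratch" test above concrete, a minimal sketch: wipe the
crawl state, repeat the cycle five times with Fetcher2, and check the
crawldb after each pass to catch the first cycle that corrupts it. Here
cycle.sh is hypothetical (the depth-1 cycle sketched after the report,
using fetch2), and crawl/ is the same placeholder layout:

#!/bin/sh
# From-scratch test: fresh state, five Fetcher2 cycles, crawldb check each pass.
rm -rf crawl
for i in 1 2 3 4 5; do
  sh cycle.sh   # hypothetical script holding the depth-1 cycle sketched earlier
  bin/nutch readdb crawl/crawldb -stats \
    || { echo "crawldb unreadable after cycle $i"; exit 1; }
done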