Hi all,

we are using the current trunk (as of 04.02.09) with the patch for CrawlDbMerger
(NUTCH-683) applied manually.
We run an inject - generate - fetch - parse - updatedb - invertlinks cycle
at depth 1.
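
For reference, one cycle corresponds roughly to the following commands
(directory names are just placeholders; we start Fetcher2 through its fully
qualified class name):

   bin/nutch inject crawl/crawldb urls
   bin/nutch generate crawl/crawldb crawl/segments
   SEGMENT=`ls -d crawl/segments/* | tail -1`
   bin/nutch org.apache.nutch.fetcher.Fetcher2 $SEGMENT
   bin/nutch parse $SEGMENT
   bin/nutch updatedb crawl/crawldb $SEGMENT
   bin/nutch invertlinks crawl/linkdb -dir crawl/segments

For the runs with the old Fetcher we simply replace the Fetcher2 line with
"bin/nutch fetch $SEGMENT".
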
When we use Fetcher2, we can run this cycle four times in a row without any
problems. When we start the fifth cycle, the Injector crashes with the following
error log:

2009-02-12 00:00:05,015 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
2009-02-12 00:00:05,023 INFO  jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-02-12 00:00:05,358 INFO  mapred.FileInputFormat - Total input paths to process : 2
2009-02-12 00:00:05,524 INFO  mapred.JobClient - Running job: job_local_0002
2009-02-12 00:00:05,528 INFO  mapred.FileInputFormat - Total input paths to process : 2
2009-02-12 00:00:05,553 INFO  mapred.MapTask - numReduceTasks: 1
2009-02-12 00:00:05,554 INFO  mapred.MapTask - io.sort.mb = 100
2009-02-12 00:00:05,828 INFO  mapred.MapTask - data buffer = 79691776/99614720
2009-02-12 00:00:05,828 INFO  mapred.MapTask - record buffer = 262144/327680
2009-02-12 00:00:06,538 INFO  mapred.JobClient -  map 0% reduce 0%
2009-02-12 00:00:07,262 WARN  mapred.LocalJobRunner - job_local_0002
java.lang.RuntimeException: java.lang.NullPointerException
       at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
       at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
       at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
       at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
       at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
       at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
       at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
Caused by: java.lang.NullPointerException
       at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
       at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
       ... 13 more
2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: java.io.IOException: Job failed!
       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
       at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
       at org.apache.nutch.crawl.Injector.run(Injector.java:190)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
       at org.apache.nutch.crawl.Injector.main(Injector.java:180)

After that the crawldb is broken and can no longer be read, e.g. with
bin/nutch readdb <crawldb> -stats.
When we use Fetcher instead of Fetcher2 for exactly the same task, we can run as
many cycles as we like without any problems or crashes.

Besides this error, we've observed that the fetch cycle with Fetcher is about
twice as fast as with Fetcher2, even though we use exactly the same settings in
nutch-site.xml:
generate.max.per.host  - 100
fetcher.threads.per.host - 1
fetcher.server.delay - 0
for an initial seed list of 30 URLs on different hosts.
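
(In nutch-site.xml these are ordinary Hadoop-style property entries, i.e. roughly:

   <property>
     <name>generate.max.per.host</name>
     <value>100</value>
   </property>
   <property>
     <name>fetcher.threads.per.host</name>
     <value>1</value>
   </property>
   <property>
     <name>fetcher.server.delay</name>
     <value>0</value>
   </property>
)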

Has anybody observed similar errors or performance issues?

Kind regards,
Martina
