On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina <k...@huberverlag.de> wrote:
> Hi all,
>
> We use the current trunk as of 04.02.09, with the patch for CrawlDbMerger
> (NUTCH-683) applied manually.
> We're doing an inject - generate - fetch - parse - updatedb - invertlinks 
> cycle at depth 1.
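>
> One cycle corresponds roughly to the following commands (paths are
> placeholders):
>
>   bin/nutch inject crawl/crawldb urls
>   bin/nutch generate crawl/crawldb crawl/segments
>   bin/nutch fetch <segment>            # or Fetcher2, depending on the run
>   bin/nutch parse <segment>
>   bin/nutch updatedb crawl/crawldb <segment>
>   bin/nutch invertlinks crawl/linkdb -dir crawl/segments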
> When we use Fetcher2, we can run this cycle four times in a row without any
> problems. When we start the fifth cycle, the Injector crashes with the
> following error log:
>
> 2009-02-12 00:00:05,015 INFO  crawl.Injector - Injector: Merging injected 
> urls into crawl db.
> 2009-02-12 00:00:05,023 INFO  jvm.JvmMetrics - Cannot initialize JVM Metrics 
> with processName=JobTracker, sessionId= - already initialized
> 2009-02-12 00:00:05,358 INFO  mapred.FileInputFormat - Total input paths to 
> process : 2
> 2009-02-12 00:00:05,524 INFO  mapred.JobClient - Running job: job_local_0002
> 2009-02-12 00:00:05,528 INFO  mapred.FileInputFormat - Total input paths to 
> process : 2
> 2009-02-12 00:00:05,553 INFO  mapred.MapTask - numReduceTasks: 1
> 2009-02-12 00:00:05,554 INFO  mapred.MapTask - io.sort.mb = 100
> 2009-02-12 00:00:05,828 INFO  mapred.MapTask - data buffer = 79691776/99614720
> 2009-02-12 00:00:05,828 INFO  mapred.MapTask - record buffer = 262144/327680
> 2009-02-12 00:00:06,538 INFO  mapred.JobClient -  map 0% reduce 0%
> 2009-02-12 00:00:07,262 WARN  mapred.LocalJobRunner - job_local_0002
> java.lang.RuntimeException: java.lang.NullPointerException
>       at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
>       at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
>       at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
>       at 
> org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>       at 
> org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>       at 
> org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>       at 
> org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>       at 
> org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>       at 
> org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>       at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
>       at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
>       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>       at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
> Caused by: java.lang.NullPointerException
>       at 
> java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>       at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
>       ... 13 more
> 2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: java.io.IOException: 
> Job failed!
>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>       at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
>       at org.apache.nutch.crawl.Injector.run(Injector.java:190)
>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>       at org.apache.nutch.crawl.Injector.main(Injector.java:180)
>
> After that, the crawldb is broken and can no longer be accessed, e.g. with
> the readdb <crawldb> -stats command.
> When we use Fetcher instead of Fetcher2 for exactly the same task, we can
> run as many cycles as we like without any problems or crashes.
>
> Besides this error, we've observed that for an initial url list of 30 URLs
> from different hosts, the fetch cycle with Fetcher is about twice as fast
> as with Fetcher2, although we use exactly the same settings in
> nutch-site.xml:
>
> generate.max.per.host    - 100
> fetcher.threads.per.host - 1
> fetcher.server.delay     - 0
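>
> That is, in nutch-site.xml:
>
>   <property>
>     <name>generate.max.per.host</name>
>     <value>100</value>
>   </property>
>   <property>
>     <name>fetcher.threads.per.host</name>
>     <value>1</value>
>   </property>
>   <property>
>     <name>fetcher.server.delay</name>
>     <value>0</value>
>   </property>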
>
> Has anybody observed similar errors or performance issues?
>

Fetcher vs. Fetcher2 performance is a confusing issue. There have been
reports of each being faster than the other. Fetcher2 has a much more
flexible and smarter architecture than Fetcher, so I can only think that
some sort of bug in Fetcher2 degrades performance.
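
As far as throughput goes, the main structural difference is that Fetcher2
manages explicit per-host queues with a politeness delay, while the old
Fetcher essentially streams the (host-partitioned) fetch list to its
threads. Here is a toy sketch of the queueing idea -- class and method
names are invented, this is not the actual Fetcher2 code:

  import java.util.HashMap;
  import java.util.LinkedList;
  import java.util.Map;

  // Toy model of per-host politeness queues: a host may only be fetched
  // again once its delay window has expired. All names are made up.
  public class PerHostQueues {
    private final Map<String, LinkedList<String>> queues =
        new HashMap<String, LinkedList<String>>();
    private final Map<String, Long> nextFetchTime =
        new HashMap<String, Long>();
    private final long serverDelayMs;

    public PerHostQueues(long serverDelayMs) {
      this.serverDelayMs = serverDelayMs;
    }

    public synchronized void add(String host, String url) {
      LinkedList<String> q = queues.get(host);
      if (q == null) {
        q = new LinkedList<String>();
        queues.put(host, q);
      }
      q.add(url);
    }

    // Hand out a URL only if its host is allowed to be hit again;
    // returns null if every non-empty queue is still cooling down.
    public synchronized String next() {
      long now = System.currentTimeMillis();
      for (Map.Entry<String, LinkedList<String>> e : queues.entrySet()) {
        Long ready = nextFetchTime.get(e.getKey());
        if (!e.getValue().isEmpty()
            && (ready == null || ready.longValue() <= now)) {
          nextFetchTime.put(e.getKey(), Long.valueOf(now + serverDelayMs));
          return e.getValue().removeFirst();
        }
      }
      return null;
    }
  }

If every non-empty queue is inside its delay window, fetcher threads sit
idle, which is one plausible way to lose throughput even with
fetcher.server.delay = 0 (for example, if a Crawl-Delay from robots.txt
overrides the configured delay).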

However, your other problem (the Injector crash after Fetcher2 runs) is
very weird. I went through the Fetcher and Fetcher2 code, and there is
nothing different between them that would make one work and the other
fail. Does this error happen consistently if you try again with Fetcher2
from scratch?
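
One hint the trace does give: ConcurrentHashMap.get() throws an NPE only
when the key is null, and in ReflectionUtils.newInstance() that key is the
Class object itself. So MapWritable.readFields() apparently resolved a
class id from the CrawlDatum metadata to null, which points at the
metadata map stored in the crawldb rather than at the fetcher logic
itself, and would also explain why readdb fails afterwards. A
stripped-down illustration of that failure path -- just the shape of it,
not Hadoop's actual code:

  import java.util.HashMap;
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  // MapWritable stores a one-byte id per value class and resolves it back
  // to a Class on read. If an id was written that the reader never
  // registered, the lookup yields null, and the constructor cache -- a
  // ConcurrentHashMap, which forbids null keys -- throws the NPE seen above.
  public class MapWritableNpeDemo {
    private static final Map<Byte, Class<?>> ID_TO_CLASS =
        new HashMap<Byte, Class<?>>();
    private static final ConcurrentHashMap<Class<?>, Object> CONSTRUCTOR_CACHE =
        new ConcurrentHashMap<Class<?>, Object>();

    public static void main(String[] args) {
      ID_TO_CLASS.put(Byte.valueOf((byte) 1), String.class); // only known id
      byte idReadFromDisk = 42;             // an id with no registered class
      Class<?> clazz = ID_TO_CLASS.get(Byte.valueOf(idReadFromDisk)); // null
      CONSTRUCTOR_CACHE.get(clazz);   // NullPointerException, as in the log
    }
  }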

> Kind regards,
> Martina
>



-- 
Doğacan Güney
