Thanks for the detailed analysis. I will take a look and get back to you.
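A note on the Injector stack trace quoted below: the NullPointerException is thrown by ConcurrentHashMap.get inside ReflectionUtils.newInstance, which suggests that MapWritable.readFields resolved a serialized class id to a null Class, i.e. the on-disk CrawlDatum metadata refers to a class id that is not registered. The sketch below is a toy model of that lookup chain, not actual Nutch/Hadoop code; all class and method names are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Toy model of the failure mode in the Injector stack trace: MapWritable
 * stores a one-byte class id per entry, readFields maps that id back to a
 * Class, and newInstance() then does a ConcurrentHashMap lookup keyed by
 * that Class. If the on-disk id is unknown (e.g. because the metadata
 * bytes are corrupt), the id -> Class lookup yields null, and
 * ConcurrentHashMap.get(null) throws the NullPointerException seen in
 * the log. Names here are illustrative, not Hadoop's real ones.
 */
public class MapWritableNpeDemo {
    private static final Map<Byte, Class<?>> ID_TO_CLASS = new HashMap<>();
    private static final ConcurrentHashMap<Class<?>, Object> CONSTRUCTOR_CACHE =
            new ConcurrentHashMap<>();

    static {
        // Only ids 1 and 2 are "registered", as if written by a healthy job.
        ID_TO_CLASS.put((byte) 1, String.class);
        ID_TO_CLASS.put((byte) 2, Integer.class);
    }

    /** Mimics newInstance(): first step is a cache lookup keyed by the class. */
    static Object newInstance(Class<?> clazz) {
        return CONSTRUCTOR_CACHE.get(clazz); // NPE if clazz == null (null key)
    }

    /** Mimics readFields(): resolve the serialized id, then instantiate. */
    static Object readValue(byte serializedId) {
        Class<?> clazz = ID_TO_CLASS.get(serializedId); // null for unknown ids
        return newInstance(clazz);
    }

    public static void main(String[] args) {
        readValue((byte) 1); // fine: id is registered
        try {
            readValue((byte) 9); // corrupt/unknown id
        } catch (NullPointerException expected) {
            System.out.println("NPE on unknown class id, as in the Injector log");
        }
    }
}
```

If this model matches what is happening, the corruption would already be present in the SequenceFile when updatedb writes the new crawldb, and the Injector merely trips over it on the next read.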

On Mon, Feb 16, 2009 at 13:41, Koch Martina <k...@huberverlag.de> wrote:
> Hi,
>
> sorry for the late reply. We did some further digging and found that the 
> error has nothing to do with Fetcher or Fetcher2. When using Fetcher, the 
> error just happens much later (after about 20 fetch cycles).
> We did many test runs, eliminated as many plugins as possible, and identified 
> the URLs that are most likely to fail.
> With the following configuration we get a corrupt crawldb after two Fetcher2 
> cycles:
> - activated plugins: protocol-http, parse-html, feed
> - generate.max.per.host - 100
> - URLs to fetch:
> http://www.prosieben.de/service/newsflash/
> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/13161/Berlin-Today-Award-fuer-Indien/news_details/4249
> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/6186/Ein-Kreuz-fuer-Orlando/news_details/4239
> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/7622/Hermione-fliegt-nach-Amerika/news_details/4238
> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/9276/Auf-zum-zweiten-Zickenkrieg/news_details/4241
> http://www.prosieben.de/kino_dvd/news/60897/
> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/16567/Bitte-um-mehr-Aufmerksamkeit/news_details/4278
> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2374/Unschuldig-im-Knast/news_details/4268
> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2936/Aus-fuer-Nachwuchsfilmer/news_details/4279
> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/58906/David-Kross-schie-t-hoch/news_details/4267
> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/908/Cate-Blanchett-wird-Maid-Marian/news_details/4259
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60910/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60958/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60959/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60998/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61000/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61050/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61085/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61087/
> http://www.prosieben.de/spielfilm_serie/topstories/61051/
> http://www.prosieben.de/kino_dvd/news/60897/
>
> When starting from a higher-level URL like http://www.prosieben.de, these URLs 
> produce the following warning after some fetch cycles:
> WARN  parse.ParseOutputFormat - Can't read fetch time for: 
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
> But the crawldb does not get corrupted immediately after the first occurrence 
> of such messages; it gets corrupted some cycles later.
>
> Any suggestions are highly appreciated.
> Something seems to go wrong with the feed plugin, but I can't diagnose 
> exactly when and why...
>
> Thanks in advance.
>
> Kind regards,
> Martina
>
>
>
> -----Original Message-----
> From: Doğacan Güney [mailto:doga...@gmail.com]
> Sent: Friday, 13 February 2009 09:37
> To: nutch-user@lucene.apache.org
> Subject: Re: Fetcher2 crashes with current trunk
>
> On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina <k...@huberverlag.de> wrote:
>> Hi all,
>>
>> we use the current trunk as of 2009-02-04 with the patch for CrawlDbMerger 
>> (NUTCH-683) applied manually.
>> We're doing an inject - generate - fetch - parse - updatedb - invertlinks 
>> cycle at depth 1.
>> When we use Fetcher2, we can run this cycle four times in a row without any 
>> problems. When we start the fifth cycle, the Injector crashes with the 
>> following error log:
>>
>> 2009-02-12 00:00:05,015 INFO  crawl.Injector - Injector: Merging injected 
>> urls into crawl db.
>> 2009-02-12 00:00:05,023 INFO  jvm.JvmMetrics - Cannot initialize JVM Metrics 
>> with processName=JobTracker, sessionId= - already initialized
>> 2009-02-12 00:00:05,358 INFO  mapred.FileInputFormat - Total input paths to 
>> process : 2
>> 2009-02-12 00:00:05,524 INFO  mapred.JobClient - Running job: job_local_0002
>> 2009-02-12 00:00:05,528 INFO  mapred.FileInputFormat - Total input paths to 
>> process : 2
>> 2009-02-12 00:00:05,553 INFO  mapred.MapTask - numReduceTasks: 1
>> 2009-02-12 00:00:05,554 INFO  mapred.MapTask - io.sort.mb = 100
>> 2009-02-12 00:00:05,828 INFO  mapred.MapTask - data buffer = 
>> 79691776/99614720
>> 2009-02-12 00:00:05,828 INFO  mapred.MapTask - record buffer = 262144/327680
>> 2009-02-12 00:00:06,538 INFO  mapred.JobClient -  map 0% reduce 0%
>> 2009-02-12 00:00:07,262 WARN  mapred.LocalJobRunner - job_local_0002
>> java.lang.RuntimeException: java.lang.NullPointerException
>>       at 
>> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
>>       at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
>>       at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
>>       at 
>> org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>       at 
>> org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>       at 
>> org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>>       at 
>> org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>>       at 
>> org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>>       at 
>> org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>>       at 
>> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
>>       at 
>> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
>>       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>>       at 
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>> Caused by: java.lang.NullPointerException
>>       at 
>> java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>>       at 
>> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
>>       ... 13 more
>> 2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: 
>> java.io.IOException: Job failed!
>>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>>       at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
>>       at org.apache.nutch.crawl.Injector.run(Injector.java:190)
>>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>       at org.apache.nutch.crawl.Injector.main(Injector.java:180)
>>
>> After that the crawldb is broken and can't be accessed, e.g. with the readdb 
>> <crawldb> -stats command.
>> When we use Fetcher instead of Fetcher2 for exactly the same task, we can run 
>> as many cycles as we like without any problems or crashes.
>>
>> Besides this error, we've observed that the fetch cycle with Fetcher is about 
>> twice as fast as with Fetcher2, although we use exactly the same settings in 
>> nutch-site.xml:
>> generate.max.per.host  - 100
>> fetcher.threads.per.host - 1
>> fetcher.server.delay - 0
>> for an initial url list with 30 URLs of different hosts.
>>
>> Has anybody observed similar errors or performance issues?
>>
>
> Fetcher vs. Fetcher2 performance is a confusing issue. There have been 
> reports of each being faster than the other. Fetcher2 has a much more 
> flexible and smarter architecture than Fetcher, so I can only think that 
> some sort of bug in Fetcher2 degrades performance.
>
> However, your other problem (the Fetcher2 crash) is very weird. I went 
> through the Fetcher and Fetcher2 code, and there is nothing different 
> between them that would make one work and the other fail. Does this error 
> happen consistently if you try again with Fetcher2 from scratch?
>
>> Kind regards,
>> Martina
>>
>
>
>
> --
> Doğacan Güney
>



-- 
Doğacan Güney
