2009/2/17 Sami Siren <ssi...@gmail.com>:
> Do we have a Jira issue for this? It seems like a blocker for 1.0 to me if it
> is reproducible.
>
No, we don't. But you are right that we should. I have been very busy and
forgot about it. I will examine this problem in more detail tomorrow and
will open an issue if I can reproduce the bug.

> --
> Sami Siren
>
>
> Doğacan Güney wrote:
>>
>> Thanks for the detailed analysis. I will take a look and get back to you.
>>
>> On Mon, Feb 16, 2009 at 13:41, Koch Martina <k...@huberverlag.de> wrote:
>>
>>>
>>> Hi,
>>>
>>> sorry for the late reply. We did some further digging and found that the
>>> error has nothing to do with Fetcher or Fetcher2. When using Fetcher, the
>>> error just happens much later (after about 20 fetch cycles).
>>> We did many test runs, eliminated as many plugins as possible, and
>>> identified the URLs most likely to fail.
>>> With the following configuration we get a corrupt crawldb after two
>>> Fetcher2 cycles:
>>> - activated plugins: protocol-http, parse-html, feed
>>> - generate.max.per.host - 100
>>> - URLs to fetch:
>>> http://www.prosieben.de/service/newsflash/
>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/13161/Berlin-Today-Award-fuer-Indien/news_details/4249
>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/6186/Ein-Kreuz-fuer-Orlando/news_details/4239
>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/7622/Hermione-fliegt-nach-Amerika/news_details/4238
>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/9276/Auf-zum-zweiten-Zickenkrieg/news_details/4241
>>> http://www.prosieben.de/kino_dvd/news/60897/
>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/16567/Bitte-um-mehr-Aufmerksamkeit/news_details/4278
>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2374/Unschuldig-im-Knast/news_details/4268
>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2936/Aus-fuer-Nachwuchsfilmer/news_details/4279
>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/58906/David-Kross-schie-t-hoch/news_details/4267
>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/908/Cate-Blanchett-wird-Maid-Marian/news_details/4259
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60910/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60958/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60959/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60998/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61000/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61050/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61085/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61087/
>>> http://www.prosieben.de/spielfilm_serie/topstories/61051/
>>> http://www.prosieben.de/kino_dvd/news/60897/
>>>
>>> When starting from a higher-level URL like http://www.prosieben.de, these
>>> URLs produce the following warning after some fetch cycles:
>>> WARN parse.ParseOutputFormat - Can't read fetch time for:
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
>>> But the crawldb does not become corrupt immediately after the first
>>> occurrence of such messages; it gets corrupted some cycles later.
>>>
>>> Any suggestions are highly appreciated.
>>> Something seems to go wrong with the feed plugin, but I can't diagnose
>>> exactly when and why...
>>>
>>> Thanks in advance.
>>>
>>> Kind regards,
>>> Martina
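For readers digging into the same symptom: the warning quoted above comes
out of org.apache.nutch.parse.ParseOutputFormat, which tries to recover the
page's fetch time from the segment's content metadata (where the fetcher
records it) and falls back to the current time when it cannot. The
stand-alone sketch below paraphrases that logic from memory of the
0.9/1.0-era source; the key name "_ftk_" (Nutch.FETCH_TIME_KEY) and the
exact control flow are assumptions, not a quote of the real class. It would
also fit Martina's feed-plugin suspicion: parse entries the feed plugin
synthesizes for individual feed items never pass through the fetcher, so no
fetch time would be recorded for them.

    import java.util.HashMap;
    import java.util.Map;

    /**
     * Stand-alone sketch of the ParseOutputFormat fragment that logs
     * "Can't read fetch time for: <url>". Paraphrased from memory of the
     * 0.9/1.0-era Nutch source; treat the key and details as assumptions.
     */
    public class FetchTimeWarningSketch {

      // The fetcher is believed to record the fetch time in the content
      // metadata under Nutch.FETCH_TIME_KEY (the literal "_ftk_").
      static final String FETCH_TIME_KEY = "_ftk_";

      static long readFetchTime(Map<String, String> contentMeta, String url) {
        try {
          // Entries that never went through the fetcher -- e.g. items the
          // feed plugin synthesizes from a single feed page -- have no such
          // metadata, so parseLong(null) throws and we land in the catch.
          return Long.parseLong(contentMeta.get(FETCH_TIME_KEY));
        } catch (Exception e) {
          System.err.println(
              "WARN parse.ParseOutputFormat - Can't read fetch time for: " + url);
          return System.currentTimeMillis();
        }
      }

      public static void main(String[] args) {
        Map<String, String> meta = new HashMap<String, String>(); // no fetch time recorded
        readFetchTime(meta,
            "http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/");
      }
    }

If that is what happens here, the warning itself is benign (the code
recovers with the current time), and the later corruption would have to
come from somewhere else, which matches Martina's observation that the
crawldb only breaks some cycles after the first warning appears.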
>>>
>>>
>>> -----Original Message-----
>>> From: Doğacan Güney [mailto:doga...@gmail.com]
>>> Sent: Friday, February 13, 2009 09:37
>>> To: nutch-user@lucene.apache.org
>>> Subject: Re: Fetcher2 crashes with current trunk
>>>
>>> On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina <k...@huberverlag.de> wrote:
>>>
>>>>
>>>> Hi all,
>>>>
>>>> we use the current trunk as of 2009-02-04 with the patch for
>>>> CrawlDbMerger (NUTCH-683) applied manually.
>>>> We're doing an inject - generate - fetch - parse - updatedb - invertlinks
>>>> cycle at depth 1.
>>>> When we use Fetcher2, we can run this cycle four times in a row without
>>>> any problems. When we start the fifth cycle, the Injector crashes with
>>>> the following error log:
>>>>
>>>> 2009-02-12 00:00:05,015 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
>>>> 2009-02-12 00:00:05,023 INFO jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
>>>> 2009-02-12 00:00:05,358 INFO mapred.FileInputFormat - Total input paths to process : 2
>>>> 2009-02-12 00:00:05,524 INFO mapred.JobClient - Running job: job_local_0002
>>>> 2009-02-12 00:00:05,528 INFO mapred.FileInputFormat - Total input paths to process : 2
>>>> 2009-02-12 00:00:05,553 INFO mapred.MapTask - numReduceTasks: 1
>>>> 2009-02-12 00:00:05,554 INFO mapred.MapTask - io.sort.mb = 100
>>>> 2009-02-12 00:00:05,828 INFO mapred.MapTask - data buffer = 79691776/99614720
>>>> 2009-02-12 00:00:05,828 INFO mapred.MapTask - record buffer = 262144/327680
>>>> 2009-02-12 00:00:06,538 INFO mapred.JobClient - map 0% reduce 0%
>>>> 2009-02-12 00:00:07,262 WARN mapred.LocalJobRunner - job_local_0002
>>>> java.lang.RuntimeException: java.lang.NullPointerException
>>>>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
>>>>     at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
>>>>     at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
>>>>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>>     at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>>>>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>>>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>>>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
>>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>>>> Caused by: java.lang.NullPointerException
>>>>     at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>>>>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
>>>>     ... 13 more
>>>> 2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: java.io.IOException: Job failed!
>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>>>>     at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
>>>>     at org.apache.nutch.crawl.Injector.run(Injector.java:190)
>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>     at org.apache.nutch.crawl.Injector.main(Injector.java:180)
>>>>
>>>> After that the crawldb is broken and can't be accessed, e.g. with the
>>>> readdb <crawldb> -stats command.
>>>> When we use Fetcher instead of Fetcher2 for exactly the same task, we
>>>> can run as many cycles as we like without any problems or crashes.
>>>>
>>>> Besides this error, we've observed that the fetch cycle with Fetcher is
>>>> about twice as fast as with Fetcher2, although we use exactly the same
>>>> settings in nutch-site.xml:
>>>> generate.max.per.host - 100
>>>> fetcher.threads.per.host - 1
>>>> fetcher.server.delay - 0
>>>> for an initial URL list of 30 URLs from different hosts.
>>>>
>>>> Has anybody observed similar errors or performance issues?
>>>>
>>>
>>> Fetcher vs. Fetcher2 performance is a confusing issue: there have been
>>> reports of each being faster than the other. Fetcher2 has a much more
>>> flexible and smarter architecture than Fetcher, so I can only think that
>>> some bug in Fetcher2 degrades its performance.
>>>
>>> However, your other problem (the Fetcher2 crash) is very weird. I went
>>> through the Fetcher and Fetcher2 code, and there is nothing different
>>> between them that would make one work and the other fail. Does this
>>> error happen consistently if you try it again with Fetcher2 from scratch?
>>>
>>>>
>>>> Kind regards,
>>>> Martina
>>>>
>>>
>>> --
>>> Doğacan Güney
>>>
>>
>>
>>
>
>
--
Doğacan Güney
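A note on what the stack trace in Martina's report actually says: the
"Caused by" frames show ConcurrentHashMap.get() throwing inside
ReflectionUtils.newInstance(), which is what happens when the class handed
to newInstance() is null, i.e. when MapWritable.readFields() read a
per-entry class id from the stream that was never registered in its
id-to-class table. That is the typical signature of a CrawlDatum record
whose metadata bytes are out of sync with what the reader expects, so the
crawldb is already corrupt on disk by the time the next Injector run tries
to merge into it. Below is a minimal, runnable sketch of that mechanism,
using illustrative stand-ins for the Hadoop-0.19-era internals rather than
the real Hadoop classes:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class MapWritableNpeSketch {

      // Stand-in for ReflectionUtils' constructor cache. ConcurrentHashMap
      // rejects null keys: get(null) throws NullPointerException.
      static final ConcurrentHashMap<Class<?>, Object> CONSTRUCTOR_CACHE =
          new ConcurrentHashMap<Class<?>, Object>();

      // Stand-in for AbstractMapWritable's id-to-class table. Well-known
      // Writables get fixed ids; others are registered when first written.
      static final Map<Byte, Class<?>> ID_TO_CLASS = new HashMap<Byte, Class<?>>();
      static {
        ID_TO_CLASS.put((byte) 1, String.class); // one registered id
      }

      // Mimics ReflectionUtils.newInstance(theClass, conf): the first thing
      // it does is a cache lookup keyed by the class object itself.
      static Object newInstance(Class<?> theClass) {
        return CONSTRUCTOR_CACHE.get(theClass); // NPE here when theClass == null
      }

      public static void main(String[] args) {
        // Mimics MapWritable.readFields(): read a one-byte class id per
        // entry and resolve it. A corrupt record yields an id that was
        // never registered, so the lookup quietly returns null ...
        byte idFromCorruptRecord = 42;
        Class<?> clazz = ID_TO_CLASS.get(idFromCorruptRecord); // -> null

        // ... and the NPE only surfaces one frame deeper, inside
        // newInstance(), matching the "Caused by" section of the trace.
        newInstance(clazz);
      }
    }

Under that reading, the Injector is only the messenger: whichever earlier
job wrote the malformed metadata (updatedb after a feed-heavy parse, if
Martina's suspicion about the feed plugin is right) is where the bug lives.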