I think I have found the bug here, but I am in a hurry right now; I will create a JIRA issue and post (what is hopefully) the fix later today.
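For the record, my current reading of the bottom of that stack trace (I will verify before posting the patch): ConcurrentHashMap.get() throws a NullPointerException when handed a null key, and ReflectionUtils.newInstance() looks its constructors up in a ConcurrentHashMap keyed by class. So the NPE means MapWritable.readFields() passed a null class to newInstance(), i.e. it read a value-class id from the serialized CrawlDatum metadata that it could not resolve, which points to corrupted metadata in the crawldb rather than a Hadoop problem. A minimal sketch of that failing pattern (illustrative only: the class and field names below are made up, this is not the actual Hadoop code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: mimics the "byte id -> value class -> newInstance"
// pattern that MapWritable-style deserialization uses on read. Names are
// invented for the example; this is not the real Hadoop source.
public class MetadataReadSketch {

    // stands in for ReflectionUtils' constructor cache (a ConcurrentHashMap keyed by class)
    private static final Map<Class<?>, Object> CONSTRUCTOR_CACHE =
        new ConcurrentHashMap<Class<?>, Object>();

    // stands in for the id -> value-class table the map keeps for its entries
    private static final Map<Byte, Class<?>> ID_TO_CLASS =
        new ConcurrentHashMap<Byte, Class<?>>();

    static {
        ID_TO_CLASS.put((byte) 1, String.class); // the only id this reader knows about
    }

    static Object newInstance(Class<?> clazz) {
        // ConcurrentHashMap.get(null) throws NullPointerException, which is exactly
        // the innermost frame of the reported trace (ConcurrentHashMap.get:768)
        Object cached = CONSTRUCTOR_CACHE.get(clazz);
        // ... constructor lookup and instantiation would follow here ...
        return cached;
    }

    public static void main(String[] args) {
        // an id the reader has never registered, e.g. because the record is corrupt
        byte idFromStream = 42;
        Class<?> valueClass = ID_TO_CLASS.get(idFromStream); // resolves to null
        newInstance(valueClass); // throws NullPointerException, as in the Injector log
    }
}

If that reading is correct, the real question is what writes a metadata entry whose class id can no longer be resolved on read; given Martina's observation below, the feed plugin is my main suspect.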
On Tue, Feb 17, 2009 at 21:39, Doğacan Güney <doga...@gmail.com> wrote:
> 2009/2/17 Sami Siren <ssi...@gmail.com>:
>> Do we have a Jira issue for this? It seems like a blocker for 1.0 to me if
>> it is reproducible.
>
> No, we don't. But you are right that we should. I am very busy and I forgot
> about it. I will examine this problem in more detail tomorrow and will open
> an issue if I can reproduce the bug.
>
>> --
>> Sami Siren
>>
>> Doğacan Güney wrote:
>>>
>>> Thanks for the detailed analysis. I will take a look and get back to you.
>>>
>>> On Mon, Feb 16, 2009 at 13:41, Koch Martina <k...@huberverlag.de> wrote:
>>>>
>>>> Hi,
>>>>
>>>> sorry for the late reply. We did some further digging and found that the
>>>> error has nothing to do with whether we use Fetcher or Fetcher2: when
>>>> using Fetcher, the error just happens much later (after about 20 fetch cycles).
>>>> We did many test runs, eliminated as many plugins as possible and
>>>> identified the URLs that are most likely to fail.
>>>> With the following configuration we get a corrupt crawldb after two
>>>> Fetcher2 cycles:
>>>> - activated plugins: protocol-http, parse-html, feed
>>>> - generate.max.per.host - 100
>>>> - URLs to fetch:
>>>> http://www.prosieben.de/service/newsflash/
>>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/13161/Berlin-Today-Award-fuer-Indien/news_details/4249
>>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/6186/Ein-Kreuz-fuer-Orlando/news_details/4239
>>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/7622/Hermione-fliegt-nach-Amerika/news_details/4238
>>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/9276/Auf-zum-zweiten-Zickenkrieg/news_details/4241
>>>> http://www.prosieben.de/kino_dvd/news/60897/
>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/16567/Bitte-um-mehr-Aufmerksamkeit/news_details/4278
>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2374/Unschuldig-im-Knast/news_details/4268
>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2936/Aus-fuer-Nachwuchsfilmer/news_details/4279
>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/58906/David-Kross-schie-t-hoch/news_details/4267
>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/908/Cate-Blanchett-wird-Maid-Marian/news_details/4259
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60910/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60958/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60959/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60998/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61000/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61050/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61085/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61087/
>>>> http://www.prosieben.de/spielfilm_serie/topstories/61051/
>>>> http://www.prosieben.de/kino_dvd/news/60897/
>>>>
>>>> When starting from a higher-level URL like http://www.prosieben.de, these
>>>> URLs get the following warning after some fetch cycles:
>>>> WARN parse.ParseOutputFormat - Can't read fetch time for:
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
>>>> But the crawldb does not get corrupted immediately after the first
>>>> occurrence of such messages; it gets corrupted some cycles later.
>>>>
>>>> Any suggestions are highly appreciated.
>>>> Something seems to go wrong with the feed plugin, but I can't diagnose
>>>> exactly when and why...
>>>>
>>>> Thanks in advance.
>>>>
>>>> Kind regards,
>>>> Martina
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Doğacan Güney [mailto:doga...@gmail.com]
>>>> Sent: Friday, February 13, 2009 09:37
>>>> To: nutch-user@lucene.apache.org
>>>> Subject: Re: Fetcher2 crashes with current trunk
>>>>
>>>> On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina <k...@huberverlag.de> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> we use the current trunk as of 04.02.09 with the patch for CrawlDbMerger
>>>>> (NUTCH-683) applied manually.
>>>>> We're doing an inject - generate - fetch - parse - updatedb - invertlinks
>>>>> cycle at depth 1.
>>>>> When we use Fetcher2, we can do this cycle four times in a row without
>>>>> any problems. If we start the fifth cycle, the Injector crashes with the
>>>>> following error log:
>>>>>
>>>>> 2009-02-12 00:00:05,015 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
>>>>> 2009-02-12 00:00:05,023 INFO jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
>>>>> 2009-02-12 00:00:05,358 INFO mapred.FileInputFormat - Total input paths to process : 2
>>>>> 2009-02-12 00:00:05,524 INFO mapred.JobClient - Running job: job_local_0002
>>>>> 2009-02-12 00:00:05,528 INFO mapred.FileInputFormat - Total input paths to process : 2
>>>>> 2009-02-12 00:00:05,553 INFO mapred.MapTask - numReduceTasks: 1
>>>>> 2009-02-12 00:00:05,554 INFO mapred.MapTask - io.sort.mb = 100
>>>>> 2009-02-12 00:00:05,828 INFO mapred.MapTask - data buffer = 79691776/99614720
>>>>> 2009-02-12 00:00:05,828 INFO mapred.MapTask - record buffer = 262144/327680
>>>>> 2009-02-12 00:00:06,538 INFO mapred.JobClient - map 0% reduce 0%
>>>>> 2009-02-12 00:00:07,262 WARN mapred.LocalJobRunner - job_local_0002
>>>>> java.lang.RuntimeException: java.lang.NullPointerException
>>>>>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
>>>>>     at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
>>>>>     at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
>>>>>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>>>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>>>     at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>>>>>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>>>>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>>>>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>>>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
>>>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
>>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>>>>> Caused by: java.lang.NullPointerException
>>>>>     at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>>>>>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
>>>>>     ... 13 more
>>>>> 2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: java.io.IOException: Job failed!
>>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>>>>>     at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
>>>>>     at org.apache.nutch.crawl.Injector.run(Injector.java:190)
>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>     at org.apache.nutch.crawl.Injector.main(Injector.java:180)
>>>>>
>>>>> After that, the crawldb is broken and can't be accessed, e.g. with the
>>>>> readdb <crawldb> -stats command.
>>>>> When we use Fetcher instead of Fetcher2 for exactly the same task, we can
>>>>> do as many cycles as we like without any problems or crashes.
>>>>>
>>>>> Besides this error, we've observed that the fetch cycle with Fetcher is
>>>>> about twice as fast as with Fetcher2, although we use exactly the same
>>>>> settings in nutch-site.xml:
>>>>> generate.max.per.host - 100
>>>>> fetcher.threads.per.host - 1
>>>>> fetcher.server.delay - 0
>>>>> for an initial URL list with 30 URLs of different hosts.
>>>>>
>>>>> Has anybody observed similar errors or performance issues?
>>>>
>>>> Fetcher vs. Fetcher2 performance is a confusing issue. There have been
>>>> reports of each being faster than the other. Fetcher2 has a much more
>>>> flexible and smarter architecture than Fetcher, so I can only think that
>>>> some sort of bug in Fetcher2 degrades its performance.
>>>>
>>>> However, your other problem (the Fetcher2 crash) is very weird. I went
>>>> through the Fetcher and Fetcher2 code and there is nothing different in
>>>> them that would make one work and the other fail. Does this error happen
>>>> consistently if you try it again with Fetcher2 from scratch?
>>>>
>>>>> Kind regards,
>>>>> Martina
>>>>
>>>> --
>>>> Doğacan Güney
>>>
>>
>
> --
> Doğacan Güney

--
Doğacan Güney