I think I have found the bug here, but I am in a hurry right now; I will create a JIRA issue and post (what is hopefully) the fix later today.
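For the record, my current reading of the bottom of that stack trace (I will verify before posting the patch): ConcurrentHashMap.get() throws a NullPointerException when handed a null key, and ReflectionUtils.newInstance() looks its constructors up in a ConcurrentHashMap keyed by class. So the NPE means MapWritable.readFields() passed a null class to newInstance(), i.e. it read a value-class id from the serialized CrawlDatum metadata that it could not resolve, which points to corrupted metadata in the crawldb rather than a Hadoop problem. A minimal sketch of that failing pattern (illustrative only: the class and field names below are made up, this is not the actual Hadoop code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: mimics the "byte id -> value class -> newInstance"
// pattern that MapWritable-style deserialization uses on read. Names are
// invented for the example; this is not the real Hadoop source.
public class MetadataReadSketch {

    // stands in for ReflectionUtils' constructor cache (a ConcurrentHashMap keyed by class)
    private static final Map<Class<?>, Object> CONSTRUCTOR_CACHE =
        new ConcurrentHashMap<Class<?>, Object>();

    // stands in for the id -> value-class table the map keeps for its entries
    private static final Map<Byte, Class<?>> ID_TO_CLASS =
        new ConcurrentHashMap<Byte, Class<?>>();

    static {
        ID_TO_CLASS.put((byte) 1, String.class); // the only id this reader knows about
    }

    static Object newInstance(Class<?> clazz) {
        // ConcurrentHashMap.get(null) throws NullPointerException, which is exactly
        // the innermost frame of the reported trace (ConcurrentHashMap.get:768)
        Object cached = CONSTRUCTOR_CACHE.get(clazz);
        // ... constructor lookup and instantiation would follow here ...
        return cached;
    }

    public static void main(String[] args) {
        // an id the reader has never registered, e.g. because the record is corrupt
        byte idFromStream = 42;
        Class<?> valueClass = ID_TO_CLASS.get(idFromStream); // resolves to null
        newInstance(valueClass); // throws NullPointerException, as in the Injector log
    }
}

If that reading is correct, the real question is what writes a metadata entry whose class id can no longer be resolved on read; given Martina's observation below, the feed plugin is my main suspect.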
On Tue, Feb 17, 2009 at 21:39, Doğacan Güney <doga...@gmail.com> wrote:
> 2009/2/17 Sami Siren <ssi...@gmail.com>:
>> Do we have a Jira issue for this? It seems like a blocker for 1.0 to me if
>> it is reproducible.
>
> No, we don't. But you are right that we should. I am very busy and I forgot
> about it. I will examine this problem in more detail tomorrow and will open
> an issue if I can reproduce the bug.
>
>> --
>> Sami Siren
>>
>> Doğacan Güney wrote:
>>>
>>> Thanks for the detailed analysis. I will take a look and get back to you.
>>>
>>> On Mon, Feb 16, 2009 at 13:41, Koch Martina <k...@huberverlag.de> wrote:
>>>>
>>>> Hi,
>>>>
>>>> sorry for the late reply. We did some further digging and found that the
>>>> error has nothing to do with whether we use Fetcher or Fetcher2: when
>>>> using Fetcher, the error just happens much later (after about 20 fetch cycles).
>>>> We did many test runs, eliminated as many plugins as possible and
>>>> identified the URLs that are most likely to fail.
>>>> With the following configuration we get a corrupt crawldb after two
>>>> Fetcher2 cycles:
>>>> - activated plugins: protocol-http, parse-html, feed
>>>> - generate.max.per.host - 100
>>>> - URLs to fetch:
>>>> http://www.prosieben.de/service/newsflash/
>>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/13161/Berlin-Today-Award-fuer-Indien/news_details/4249
>>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/6186/Ein-Kreuz-fuer-Orlando/news_details/4239
>>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/7622/Hermione-fliegt-nach-Amerika/news_details/4238
>>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/9276/Auf-zum-zweiten-Zickenkrieg/news_details/4241
>>>> http://www.prosieben.de/kino_dvd/news/60897/
>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/16567/Bitte-um-mehr-Aufmerksamkeit/news_details/4278
>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2374/Unschuldig-im-Knast/news_details/4268
>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2936/Aus-fuer-Nachwuchsfilmer/news_details/4279
>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/58906/David-Kross-schie-t-hoch/news_details/4267
>>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/908/Cate-Blanchett-wird-Maid-Marian/news_details/4259
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60910/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60958/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60959/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60998/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61000/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61050/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61085/
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61087/
>>>> http://www.prosieben.de/spielfilm_serie/topstories/61051/
>>>> http://www.prosieben.de/kino_dvd/news/60897/
>>>>
>>>> When starting from a higher-level URL like http://www.prosieben.de, these
>>>> URLs get the following warning after some fetch cycles:
>>>> WARN parse.ParseOutputFormat - Can't read fetch time for:
>>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
>>>> But the crawldb does not get corrupted immediately after the first
>>>> occurrence of such messages; it gets corrupted some cycles later.
>>>>
>>>> Any suggestions are highly appreciated.
>>>> Something seems to go wrong with the feed plugin, but I can't diagnose
>>>> exactly when and why...
>>>>
>>>> Thanks in advance.
>>>>
>>>> Kind regards,
>>>> Martina
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Doğacan Güney [mailto:doga...@gmail.com]
>>>> Sent: Friday, February 13, 2009 09:37
>>>> To: nutch-user@lucene.apache.org
>>>> Subject: Re: Fetcher2 crashes with current trunk
>>>>
>>>> On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina <k...@huberverlag.de> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> we use the current trunk as of 04.02.09 with the patch for CrawlDbMerger
>>>>> (NUTCH-683) applied manually.
>>>>> We're doing an inject - generate - fetch - parse - updatedb - invertlinks
>>>>> cycle at depth 1.
>>>>> When we use Fetcher2, we can do this cycle four times in a row without
>>>>> any problems. If we start the fifth cycle, the Injector crashes with the
>>>>> following error log:
>>>>>
>>>>> 2009-02-12 00:00:05,015 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
>>>>> 2009-02-12 00:00:05,023 INFO jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
>>>>> 2009-02-12 00:00:05,358 INFO mapred.FileInputFormat - Total input paths to process : 2
>>>>> 2009-02-12 00:00:05,524 INFO mapred.JobClient - Running job: job_local_0002
>>>>> 2009-02-12 00:00:05,528 INFO mapred.FileInputFormat - Total input paths to process : 2
>>>>> 2009-02-12 00:00:05,553 INFO mapred.MapTask - numReduceTasks: 1
>>>>> 2009-02-12 00:00:05,554 INFO mapred.MapTask - io.sort.mb = 100
>>>>> 2009-02-12 00:00:05,828 INFO mapred.MapTask - data buffer = 79691776/99614720
>>>>> 2009-02-12 00:00:05,828 INFO mapred.MapTask - record buffer = 262144/327680
>>>>> 2009-02-12 00:00:06,538 INFO mapred.JobClient - map 0% reduce 0%
>>>>> 2009-02-12 00:00:07,262 WARN mapred.LocalJobRunner - job_local_0002
>>>>> java.lang.RuntimeException: java.lang.NullPointerException
>>>>>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
>>>>>     at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
>>>>>     at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
>>>>>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>>>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>>>     at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>>>>>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>>>>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>>>>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>>>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
>>>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
>>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>>>>> Caused by: java.lang.NullPointerException
>>>>>     at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>>>>>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
>>>>>     ... 13 more
>>>>> 2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: java.io.IOException: Job failed!
>>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>>>>>     at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
>>>>>     at org.apache.nutch.crawl.Injector.run(Injector.java:190)
>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>     at org.apache.nutch.crawl.Injector.main(Injector.java:180)
>>>>>
>>>>> After that, the crawldb is broken and can't be accessed, e.g. with the
>>>>> readdb <crawldb> -stats command.
>>>>> When we use Fetcher instead of Fetcher2 for exactly the same task, we can
>>>>> do as many cycles as we like without any problems or crashes.
>>>>>
>>>>> Besides this error, we've observed that the fetch cycle with Fetcher is
>>>>> about twice as fast as with Fetcher2, although we use exactly the same
>>>>> settings in nutch-site.xml:
>>>>> generate.max.per.host - 100
>>>>> fetcher.threads.per.host - 1
>>>>> fetcher.server.delay - 0
>>>>> for an initial URL list with 30 URLs of different hosts.
>>>>>
>>>>> Has anybody observed similar errors or performance issues?
>>>>
>>>> Fetcher vs. Fetcher2 performance is a confusing issue. There have been
>>>> reports of each being faster than the other. Fetcher2 has a much more
>>>> flexible and smarter architecture than Fetcher, so I can only think that
>>>> some sort of bug in Fetcher2 degrades its performance.
>>>>
>>>> However, your other problem (the Fetcher2 crash) is very weird. I went
>>>> through the Fetcher and Fetcher2 code and there is nothing different in
>>>> them that would make one work and the other fail. Does this error happen
>>>> consistently if you try it again with Fetcher2 from scratch?
>>>>
>>>>> Kind regards,
>>>>> Martina
>>>>
>>>> --
>>>> Doğacan Güney
>>>
>>
>
> --
> Doğacan Güney

--
Doğacan Güney