2009/2/17 Sami Siren <ssi...@gmail.com>:
> Do we have a Jira issue for this? It seems like a blocker for 1.0 to me if it
> is reproducible.
>
No, we don't. But you are right that we should. I have been very busy and
forgot about it. I will examine this problem in more detail tomorrow and
will open an issue if I can reproduce the bug.

> --
> Sami Siren
>
>
> Doğacan Güney wrote:
>>
>> Thanks for the detailed analysis. I will take a look and get back to you.
>>
>> On Mon, Feb 16, 2009 at 13:41, Koch Martina <k...@huberverlag.de> wrote:
>>
>>>
>>> Hi,
>>>
>>> sorry for the late reply. We did some further digging and found that the
>>> error has nothing to do with Fetcher or Fetcher2. When using Fetcher, the
>>> error just happens much later (after about 20 fetch cycles).
>>> We did many test runs, eliminated as many plugins as possible, and
>>> identified the URLs most likely to fail.
>>> With the following configuration we get a corrupt crawldb after two
>>> Fetcher2 cycles:
>>> - activated plugins: protocol-http, parse-html, feed
>>> - generate.max.per.host - 100
>>> - URLs to fetch:
>>> http://www.prosieben.de/service/newsflash/
>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/13161/Berlin-Today-Award-fuer-Indien/news_details/4249
>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/6186/Ein-Kreuz-fuer-Orlando/news_details/4239
>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/7622/Hermione-fliegt-nach-Amerika/news_details/4238
>>> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/9276/Auf-zum-zweiten-Zickenkrieg/news_details/4241
>>> http://www.prosieben.de/kino_dvd/news/60897/
>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/16567/Bitte-um-mehr-Aufmerksamkeit/news_details/4278
>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2374/Unschuldig-im-Knast/news_details/4268
>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2936/Aus-fuer-Nachwuchsfilmer/news_details/4279
>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/58906/David-Kross-schie-t-hoch/news_details/4267
>>> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/908/Cate-Blanchett-wird-Maid-Marian/news_details/4259
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60910/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60958/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60959/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60998/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61000/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61050/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61085/
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61087/
>>> http://www.prosieben.de/spielfilm_serie/topstories/61051/
>>> http://www.prosieben.de/kino_dvd/news/60897/
>>>
>>> When starting from a higher-level URL like http://www.prosieben.de, these
>>> URLs produce the following warning after some fetch cycles:
>>> WARN parse.ParseOutputFormat - Can't read fetch time for:
>>> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
>>> But the crawldb does not become corrupt immediately after the first
>>> occurrence of such messages; it gets corrupted some cycles later.
>>>
>>> Any suggestions are highly appreciated.
>>> Something seems to go wrong with the feed plugin, but I can't diagnose
>>> exactly when and why...
>>>
>>> Thanks in advance.
>>>
>>> Kind regards,
>>> Martina
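For readers digging into the same symptom: the warning quoted above comes
out of org.apache.nutch.parse.ParseOutputFormat, which tries to recover the
page's fetch time from the segment's content metadata (where the fetcher
records it) and falls back to the current time when it cannot. The
stand-alone sketch below paraphrases that logic from memory of the
0.9/1.0-era source; the key name "_ftk_" (Nutch.FETCH_TIME_KEY) and the
exact control flow are assumptions, not a quote of the real class. It would
also fit Martina's feed-plugin suspicion: parse entries the feed plugin
synthesizes for individual feed items never pass through the fetcher, so no
fetch time would be recorded for them.

    import java.util.HashMap;
    import java.util.Map;

    /**
     * Stand-alone sketch of the ParseOutputFormat fragment that logs
     * "Can't read fetch time for: <url>". Paraphrased from memory of the
     * 0.9/1.0-era Nutch source; treat the key and details as assumptions.
     */
    public class FetchTimeWarningSketch {

      // The fetcher is believed to record the fetch time in the content
      // metadata under Nutch.FETCH_TIME_KEY (the literal "_ftk_").
      static final String FETCH_TIME_KEY = "_ftk_";

      static long readFetchTime(Map<String, String> contentMeta, String url) {
        try {
          // Entries that never went through the fetcher -- e.g. items the
          // feed plugin synthesizes from a single feed page -- have no such
          // metadata, so parseLong(null) throws and we land in the catch.
          return Long.parseLong(contentMeta.get(FETCH_TIME_KEY));
        } catch (Exception e) {
          System.err.println(
              "WARN parse.ParseOutputFormat - Can't read fetch time for: " + url);
          return System.currentTimeMillis();
        }
      }

      public static void main(String[] args) {
        Map<String, String> meta = new HashMap<String, String>(); // no fetch time recorded
        readFetchTime(meta,
            "http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/");
      }
    }

If that is what happens here, the warning itself is benign (the code
recovers with the current time), and the later corruption would have to
come from somewhere else, which matches Martina's observation that the
crawldb only breaks some cycles after the first warning appears.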
>>>
>>>
>>> -----Original Message-----
>>> From: Doğacan Güney [mailto:doga...@gmail.com]
>>> Sent: Friday, February 13, 2009 09:37
>>> To: nutch-user@lucene.apache.org
>>> Subject: Re: Fetcher2 crashes with current trunk
>>>
>>> On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina <k...@huberverlag.de> wrote:
>>>
>>>>
>>>> Hi all,
>>>>
>>>> we use the current trunk as of 2009-02-04 with the patch for
>>>> CrawlDbMerger (NUTCH-683) applied manually.
>>>> We're doing an inject - generate - fetch - parse - updatedb - invertlinks
>>>> cycle at depth 1.
>>>> When we use Fetcher2, we can run this cycle four times in a row without
>>>> any problems. When we start the fifth cycle, the Injector crashes with
>>>> the following error log:
>>>>
>>>> 2009-02-12 00:00:05,015 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
>>>> 2009-02-12 00:00:05,023 INFO jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
>>>> 2009-02-12 00:00:05,358 INFO mapred.FileInputFormat - Total input paths to process : 2
>>>> 2009-02-12 00:00:05,524 INFO mapred.JobClient - Running job: job_local_0002
>>>> 2009-02-12 00:00:05,528 INFO mapred.FileInputFormat - Total input paths to process : 2
>>>> 2009-02-12 00:00:05,553 INFO mapred.MapTask - numReduceTasks: 1
>>>> 2009-02-12 00:00:05,554 INFO mapred.MapTask - io.sort.mb = 100
>>>> 2009-02-12 00:00:05,828 INFO mapred.MapTask - data buffer = 79691776/99614720
>>>> 2009-02-12 00:00:05,828 INFO mapred.MapTask - record buffer = 262144/327680
>>>> 2009-02-12 00:00:06,538 INFO mapred.JobClient - map 0% reduce 0%
>>>> 2009-02-12 00:00:07,262 WARN mapred.LocalJobRunner - job_local_0002
>>>> java.lang.RuntimeException: java.lang.NullPointerException
>>>>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
>>>>     at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
>>>>     at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
>>>>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>>     at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>>>>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>>>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>>>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
>>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
>>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>>>> Caused by: java.lang.NullPointerException
>>>>     at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>>>>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
>>>>     ... 13 more
>>>> 2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: java.io.IOException: Job failed!
>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>>>>     at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
>>>>     at org.apache.nutch.crawl.Injector.run(Injector.java:190)
>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>     at org.apache.nutch.crawl.Injector.main(Injector.java:180)
>>>>
>>>> After that the crawldb is broken and can't be accessed, e.g. with the
>>>> readdb <crawldb> -stats command.
>>>> When we use Fetcher instead of Fetcher2 for exactly the same task, we
>>>> can run as many cycles as we like without any problems or crashes.
>>>>
>>>> Besides this error, we've observed that the fetch cycle with Fetcher is
>>>> about twice as fast as with Fetcher2, although we use exactly the same
>>>> settings in nutch-site.xml:
>>>> generate.max.per.host - 100
>>>> fetcher.threads.per.host - 1
>>>> fetcher.server.delay - 0
>>>> for an initial URL list of 30 URLs from different hosts.
>>>>
>>>> Has anybody observed similar errors or performance issues?
>>>>
>>>
>>> Fetcher vs. Fetcher2 performance is a confusing issue: there have been
>>> reports of each being faster than the other. Fetcher2 has a much more
>>> flexible and smarter architecture than Fetcher, so I can only think that
>>> some bug in Fetcher2 degrades its performance.
>>>
>>> However, your other problem (the Fetcher2 crash) is very weird. I went
>>> through the Fetcher and Fetcher2 code, and there is nothing different
>>> between them that would make one work and the other fail. Does this
>>> error happen consistently if you try it again with Fetcher2 from scratch?
>>>
>>>>
>>>> Kind regards,
>>>> Martina
>>>>
>>>
>>> --
>>> Doğacan Güney
>>>
>>
>>
>>
>
>
--
Doğacan Güney
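A note on what the stack trace in Martina's report actually says: the
"Caused by" frames show ConcurrentHashMap.get() throwing inside
ReflectionUtils.newInstance(), which is what happens when the class handed
to newInstance() is null, i.e. when MapWritable.readFields() read a
per-entry class id from the stream that was never registered in its
id-to-class table. That is the typical signature of a CrawlDatum record
whose metadata bytes are out of sync with what the reader expects, so the
crawldb is already corrupt on disk by the time the next Injector run tries
to merge into it. Below is a minimal, runnable sketch of that mechanism,
using illustrative stand-ins for the Hadoop-0.19-era internals rather than
the real Hadoop classes:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class MapWritableNpeSketch {

      // Stand-in for ReflectionUtils' constructor cache. ConcurrentHashMap
      // rejects null keys: get(null) throws NullPointerException.
      static final ConcurrentHashMap<Class<?>, Object> CONSTRUCTOR_CACHE =
          new ConcurrentHashMap<Class<?>, Object>();

      // Stand-in for AbstractMapWritable's id-to-class table. Well-known
      // Writables get fixed ids; others are registered when first written.
      static final Map<Byte, Class<?>> ID_TO_CLASS = new HashMap<Byte, Class<?>>();
      static {
        ID_TO_CLASS.put((byte) 1, String.class); // one registered id
      }

      // Mimics ReflectionUtils.newInstance(theClass, conf): the first thing
      // it does is a cache lookup keyed by the class object itself.
      static Object newInstance(Class<?> theClass) {
        return CONSTRUCTOR_CACHE.get(theClass); // NPE here when theClass == null
      }

      public static void main(String[] args) {
        // Mimics MapWritable.readFields(): read a one-byte class id per
        // entry and resolve it. A corrupt record yields an id that was
        // never registered, so the lookup quietly returns null ...
        byte idFromCorruptRecord = 42;
        Class<?> clazz = ID_TO_CLASS.get(idFromCorruptRecord); // -> null

        // ... and the NPE only surfaces one frame deeper, inside
        // newInstance(), matching the "Caused by" section of the trace.
        newInstance(clazz);
      }
    }

Under that reading, the Injector is only the messenger: whichever earlier
job wrote the malformed metadata (updatedb after a feed-heavy parse, if
Martina's suspicion about the feed plugin is right) is where the bug lives.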