Thanks for the detailed analysis. I will take a look and get back to you.

On Mon, Feb 16, 2009 at 13:41, Koch Martina <k...@huberverlag.de> wrote:
> Hi,
>
> sorry for the late reply. We did some further digging and found that the
> error has nothing to do with Fetcher or Fetcher2. When using Fetcher, the
> error just happens much later (after about 20 fetch cycles).
> We did many test runs, eliminated as many plugins as possible, and
> identified the URLs most likely to fail.
> With the following configuration we get a corrupt crawldb after two
> fetch2 cycles:
> - activated plugins: protocol-http, parse-html, feed
> - generate.max.per.host - 100
> - URLs to fetch:
> http://www.prosieben.de/service/newsflash/
> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/13161/Berlin-Today-Award-fuer-Indien/news_details/4249
> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/6186/Ein-Kreuz-fuer-Orlando/news_details/4239
> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/7622/Hermione-fliegt-nach-Amerika/news_details/4238
> http://www.prosieben.de/kino_dvd/kino/filme/archiv/movies/9276/Auf-zum-zweiten-Zickenkrieg/news_details/4241
> http://www.prosieben.de/kino_dvd/news/60897/
> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/16567/Bitte-um-mehr-Aufmerksamkeit/news_details/4278
> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2374/Unschuldig-im-Knast/news_details/4268
> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/2936/Aus-fuer-Nachwuchsfilmer/news_details/4279
> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/58906/David-Kross-schie-t-hoch/news_details/4267
> http://www.prosieben.de/kino_dvd/stars/starportraits/archiv/persons/908/Cate-Blanchett-wird-Maid-Marian/news_details/4259
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60910/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60958/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60959/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60998/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61000/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61050/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61085/
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/61087/
> http://www.prosieben.de/spielfilm_serie/topstories/61051/
> http://www.prosieben.de/kino_dvd/news/60897/
>
> When starting from a higher-level URL like http://www.prosieben.de, these
> URLs produce the following warning after some fetch cycles:
> WARN parse.ParseOutputFormat - Can't read fetch time for:
> http://www.prosieben.de/lifestyle_magazine/vips/klatsch/artikel/60881/
> But the crawldb does not get corrupted immediately after the first
> occurrence of such messages; it gets corrupted some cycles later.
>
> Any suggestions are highly appreciated.
> Something seems to go wrong with the feed plugin, but I can't diagnose
> exactly when and why...
>
> Thanks in advance.
>
> Kind regards,
> Martina
>
>
> -----Original Message-----
> From: Doğacan Güney [mailto:doga...@gmail.com]
> Sent: Friday, February 13, 2009 09:37
> To: nutch-user@lucene.apache.org
> Subject: Re: Fetcher2 crashes with current trunk
>
> On Thu, Feb 12, 2009 at 5:16 PM, Koch Martina <k...@huberverlag.de> wrote:
>> Hi all,
>>
>> we use the current trunk of 2009-02-04 with the patch for CrawlDbMerger
>> (NUTCH-683) manually applied.
>> We're doing an inject - generate - fetch - parse - updatedb - invertlinks
>> cycle at depth 1.
>> When we use Fetcher2, we can do this cycle four times in a row without
>> any problems.
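For readers following along, the depth-1 cycle described in the report maps onto the standard `bin/nutch` subcommands. A sketch, assuming a local `crawl/` directory layout and a seed list in `urls/` (the segment path and `fetcher.parse` setting are illustrative; the exact Fetcher2 invocation depends on the trunk revision in use):

```shell
# One inject - generate - fetch - parse - updatedb - invertlinks cycle
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
SEGMENT=crawl/segments/$(ls crawl/segments | tail -1)   # newest segment
bin/nutch fetch $SEGMENT        # with fetcher.parse=false, parsing is separate
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
```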
>> If we start the fifth cycle, the Injector crashes with the following
>> error log:
>>
>> 2009-02-12 00:00:05,015 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
>> 2009-02-12 00:00:05,023 INFO jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
>> 2009-02-12 00:00:05,358 INFO mapred.FileInputFormat - Total input paths to process : 2
>> 2009-02-12 00:00:05,524 INFO mapred.JobClient - Running job: job_local_0002
>> 2009-02-12 00:00:05,528 INFO mapred.FileInputFormat - Total input paths to process : 2
>> 2009-02-12 00:00:05,553 INFO mapred.MapTask - numReduceTasks: 1
>> 2009-02-12 00:00:05,554 INFO mapred.MapTask - io.sort.mb = 100
>> 2009-02-12 00:00:05,828 INFO mapred.MapTask - data buffer = 79691776/99614720
>> 2009-02-12 00:00:05,828 INFO mapred.MapTask - record buffer = 262144/327680
>> 2009-02-12 00:00:06,538 INFO mapred.JobClient - map 0% reduce 0%
>> 2009-02-12 00:00:07,262 WARN mapred.LocalJobRunner - job_local_0002
>> java.lang.RuntimeException: java.lang.NullPointerException
>>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
>>     at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
>>     at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
>>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>     at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>> Caused by: java.lang.NullPointerException
>>     at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
>>     ... 13 more
>> 2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: java.io.IOException: Job failed!
>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>>     at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
>>     at org.apache.nutch.crawl.Injector.run(Injector.java:190)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>     at org.apache.nutch.crawl.Injector.main(Injector.java:180)
>>
>> After that the crawldb is broken and can no longer be accessed, e.g. with
>> the readdb <crawldb> -stats command.
>> When we use Fetcher instead of Fetcher2 for exactly the same task, we can
>> do as many cycles as we like without any problems or crashes.
>>
>> Besides this error, we've observed that the fetch cycle with Fetcher is
>> about twice as fast as with Fetcher2, although we use exactly the same
>> settings in nutch-site.xml:
>> generate.max.per.host - 100
>> fetcher.threads.per.host - 1
>> fetcher.server.delay - 0
>> for an initial URL list with 30 URLs of different hosts.
>>
>> Has anybody observed similar errors or performance issues?
>>
>
> Fetcher vs. Fetcher2 performance is a confusing issue: there have been
> reports of each being faster than the other. Fetcher2 has a much more
> flexible and smarter architecture than Fetcher, so I can only think that
> some sort of bug in Fetcher2 degrades its performance.
>
> However, your other problem (the Fetcher2 crash) is very weird. I went
> through the Fetcher and Fetcher2 code, and there is nothing different in
> them that would make one work and the other fail. Does this error happen
> consistently if you try it again with Fetcher2 from scratch?
>
>> Kind regards,
>> Martina
>
> --
> Doğacan Güney
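For reference, the nutch-site.xml overrides quoted earlier in the thread would look roughly like this. A sketch showing only the properties actually mentioned in the report; the plugin.includes value is an assumption reconstructed from the plugin list in Martina's mail:

```xml
<!-- nutch-site.xml (sketch of the settings described in this thread) -->
<configuration>
  <!-- Assumed from "activated plugins: protocol-http, parse-html, feed" -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|parse-html|feed</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>0</value>
  </property>
</configuration>
```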
--
Doğacan Güney
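The NullPointerException in the trace above bottoms out in ConcurrentHashMap.get inside ReflectionUtils.newInstance. That is what one would expect when MapWritable.readFields resolves a serialized class id to null (the id-to-class mapping was lost, e.g. through corrupted metadata) and the null Class then gets used as a cache key, since ConcurrentHashMap rejects null keys. A minimal JDK-only sketch of that mechanism; this is not Hadoop code, and the class and map names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MapWritableNpeSketch {

    // Stand-in for MapWritable's id-to-class table.
    static final Map<Byte, Class<?>> ID_TO_CLASS = new HashMap<>();

    // Stand-in for ReflectionUtils' constructor cache: a ConcurrentHashMap,
    // which throws NullPointerException on get(null).
    static final ConcurrentHashMap<Class<?>, Object> CONSTRUCTOR_CACHE =
            new ConcurrentHashMap<>();

    static {
        ID_TO_CLASS.put((byte) 1, String.class); // only id 1 is registered
    }

    /** Mimics reading one map entry: resolve the class id, then consult the cache. */
    static Object readValue(byte classId) {
        Class<?> clazz = ID_TO_CLASS.get(classId);   // null for an unknown id
        return CONSTRUCTOR_CACHE.get(clazz);         // NPE when clazz == null
    }

    public static void main(String[] args) {
        // A registered id resolves fine (cache miss, but no crash).
        System.out.println(readValue((byte) 1));
        try {
            // An unregistered id, as after crawldb corruption, blows up
            // exactly where the stack trace shows: ConcurrentHashMap.get.
            readValue((byte) 2);
        } catch (NullPointerException e) {
            System.out.println("NPE from ConcurrentHashMap.get(null), as in the log");
        }
    }
}
```

The point of the sketch is that the corruption happens at write time (a metadata entry whose class mapping is missing or broken), while the crash only surfaces later at read time, which matches the observation that the crawldb fails some cycles after the first "Can't read fetch time" warnings.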