Sorry, the mail I just sent was incomplete. This is the complete version:

Hello,
I have a strange problem while fetching; maybe someone can point me in the right direction? I run a couple of inject/generate/fetch/update cycles to crawl. In the last cycle approx. 600,000 docs should be fetched, but only 150,000 are actually fetched. The last thing I see in the log file is:

2006-12-04 14:24:11,353 WARN  fetcher.Fetcher - Aborting with 1 hung threads.
2006-12-04 14:24:11,353 INFO  mapred.LocalJobRunner - 152328 pages, 5801 errors, 1.4 pages/s, 616 kb/s,
2006-12-04 14:24:11,497 INFO  mapred.JobClient -  map 100% reduce 0%
2006-12-04 14:24:14,221 INFO  fetcher.Fetcher - fetch of http://www.microbes.info/forums/index.php?s=7179874ada709ad4d9874517f2790ef0& failed with: java.lang.NullPointerException
2006-12-04 14:24:14,238 FATAL fetcher.Fetcher - java.lang.NullPointerException
2006-12-04 14:24:14,241 FATAL fetcher.Fetcher -   at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:198)
2006-12-04 14:24:14,241 FATAL fetcher.Fetcher -   at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:189)
2006-12-04 14:24:14,241 FATAL fetcher.Fetcher -   at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
2006-12-04 14:24:14,241 FATAL fetcher.Fetcher -   at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:314)
2006-12-04 14:24:14,241 FATAL fetcher.Fetcher -   at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:232)
2006-12-04 14:24:14,241 FATAL fetcher.Fetcher - fetcher caught:java.lang.NullPointerException

Then nothing happens; approx. 10 minutes later map-reduce comes up with:

2006-12-04 14:33:06,169 INFO  mapred.LocalJobRunner - reduce > sort
2006-12-04 14:33:07,586 INFO  mapred.JobClient -  map 100% reduce 33%

My questions are:
- Does the use of the log level "FATAL" mean that the fetch process aborts, no matter whether there are more URLs to be fetched?
- Does this error point to a problem on the local machine? If so, what should I look at?
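For context, the cycle described above is driven by a shell script that looks roughly like the following (a simplified sketch, not the exact script; the crawl directory names and the "urls" seed file are placeholders, and the loop count stands in for however many cycles I run):

```shell
#!/bin/sh
# Sketch of the crawl cycle (paths and iteration count are placeholders).

NUTCH=bin/nutch
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments

# Seed the crawldb with the initial URL list.
$NUTCH inject $CRAWLDB urls

for i in 1 2 3 4 5; do
    # Generate a fetch list; note that no -topN or -adddays options are passed.
    $NUTCH generate $CRAWLDB $SEGMENTS

    # The newest directory under $SEGMENTS is the segment just generated.
    SEGMENT=$SEGMENTS/`ls -t $SEGMENTS | head -1`

    # Fetch the segment, then fold the results back into the crawldb.
    $NUTCH fetch $SEGMENT
    $NUTCH updatedb $CRAWLDB $SEGMENT
done
```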
Setup:
- Nutch 0.8.1 on a local machine
- 2 GB RAM, heap size 1 GB
- The fetch process is called by a shell script (no -adddays or -topN parameters).

When I dump that segment afterwards (using the segread tool), I can see all 600,000+ URLs, but only 150,000 of them have status fetched_success.

Any help would be really appreciated.

Best regards,
Karsten

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general