Sorry, the mail I just sent was incomplete. This is the complete version:

Hello,

I have a strange problem while fetching;
maybe someone can point me in the right direction?

I run a couple of inject/generate/fetch/updatedb cycles to crawl (sketched below).
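
For context, a single cycle looks roughly like this (the directory names
are placeholders, not my actual paths):

  # one crawl cycle with the Nutch 0.8 command-line tools
  bin/nutch inject crawl/crawldb urls               # seed URLs (first cycle only)
  bin/nutch generate crawl/crawldb crawl/segments   # no -topN or -adddays
  segment=`ls -d crawl/segments/* | tail -1`        # newest generated segment
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment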

In the last cycle, approx. 600,000 docs should be fetched, but only about
150,000 are actually fetched.

The last thing I see in the log file is:

2006-12-04 14:24:11,353 WARN  fetcher.Fetcher - Aborting with 1 hung threads.
2006-12-04 14:24:11,353 INFO  mapred.LocalJobRunner - 152328 pages, 5801 errors, 1.4 pages/s, 616 kb/s,
2006-12-04 14:24:11,497 INFO  mapred.JobClient -  map 100%  reduce 0%
2006-12-04 14:24:14,221 INFO  fetcher.Fetcher - fetch of http://www.microbes.info/forums/index.php?s=7179874ada709ad4d9874517f2790ef0&; failed with: java.lang.NullPointerException
2006-12-04 14:24:14,238 FATAL fetcher.Fetcher - java.lang.NullPointerException
2006-12-04 14:24:14,241 FATAL fetcher.Fetcher - at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:198)
2006-12-04 14:24:14,241 FATAL fetcher.Fetcher - at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:189)
2006-12-04 14:24:14,241 FATAL fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
2006-12-04 14:24:14,241 FATAL fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:314)
2006-12-04 14:24:14,241 FATAL fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:232)
2006-12-04 14:24:14,241 FATAL fetcher.Fetcher - fetcher caught:java.lang.NullPointerException


Then nothing happens; approx. 10 minutes later, MapReduce comes up with:
2006-12-04 14:33:06,169 INFO  mapred.LocalJobRunner - reduce > sort
2006-12-04 14:33:07,586 INFO  mapred.JobClient -  map 100%  reduce 33%

My questions are:
- Does the log level "FATAL" mean that the fetch process aborts, no
  matter whether there are more URLs to be fetched?
- Does this error point to a problem on the local machine? If so, what
  should I look at?

Setup:
- Nutch 0.8.1 on a local machine
- 2 GB RAM, heap size 1 GB
- The fetch process is called by a shell script (no -adddays or -topN
  parameters); see the sketch below.
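
For completeness, the fetch part of the script looks roughly like this
(again with placeholder paths; the 1 GB heap from above is set via
NUTCH_HEAPSIZE, the bin/nutch knob for the JVM -Xmx value):

  export NUTCH_HEAPSIZE=1000                  # JVM heap in MB
  segment=`ls -d crawl/segments/* | tail -1`  # newest generated segment
  bin/nutch fetch $segment                    # no -threads override, so the
                                              # fetcher.threads.fetch default applies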

When I dump that segment afterwards (using the segread tool), I can
see all 600,000+ URLs, but only about 150,000 of them have status
fetch_success.
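
Concretely, the check was something like this (I'm using readseg here,
the 0.8 alias for the SegmentReader tool; the dump directory name is a
placeholder):

  bin/nutch readseg -dump $segment seg_dump   # $segment as above
  grep -c fetch_success seg_dump/dump         # count successfully fetched entries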

Any help would be really appreciated,

Best regards
Karsten
