I'm running into an interesting problem that I think comes down to the interplay
of a few settings, and I'm not entirely clear on how they affect the crawl.

Currently I have:

content.limit = -1
fetcher.threads = 1000
fetcher.threads.per.host = 100
indexer.max.tokens = 750000

I also increased the Java heap space to account for the additional tokens.
I'm not getting any out-of-memory errors, so that part should be okay.
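
In case it matters, here is roughly how I understand those settings map onto
nutch-site.xml. I'm assuming "content.limit" corresponds to http.content.limit
and "fetcher.threads" to fetcher.threads.fetch; please correct me if those
aren't the right property names:

  <property>
    <name>http.content.limit</name>
    <!-- -1 = no limit; this is the value I'm currently experimenting with -->
    <value>-1</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>1000</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>indexer.max.tokens</name>
    <value>750000</value>
  </property>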

The problem is that with the content limit set very high or unlimited (I have
tried other values as well), I get fetch errors with NullPointerExceptions on
one set of files (HTML files); these are fairly large HTML files, but none
over 1MB. If I set the content limit to a reasonable amount, say 5MB, the
NullPointerExceptions go away, but I get a lot of truncation errors on a
different group of files (PDF files, all over 5MB).

I'm trying to find a sweet spot where I can fetch and index all of my PDF
files without the crawl bombing out, which it does if it hits too many
errors.

I'm not sure if the threads and threads-per-host settings play any role. I
feel like I got a better crawl when I had them set a little more modestly,
but I read in another thread somewhere that a good server should handle those
settings, and I'm running this on a quad-core Opteron server.

I'm also not sure whether some of the parse settings are affecting anything.
I got rid of index-more, but ultimately I think I'd like to put it back if
I can.
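
For reference, my plugin.includes currently looks something like this (this
is from memory, so treat it as a sketch; index-more is the piece I removed,
and parse-pdf is in there because I need the PDF content):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

As far as I understand it, putting index-more back would just mean changing
index-basic to index-(basic|more) in that value.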


