I'm having an interesting problem that I think revolves around the interplay of a few settings, and I'm not really clear on how they affect the crawl.
Currently I have:

  content.limit = -1
  fetcher.threads = 1000
  fetcher.threads.per.host = 100
  indexer.max.tokens = 750000

I also increased the JAVA_HEAP space to account for the additional tokens. I'm not getting any out-of-memory errors, so that part should be okay.

The problem is that with the content limit set high, or not set at all (I have tried other values), I get fetch errors with NullPointerExceptions on one set of files (HTML files). These are fairly large HTML files, but not over 1 MB. If I set the content limit to a reasonable amount, say 5 MB, the NullPointerExceptions go away, but then I get a lot of truncation errors on a different group of files (PDF files, all over 5 MB). I'm trying to find a sweet spot where I can fetch and index all of my PDF files without the crawl bombing out, which it does if it gets too many errors.

I'm not sure whether the threads and threads-per-host settings play any role. I feel like I got a better crawl when I had them set a little more modestly, but I read in another thread somewhere that a good server should handle those settings, and I'm running this on a quad-core Opteron server. I'm also not sure whether some of the parse settings are affecting anything. I got rid of index-more, but ultimately I think I'd like to put that back if I can.
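For reference, here is roughly how I have those overrides in conf/nutch-site.xml. I'm reconstructing this from memory, and I'm assuming http.content.limit is the property I actually want for the content limit (that's the name for the HTTP protocol plugin; there are separate file/ftp variants), and that fetcher.threads.fetch is the full name for the thread count:

  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>Byte limit for content fetched over HTTP; a negative value disables truncation.</description>
  </property>

  <property>
    <name>fetcher.threads.fetch</name>
    <value>1000</value>
    <description>Total number of fetcher threads.</description>
  </property>

  <property>
    <name>fetcher.threads.per.host</name>
    <value>100</value>
    <description>Maximum number of threads allowed to fetch from one host at a time.</description>
  </property>

  <property>
    <name>indexer.max.tokens</name>
    <value>750000</value>
    <description>Maximum number of tokens indexed per document field.</description>
  </property>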
