I'm using Nutch 0.9. It appears that Nutch is ignoring the http.content.limit number in the config file. I have left this setting at the default (64K), and the httpclient plugin logs that value (...httpclient.Http - http.content.limit = 65536), yet Nutch is attempting to fetch a 115MB file.
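For reference, this is the property as it ships in conf/nutch-default.xml (reproduced from memory, so the description wording may differ slightly from your copy):

```xml
<!-- from conf/nutch-default.xml; override in nutch-site.xml if needed -->
<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>
```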
I have verified that the server is indeed sending the Content-Length header. As you might imagine, the fetch takes a very long time, and the Fetcher eventually gives up on that thread: 2007-05-10 08:29:31,963 WARN fetcher.Fetcher - Aborting with 1 hung threads. Then, the next time around the generate/fetch/update loop, Nutch tries to fetch that same document again. I'm doing a very deep crawl, and this combination of behaviors eventually left me with 10 threads all hung fetching the same file. Bandwidth issues followed...

I've groveled through the code, and while I can see HttpBase reading the http.content.limit value from the config, I can't find anywhere that the value is actually used.

Now, this is an flv file, and sure, I could easily filter the URL out based on its extension, but I'd like to find a more generic solution. So is it true that http.content.limit is not implemented yet, or is there something I don't understand about the way it is meant to work?

Regards
Charlie
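P.S. In the meantime I'm considering the extension-based workaround in conf/regex-urlfilter.txt, something like the line below (untested on my end, and the (?i) flag assumes the filter uses standard Java regex syntax):

```
# Stopgap: skip .flv URLs regardless of case -- not a real fix for the content limit
-(?i)\.flv$
```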
