I'm using Nutch 0.9. It appears that Nutch is ignoring the http.content.limit number in the config file. I have left this setting at the default (64K), and the httpclient plugin logs that value (...httpclient.Http - http.content.limit = 65536), yet Nutch is attempting to fetch a 115MB file.
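For reference, this is the property as it ships in conf/nutch-default.xml (reproduced from memory, so the description wording may differ slightly from your copy):

```xml
<!-- from conf/nutch-default.xml; override in nutch-site.xml if needed -->
<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>
```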
I have verified that the server is indeed sending the Content-Length header. As you might imagine, the fetch takes a very long time, and the Fetcher eventually gives up on that thread: 2007-05-10 08:29:31,963 WARN fetcher.Fetcher - Aborting with 1 hung threads. Then, the next time around the generate/fetch/update loop, Nutch tries to fetch that same document again. I'm doing a very deep crawl, and this combination of behaviors eventually left me with 10 threads all hung fetching the same file. Bandwidth issues followed...

I've groveled through the code, and while I can see HttpBase reading the http.content.limit value from the config, I can't find anywhere that the value is actually used.

Now, this is an flv file, and sure, I could easily filter the URL out based on its extension, but I'd like to find a more generic solution. So is it true that http.content.limit is not implemented yet, or is there something I don't understand about the way it is meant to work?

Regards
Charlie
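P.S. In the meantime I'm considering the extension-based workaround in conf/regex-urlfilter.txt, something like the line below (untested on my end, and the (?i) flag assumes the filter uses standard Java regex syntax):

```
# Stopgap: skip .flv URLs regardless of case -- not a real fix for the content limit
-(?i)\.flv$
```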
