Hi,
I used nutch-0.7.1 to index an intranet. It is a really great tool,
thanks for developing it! I had to hack together something quick for
authentication (somehow I couldn't get the crawler to accept
http.auth.basic.user etc.). I also found an issue where parsing an HTML
page returned the error "Content type is xml not html". It turns out that
the header name is sometimes sent as "Content-Type" instead of
"Content-type", so a lookup for "Content-type" alone comes back empty.
So I hacked the toContent method in HttpResponse.java like this:
String contentType = getHeader("Content-type");
if (contentType == null) {
    // Fall back to the other capitalization of the header name.
    contentType = getHeader("Content-Type");
}
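The two-spelling check above only covers those two capitalizations. Since HTTP header names are case-insensitive, a more general approach might be to store headers in a map that ignores case. Below is a minimal sketch of that idea; the class CaseInsensitiveHeaders and its put/get methods are hypothetical names made up for illustration and are not part of Nutch or HttpResponse.java.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of Nutch.
class CaseInsensitiveHeaders {
    // A TreeMap ordered by String.CASE_INSENSITIVE_ORDER treats
    // "Content-Type", "Content-type" and "CONTENT-TYPE" as the same key.
    private final Map<String, String> headers =
        new TreeMap<String, String>(String.CASE_INSENSITIVE_ORDER);

    // Store a header; a later put with a differently cased name overwrites it.
    void put(String name, String value) {
        headers.put(name, value);
    }

    // Look up a header regardless of capitalization; null if absent.
    String get(String name) {
        return headers.get(name);
    }

    public static void main(String[] args) {
        CaseInsensitiveHeaders h = new CaseInsensitiveHeaders();
        h.put("Content-Type", "text/html");
        System.out.println(h.get("Content-type")); // prints "text/html"
        System.out.println(h.get("content-type")); // prints "text/html"
    }
}
```

With something like this, a single get("Content-type") would match whatever capitalization the server sent, so no fallback lookup would be needed.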
Just thought I'd share this with you all.
Thanks,
Thushara