Hi,
Feng Ji wrote:
Hi there,
I am getting a large percentage of fetch errors from httpclient in the
hadoop log, as follows:
"
httpclient.HttpMethodDirector
:
httpclient.HttpMethodDirector - Redirect requested but followRedirects is disabled
:
"
I am not sure this is an error. The protocol-httpclient plugin
uses Apache's commons-httpclient library to request pages.
This library can normally follow redirects itself, but since Nutch's
fetcher handles redirects, httpclient's followRedirects setting
is disabled. So when a request returns a redirect, httpclient reports that
it will not follow it (hence the output above).
Nutch's fetcher sees the redirect and makes a new request to the redirected URL.
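To illustrate the division of labour described above, here is a minimal Python sketch. It is not Nutch or commons-httpclient code: FAKE_WEB, request_once and fetch are hypothetical names, the URLs are made up, and the network is replaced by a dictionary so the example runs on its own. The point is only that the protocol layer logs the redirect and stops, while the fetcher layer issues the follow-up request.

```python
# Hypothetical stand-in for the network (not real Nutch data):
# URL -> (status code, redirect target or page body).
FAKE_WEB = {
    "http://example.com/a": (302, "http://example.com/b"),
    "http://example.com/b": (301, "http://example.com/c"),
    "http://example.com/c": (200, "page content"),
}


def request_once(url):
    """Protocol layer: make one request and never follow a redirect."""
    status, payload = FAKE_WEB[url]
    if status in (301, 302):
        # Informational message only; the redirect is left to the caller.
        print("Redirect requested but followRedirects is disabled")
    return status, payload


def fetch(url, max_redirects=3):
    """Fetcher layer: follow redirects by issuing new requests itself."""
    for _ in range(max_redirects + 1):
        status, payload = request_once(url)
        if status in (301, 302):
            url = payload  # next request goes to the redirect target
            continue
        return url, status, payload
    raise RuntimeError("too many redirects")


if __name__ == "__main__":
    print(fetch("http://example.com/a"))
```

In this sketch the message is printed once per redirect even though the fetch succeeds in the end, which is why such lines in the log do not by themselves indicate a failure.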
I set up plugin.includes in nutch-site.xml as
"
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|htmlsig)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|auth-(httpbasic|httpform)</value>
"
The default Nutch 0.8 plugin.includes is
"
<value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
"
Is my setting wrong, and could it be causing the problem? My small-scale
test is fine, but at large scale I get lots of fetcher failures.
Should I just use the default Nutch 0.8 plugin.includes setting?
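A rough way to compare the two settings is to treat each plugin.includes value as a regular expression and check which plugin ids it accepts. The Python sketch below does that with the two values quoted above. It assumes the expression must match the whole plugin id, which is an assumption about how Nutch applies it, and the helper name enabled is hypothetical.

```python
import re

# The two plugin.includes values quoted in this thread.
CUSTOM = ("protocol-httpclient|urlfilter-regex|parse-(text|html|htmlsig)"
          "|index-basic|query-(basic|site|url)|summary-basic|scoring-opic"
          "|auth-(httpbasic|httpform)")
DEFAULT = ("protocol-http|urlfilter-regex|parse-(text|html|js)"
           "|index-basic|query-(basic|site|url)|summary-basic|scoring-opic")


def enabled(includes, plugin_id):
    """Return True if the whole plugin id matches the includes expression.

    Assumption: full-string matching, as sketched here with re.fullmatch.
    """
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in ("protocol-http", "protocol-httpclient", "parse-js"):
        print(plugin_id, enabled(CUSTOM, plugin_id), enabled(DEFAULT, plugin_id))
```

Under that assumption, the custom value swaps protocol-http for protocol-httpclient and drops parse-js, and those are the only differences among the ids that appear in both values.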
Thanks,
Michael
By the way, where do I tell Nutch to crawl PDF and Word files in
Nutch 0.8?