Hi,

Feng Ji wrote:
hi there,

I am getting a huge percentage of fetch errors for httpclient in the hadoop
log, as follows:

"
httpclient.HttpMethodDirector
:
httpclient.HttpMethodDirector - Redirect requested but followRedirects is
disabled
:
"

I am not sure this is an error. The protocol-httpclient plugin uses Apache's commons-httpclient library to request pages. This library can normally follow redirects by itself, but since Nutch's fetcher handles redirects, httpclient's followRedirects is disabled. So when a request returns a redirect, httpclient reports that it won't be following it (hence the above output),
and Nutch's fetcher sees this and makes a new request to the redirected URL.
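
To make that division of labour concrete, here is a minimal Java sketch. It is not Nutch's actual fetcher code: RedirectSketch, Response, fetch and the maxRedirects parameter are names made up for illustration, and the protocol layer is simulated with a map instead of commons-httpclient.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch only; not taken from Nutch or commons-httpclient. */
final class RedirectSketch {

    /** Minimal stand-in for an HTTP response: status code plus optional Location header. */
    static final class Response {
        final int status;
        final String location; // null unless the response is a redirect

        Response(int status, String location) {
            this.status = status;
            this.location = location;
        }
    }

    /**
     * Plays the role of the fetcher. The protocol layer never follows
     * redirects itself, so every 3xx answer leads to a new request here.
     * Returns the final URL, or null once maxRedirects is exceeded.
     */
    static String fetch(String url, Function<String, Response> protocol, int maxRedirects) {
        String current = url;
        for (int hops = 0; hops <= maxRedirects; hops++) {
            Response r = protocol.apply(current);
            boolean redirect = r.status >= 300 && r.status < 400 && r.location != null;
            if (!redirect) {
                return current; // final answer (success or error), nothing left to follow
            }
            // Roughly where a client with redirects disabled would log
            // "Redirect requested but followRedirects is disabled".
            current = r.location;
        }
        return null; // gave up: too many redirects
    }

    public static void main(String[] args) {
        // Simulated protocol layer: URL -> canned response (assumed test data).
        Map<String, Response> pages = new HashMap<>();
        pages.put("http://example.com/old", new Response(301, "http://example.com/new"));
        pages.put("http://example.com/new", new Response(200, null));

        System.out.println(fetch("http://example.com/old", pages::get, 3));
    }
}
```

In this sketch the "redirects disabled" condition is an expected step in the loop, not a failure, which is why the log line alone does not have to mean a fetch error.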

I set up plugin.includes in nutch-site.xml as
"
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|htmlsig)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|auth-(httpbasic|httpform)</value>
"

The default nutch 08 plugin.includes is
"
<value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
"

Is my setting wrong, and is it causing the problem? My small-scale test is OK,
but at large scale I get lots of fetcher failures.

Should I just use the nutch-08 default plugin.includes setting?

thanks,

Michael,

By the way, where should I tell Nutch to crawl PDF and Word files in
nutch-08?

