On 8/30/06, Feng Ji <[EMAIL PROTECTED]> wrote:
Hi there,
I am getting a large percentage of fetch errors from httpclient in the Hadoop
log, as follows:
"
httpclient.HttpMethodDirector
:
httpclient.HttpMethodDirector - Redirect requested but followRedirects is
disabled
:
"
I set up plugin.includes in nutch-site.xml as
"
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|htmlsig)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|auth-(httpbasic|httpform)</value>
"
Maybe you want to use protocol-http instead. I have found it to work
better, though of course you lose some features, such as crawling
sites that require authentication.
The default Nutch 0.8 plugin.includes is
"
<value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
"
Is my setting wrong, and is that what causes the problem? My small-scale
testing is OK, but at large scale I get lots of fetcher failures.
Should I just use the Nutch 0.8 default plugin.includes setting?
thanks,
Michael,
How many fetchers do you have? I think you need to adjust the fetcher
threads value. Also, you could try this if you are just experimenting:
http://issues.apache.org/jira/browse/NUTCH-339
You can also try this, which might be interesting for running a
simulation to tweak your nutch-site.xml:
http://issues.apache.org/jira/browse/NUTCH-357
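For the fetcher threads value, here is a rough sketch of what the override in nutch-site.xml could look like. The property names fetcher.threads.fetch and fetcher.threads.per.host are the ones I would expect to find in nutch-default.xml for 0.8; the numbers are made-up examples, not recommendations, so check them against your own setup and the sites you crawl.

```xml
<!-- nutch-site.xml: hypothetical values, for illustration only -->
<property>
  <name>fetcher.threads.fetch</name>
  <!-- total number of fetcher threads per fetch task (example value) -->
  <value>20</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <!-- threads allowed to hit the same host at once (example value) -->
  <value>1</value>
</property>
```

Keeping the per-host value low is the polite choice; raising the total thread count mostly helps when the crawl covers many different hosts.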
By the way, where should I tell Nutch to crawl PDF and Word files in
Nutch 0.8?
In regex-urlfilter.txt, on the exclusion line where you have -gif|doc etc.
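Note that the URL filter only decides which URLs get fetched. To get text out of those files, the matching parser plugins probably need to be enabled as well. A sketch of plugin.includes based on the 0.8 default above, assuming the parse-pdf and parse-msword plugins are present in your plugins directory (I have not verified this against your install):

```xml
<!-- nutch-site.xml: default 0.8 value with pdf and msword added to the parse group (assumption) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>
```

If doc or pdf appears in the exclusion pattern in regex-urlfilter.txt, it would also have to be removed from that pattern, otherwise those URLs are dropped before they reach the parser.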