On 8/30/06, Feng Ji <[EMAIL PROTECTED]> wrote:
 hi there,

I am seeing a huge percentage of fetch errors from httpclient in the hadoop log,
such as the following:

"
httpclient.HttpMethodDirector
:
httpclient.HttpMethodDirector - Redirect requested but followRedirects is
disabled
:
"

I set up plugin.includes in nutch-site.xml as
"
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|htmlsig)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|auth-(httpbasic|httpform)</value>
"

Maybe you want to use protocol-http instead. I have found it to be
better. Of course, you lose some features, like crawling sites that require auth.

The default nutch 08 plugin.includes is
"<value>
protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
"
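
As a rough sketch of the protocol-http suggestion above, the override in nutch-site.xml could look like the fragment below. The value is simply the stock default quoted above; the comment and description text are illustrative, and the auth-* entries are left out on the assumption that they are only useful together with protocol-httpclient. The property goes inside the existing <configuration> element.

```xml
<!-- nutch-site.xml: sketch only, overrides the plugin.includes default -->
<property>
  <name>plugin.includes</name>
  <!-- protocol-http in place of protocol-httpclient; rest is the quoted default -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  <description>Illustrative description: plugins to load for this crawl.</description>
</property>
```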

Is my setting wrong, and is that causing the problem? My small-scale testing is OK,
but at large scale I get lots of fetcher failures.


Should I just use the nutch-08 default plugin.includes setting?

thanks,

Michael,

How many fetchers do you have? I think you need to adjust the fetcher
threads value. Also, you could try this if you are just experimenting:

http://issues.apache.org/jira/browse/NUTCH-339

You can also try this, which might be interesting for working out how
to tweak your nutch-site.xml:

http://issues.apache.org/jira/browse/NUTCH-357
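
On the fetcher threads value mentioned above, a minimal sketch of what the nutch-site.xml overrides might look like follows. It assumes the fetcher.threads.fetch and fetcher.threads.per.host properties from nutch-default.xml; the numbers are placeholders to tune against your own crawl, not recommendations.

```xml
<!-- nutch-site.xml: sketch only, values are placeholders -->
<property>
  <name>fetcher.threads.fetch</name>
  <!-- total number of fetcher threads per fetch task -->
  <value>10</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <!-- how many of those threads may hit the same host at once -->
  <value>1</value>
</property>
```

Raising the total thread count while keeping the per-host value low is the usual cautious starting point, since a high per-host value puts more load on each individual server.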

By the way, where should I tell nutch to download PDF and Word files in
nutch-08?

In regex-urlfilter.txt, on the line where you have -gif|doc etc.
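
Letting the URLs through the filter is only half of it; the fetched documents also need a parser. A sketch, assuming the parse-pdf and parse-msword plugins are present in your nutch-08 plugins directory (check before relying on this), would add them to plugin.includes in nutch-site.xml:

```xml
<!-- nutch-site.xml: sketch only, assumes parse-pdf and parse-msword are installed -->
<property>
  <name>plugin.includes</name>
  <!-- quoted default value with pdf and msword added to the parse-(...) group -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>
```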
