Hi,

It depends on which protocol plugin is enabled in your
'conf/nutch-site.xml'. The property to look for is 'plugin.includes'
in the XML file. If it is not present in 'conf/nutch-site.xml', you
are using the default 'plugin.includes' from
'conf/nutch-default.xml'.
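
If you would rather check this from a program than by eye, here is a
minimal sketch that looks the property up in the site file first and
then in the defaults. It is not part of Nutch; the class name
PluginIncludesCheck and the helper findProperty are made up for this
example, it assumes it is run from the Nutch installation directory,
and the property name is passed as a parameter so you can change it if
your file spells it differently.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch.
class PluginIncludesCheck {

    /** Returns the <value> of the named <property>, or null if it is absent. */
    public static String findProperty(InputStream xml, String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xml);
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            NodeList names = prop.getElementsByTagName("name");
            NodeList values = prop.getElementsByTagName("value");
            if (names.getLength() > 0 && values.getLength() > 0
                    && name.equals(names.item(0).getTextContent().trim())) {
                return values.item(0).getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Assumed paths, relative to the Nutch installation directory.
        // The site file is checked first because it overrides the defaults.
        String[] files = { "conf/nutch-site.xml", "conf/nutch-default.xml" };
        for (String path : files) {
            File f = new File(path);
            if (!f.isFile()) {
                continue;
            }
            try (InputStream in = new FileInputStream(f)) {
                String value = findProperty(in, "plugin.includes");
                if (value != null) {
                    System.out.println(path + ": " + value);
                    return;
                }
            }
        }
        System.out.println("plugin.includes not found in either file");
    }
}
```

Whichever of protocol-http or protocol-httpclient appears in the
printed value is the plugin doing the fetching.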

If protocol-http is enabled, the code to read is:

src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java

If protocol-httpclient is enabled, the code to read is:

src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java

Enabling DEBUG logs in 'conf/log4j.properties' will also give you
clues about the problems. The logs are written to 'logs/hadoop.log'.
To enable the DEBUG logs for a particular package, say, the httpclient
package, you can open 'conf/log4j.properties' and add the following
line:

log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
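
For the small test program you mentioned, here is a minimal sketch
that fetches one page with the JDK's own java.net.HttpURLConnection.
It is not Nutch code and does not reproduce what either plugin does
internally, so treat it only as a first check of whether the page is
reachable over HTTP at all. The class name FetchCheck, the User-Agent
string and the 10 second timeout are assumptions made up for this
example, and the body is decoded as UTF-8 for simplicity.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

// Hypothetical standalone test program, not part of Nutch.
class FetchCheck {

    /** Status code, response headers and body of one fetch. */
    static final class Result {
        final int status;
        final Map<String, List<String>> headers;
        final String body;

        Result(int status, Map<String, List<String>> headers, String body) {
            this.status = status;
            this.headers = headers;
            this.body = body;
        }
    }

    public static Result fetch(String url, String userAgent, int timeoutMs)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(timeoutMs);
        conn.setReadTimeout(timeoutMs);
        // Do not follow redirects, so a 3xx answer is reported as such.
        conn.setInstanceFollowRedirects(false);
        conn.setRequestProperty("User-Agent", userAgent);
        try {
            int status = conn.getResponseCode();
            // For 4xx/5xx responses the body is on the error stream.
            InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
            String body = (in == null) ? "" : readAll(in);
            return new Result(status, conn.getHeaderFields(), body);
        } finally {
            conn.disconnect();
        }
    }

    private static String readAll(InputStream in) throws IOException {
        try (InputStream src = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = src.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toString("UTF-8"); // assumes a UTF-8 body
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java FetchCheck <url>");
            return;
        }
        Result r = fetch(args[0], "FetchCheck/0.1 (test)", 10000);
        System.out.println("Status: " + r.status);
        // The status line shows up under a null header name.
        for (Map.Entry<String, List<String>> h : r.headers.entrySet()) {
            System.out.println(h.getKey() + ": " + h.getValue());
        }
        System.out.println("Body length: " + r.body.length() + " chars");
    }
}
```

If this fetches the page but Nutch does not, the difference is likely
in the request Nutch sends (agent name, headers, redirects) or in its
filters, and the DEBUG logs above are the place to compare.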

Regards,
Susam Pal

On Mon, Jun 16, 2008 at 9:52 PM, Del Rio, Ann <[EMAIL PROTECTED]> wrote:
> Good morning,
>
> Can you please point me to a Nutch documentation where I can find how nutch
> connects to the webpages when it crawls? I think it is through HTTP but i
> would like to confirm and get more details so i can write a very small test
> java program to connect to one of the webpages i am having trouble
> connecting / crawling. I bought Lucene in Action and am half way thru the
> book and so far there is very little about Nutch.
>
> Thanks,
> Ann Del Rio
> Ph: 408.376.6504
> E-mail: [EMAIL PROTECTED]
> Skype: delrio_alan
>
