Thank you for the great and detailed information, Susam!
I will post back my test program once it works.

Thanks, 
Ann Del Rio

-----Original Message-----
From: Susam Pal [mailto:[EMAIL PROTECTED] 
Sent: Monday, June 16, 2008 9:48 AM
To: [email protected]
Subject: Re: how does nutch connect to urls internally?

Hi,

It depends on which protocol plugin is enabled in your
'conf/nutch-site.xml'. The property to look for is 'plugins.include'
in the XML file. If this is not present in 'conf/nutch-site.xml', it
means you are using the default 'plugins.include' of
'conf/nutch-default.xml'.
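
To check quickly which protocol plugin a given 'plugins.include' value
selects, something like the sketch below may help. It assumes the value
is a regular expression that has to match the whole plugin id; the
class name and the sample value are made up for illustration, so paste
in the value from your own configuration.

```java
import java.util.regex.Pattern;

class PluginIncludeCheck {
    // Assumption: plugins.include is a regex that must match the entire plugin id.
    static boolean isEnabled(String pluginsInclude, String pluginId) {
        return Pattern.compile(pluginsInclude).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Illustrative value only; pass the real one as the first argument.
        String include = args.length > 0 ? args[0]
                : "protocol-http|urlfilter-regex|parse-(text|html)|index-basic";
        for (String id : new String[] {"protocol-http", "protocol-httpclient"}) {
            System.out.println(id + " enabled: " + isEnabled(include, id));
        }
    }
}
```

Because the match is against the whole id, a value that lists only
protocol-http does not also switch on protocol-httpclient.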

If protocol-http is enabled, then you have to go through the code in:-

src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java

If protocol-httpclient is enabled, then you have to go through:-

src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java

Enabling DEBUG logs in 'conf/log4j.properties' will also give you clues
about the problems. The logs are written to 'logs/hadoop.log'.
To enable the DEBUG logs for a particular package, say, the httpclient
package, you can open 'conf/log4j.properties' and add the following
line:

log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
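
For a small standalone test outside Nutch, a plain HTTP GET against the
problem page is a reasonable first check. The sketch below uses the
JDK's HttpURLConnection, not the Nutch classes listed above, so it only
approximates what the crawler does. The class name, the timeout and the
"my-test-agent" User-Agent string are placeholders; Nutch sends its own
agent string built from the http.agent.* properties, and some servers
answer differently depending on it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.List;
import java.util.Map;

class FetchTest {
    // Prepares a GET request; no network traffic happens until the response is read.
    static HttpURLConnection open(String url, String userAgent, int timeoutMs) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            conn.setRequestProperty("User-Agent", userAgent);
            // Show redirects as they are instead of following them silently.
            conn.setInstanceFollowRedirects(false);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java FetchTest <url> [user-agent]");
            return;
        }
        String agent = args.length > 1 ? args[1] : "my-test-agent"; // placeholder
        HttpURLConnection conn = open(args[0], agent, 10000);
        System.out.println("Status: " + conn.getResponseCode()
                + " " + conn.getResponseMessage());
        // The status line is reported under a null key by HttpURLConnection.
        for (Map.Entry<String, List<String>> h : conn.getHeaderFields().entrySet()) {
            System.out.println(h.getKey() + ": " + h.getValue());
        }
        conn.disconnect();
    }
}
```

Run it with the URL that fails to crawl and compare the status code and
headers with what shows up in 'logs/hadoop.log'.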

Regards,
Susam Pal

On Mon, Jun 16, 2008 at 9:52 PM, Del Rio, Ann <[EMAIL PROTECTED]> wrote:
> Good morning,
>
> Can you please point me to Nutch documentation that explains how
> Nutch connects to web pages when it crawls? I think it is through
> HTTP, but I would like to confirm and get more details so I can write a
> very small test Java program to connect to one of the web pages I am
> having trouble connecting to / crawling. I bought Lucene in Action and am
> halfway through the book, and so far there is very little about Nutch.
>
> Thanks,
> Ann Del Rio
