Thank you for the great and detailed information, Susam! I will post back my test program once it works.
Thanks,
Ann Del Rio

-----Original Message-----
From: Susam Pal [mailto:[EMAIL PROTECTED]
Sent: Monday, June 16, 2008 9:48 AM
To: [email protected]
Subject: Re: how does nutch connect to urls internally?

Hi,

It depends on which protocol plugin is enabled in your 'conf/nutch-site.xml'. The property to look for is 'plugins.include' in the XML file. If it is not present in 'conf/nutch-site.xml', you are using the default 'plugins.include' from 'conf/nutch-default.xml'.

If protocol-http is enabled, then you have to go through the code in:

src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java

If protocol-httpclient is enabled, then you have to go through:

src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java

Enabling DEBUG logs in 'conf/log4j.properties' will also give you clues about the problems. The logs are written to 'logs/hadoop.log'. To enable the DEBUG logs for a particular package, say, the httpclient package, you can open 'conf/log4j.properties' and add the following line:

log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout

Regards,
Susam Pal

On Mon, Jun 16, 2008 at 9:52 PM, Del Rio, Ann <[EMAIL PROTECTED]> wrote:
> Good morning,
>
> Can you please point me to Nutch documentation where I can find how
> Nutch connects to the webpages when it crawls? I think it is through
> HTTP but I would like to confirm and get more details so I can write a
> very small test Java program to connect to one of the webpages I am
> having trouble connecting to / crawling. I bought Lucene in Action and am
> halfway through the book and so far there is very little about Nutch.
>
> Thanks,
> Ann Del Rio
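
For the small test program discussed in this thread, a minimal sketch might look like the following. It uses only the JDK's java.net.HttpURLConnection, not the Nutch plugin classes listed above, so it only checks whether the page is reachable over plain HTTP and will not reproduce Nutch's own request behavior. The class name FetchTest, the "fetch-test" User-Agent value, and the 10-second timeouts are illustrative assumptions, not anything taken from Nutch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical stand-alone test class; not part of Nutch.
class FetchTest {

    /** The parts of the response that are most useful when debugging a fetch. */
    static final class Result {
        final int status;
        final String contentType;
        final int bodyBytes;

        Result(int status, String contentType, int bodyBytes) {
            this.status = status;
            this.contentType = contentType;
            this.bodyBytes = bodyBytes;
        }
    }

    /** Sends a GET request and returns the status, content type and body size. */
    static Result fetch(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10_000); // assumed timeout, in milliseconds
        conn.setReadTimeout(10_000);    // assumed timeout, in milliseconds
        // Assumed agent string; some servers respond differently per agent.
        conn.setRequestProperty("User-Agent", "fetch-test");
        try {
            int status = conn.getResponseCode();
            // For 4xx/5xx responses the body is on the error stream (may be null).
            InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
            int total = 0;
            if (in != null) {
                try (InputStream body = in) {
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = body.read(buf)) != -1) {
                        total += n;
                    }
                }
            }
            return new Result(status, conn.getContentType(), total);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FetchTest <url>");
            return;
        }
        Result r = fetch(args[0]);
        System.out.println("status=" + r.status
                + " contentType=" + r.contentType
                + " bodyBytes=" + r.bodyBytes);
    }
}
```

Running it against the problem URL and comparing the status code with the DEBUG output in 'logs/hadoop.log' should help show whether the failure is in the network path or in how the crawler issues the request.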
