Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
troubleshooting tips and information to be provided while asking for help

------------------------------------------------------------------------------
  == Introduction ==
- 'protocol-httpclient' is a protocol plugin which supports retrieving 
documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with 
Basic, Digest and NTLM authentication schemes for web server as well as proxy 
server.
+ 'protocol-httpclient' is a protocol plugin which supports retrieving 
documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with 
Basic, Digest and NTLM authentication schemes for web server as well as proxy 
server. This feature cannot perform POST-based authentication that depends on 
cookies. More information on this can be found at: HttpPostAuthentication
  
  == Necessity ==
- There were two plugins already present, viz. 'protocol-http' and 
'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, 
HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' 
supported HTTPS and had code for NTLM authentication but the NTLM 
authentication didn't work due to a bug. Some portions of 'protocol-httpclient' 
were re-written to solve these problems, provide additional features like 
authentication support for proxy server and better inline documentation for the 
properties to be used to configure authentication. The author (Susam Pal) of 
these features has tested it in Infosys Technologies Limited by crawling the 
corporate intranet requiring NTLM authentication and this has been found to 
work well.
+ There were two plugins already present, viz. 'protocol-http' and 
'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, 
HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' 
supported HTTPS and had code for NTLM authentication but the NTLM 
authentication didn't work due to a bug. Some portions of 'protocol-httpclient' 
were re-written to solve these problems, provide additional features like 
authentication support for proxy server and better inline documentation for the 
properties to be used to configure authentication.
  
  == JIRA NUTCH-559 ==
  These features were submitted as 
[https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559] in the JIRA. 
If you have checked out the latest Nutch trunk, you don't need to apply the 
patches. These features were included in the Nutch subversion repository in 
[http://svn.apache.org/viewvc?view=rev&revision=608972 revision #608972]
@@ -91, +91 @@

  'protocol-httpclient' is based on 
[http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons 
HttpClient]. Some servers support multiple schemes for authenticating users. 
Given that only one scheme may be used at a time for authenticating, it must 
choose which scheme to use. To accomplish this, it uses an order of preference 
to select the correct authentication scheme. By default this order is: NTLM, 
Digest, Basic. For more information on the behavior during authentication, you 
might want to read the 
[http://jakarta.apache.org/httpcomponents/httpclient-3.x/authentication.html 
HttpClient Authentication Guide].
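
  The preference order described above can be illustrated with a minimal 
sketch. This is not HttpClient's own code; the function name 'select_scheme' 
and the list name are hypothetical, and only the default order (NTLM, Digest, 
Basic) is taken from the text.

```python
# Illustration only: NOT HttpClient's implementation.
# The default order below is the one stated in the text: NTLM, Digest, Basic.
DEFAULT_PREFERENCE = ["ntlm", "digest", "basic"]


def select_scheme(offered, preference=DEFAULT_PREFERENCE):
    """Return the most preferred scheme among those the server offers.

    `offered` is an iterable of scheme names taken from the server's
    challenge; comparison is case-insensitive. Returns None when the
    server offers none of the preferred schemes.
    """
    offered_lower = {name.lower() for name in offered}
    for scheme in preference:
        if scheme in offered_lower:
            return scheme
    return None


if __name__ == "__main__":
    # A server offering Basic and Digest: Digest ranks higher in the order.
    print(select_scheme(["Basic", "Digest"]))
```

  Only one scheme is returned even when the server offers several, which 
mirrors the point that a single scheme is used at a time.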
  
  == Need Help? ==
- If you need help, please feel free to post your question to the 
[http://lucene.apache.org/nutch/mailing_lists.html#Users nutch-user mailing 
list].
+ If you need help, please feel free to post your question to the 
[http://lucene.apache.org/nutch/mailing_lists.html#Users nutch-user mailing 
list]. The author of this work, Susam Pal, usually responds to mails related to 
authentication problems. The DEBUG logs may be required to troubleshoot the 
problem. You must enable the debug log for 'protocol-httpclient' before running 
the crawler. To enable the debug log for 'protocol-httpclient', open 
'conf/log4j.properties' and add the following line:
+ {{{
+ log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
+ }}}
  
+ It would be good to check the following things before asking for help.
+ 
+  1. Have you overridden the 'plugin.includes' property of 
'conf/nutch-default.xml' with 'conf/nutch-site.xml' and replaced 
'protocol-http' with 'protocol-httpclient'?
+  1. If you patched Nutch 0.9 source code manually with this patch, did you 
build the project before running the crawler?
+  1. Have you configured 'conf/httpclient-auth.xml'?
+  1. Do you see Nutch trying to fetch the pages you were expecting in 
'logs/hadoop.log'? You should see some logs like "fetching 
http://www.example.com/expectedpage.html" where the URL is the page you were 
expecting to be fetched. If you don't see such lines for the pages you were 
expecting, the error is outside the scope of this feature. This feature comes 
into action only when the crawler is fetching a page but the page requires 
authentication.
+  1. With debug logs enabled, check whether there are logs beginning with 
"Credentials" in 'logs/hadoop.log'. The lines would look like "Credentials - 
username someuser; set ...". For every entry in 'conf/httpclient-auth.xml' you 
should find a corresponding log. If they are absent, you probably haven't 
overridden 'plugin.includes' to include 'protocol-httpclient'. If you patched 
the Nutch 0.9 source code manually, this can also happen when the project has 
not been rebuilt.
+  1. Do you see logs like this: "auth.!AuthChallengeProcessor - basic 
authentication scheme selected"? Instead of the word 'basic', you might see 
'digest' or 'NTLM' depending on the scheme supported by the page being fetched. 
If you do not see it at all, probably the web server or the page being fetched 
does not require authentication. In that case, the crawler would not try to 
authenticate. If you were expecting the page to require authentication, 
something probably needs to be fixed on the server side.
+  1. You should also see some logs that begin with: "Pre-configured 
credentials with scope". It is very unlikely that these logs would be missing 
once you have verified all the points above. If they are missing, please let us 
know on the mailing list.
+ 
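+ The log checks in the list above can be done by hand, or with a small 
script such as the following sketch. It only counts lines containing the 
substrings quoted in the checklist. The function names 'count_markers' and 
'report' are hypothetical, and the sketch assumes the log is at 
'logs/hadoop.log' and that debug logging was enabled before the crawl.

```python
from pathlib import Path

# Substrings quoted in the checklist; the labels are only for display.
CHECKS = {
    "fetch attempts": "fetching ",
    "credentials loaded": "Credentials - ",
    "auth scheme selected": "authentication scheme selected",
    "pre-configured credentials": "Pre-configured credentials with scope",
}


def count_markers(lines):
    """Count how many log lines contain each checklist substring."""
    counts = {label: 0 for label in CHECKS}
    for line in lines:
        for label, needle in CHECKS.items():
            if needle in line:
                counts[label] += 1
    return counts


def report(log_path="logs/hadoop.log"):
    """Print one summary line per checklist item for the given log file."""
    path = Path(log_path)
    if not path.is_file():
        print(f"{log_path}: not found")
        return None
    with path.open(encoding="utf-8", errors="replace") as handle:
        counts = count_markers(handle)
    for label, count in counts.items():
        status = "ok" if count else "MISSING"
        print(f"{label:28s} {count:6d}  {status}")
    return counts


if __name__ == "__main__":
    report()
```

+ A count of zero points to the corresponding checklist item. The "fetching " 
substring is deliberately loose, so it is still worth confirming by eye that 
the expected URLs appear.
+ 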
+ If you have checked the items listed above and are still unable to fix 
the problem, or are confused about any of the points, please mail the issue 
with the following information:
+ 
+  1. Version of Nutch you are running.
+  1. Did you get this feature directly from subversion or did you download the 
patch separately and apply it?
+  1. Relevant portion from 'logs/hadoop.log' file. If you are unsure which 
portion is relevant, send the complete file.
+ 
