[Nutch Wiki] Update of "HttpAuthenticationSchemes" by ArkadiKosmynin

Apache Wiki Tue, 29 Nov 2011 21:32:39 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "HttpAuthenticationSchemes" page has been changed by ArkadiKosmynin:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes?action=diff&rev1=24&rev2=25

  == Underlying HttpClient Library ==
  'protocol-httpclient' is based on 
[[http://hc.apache.org/httpclient-3.x/|Jakarta Commons HttpClient]]. Some 
servers support multiple schemes for authenticating users. Because only one 
scheme can be used at a time, !HttpClient must choose which scheme to use. 
To do so, it follows an order of preference when selecting the 
authentication scheme. By default this order is: NTLM, Digest, Basic. For more 
information on the behavior during authentication, you might want to read the 
[[http://hc.apache.org/httpclient-3.x/authentication.html|HttpClient 
Authentication Guide]].
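
The selection rule described above can be sketched in a few lines of shell. This is only an illustration of the "first match in preference order" idea, not !HttpClient's actual code; the function name `select_scheme` and its argument format (a space-separated, lowercase list of schemes offered by the server) are assumptions made for this example.

```shell
# Illustrative sketch only: pick the first scheme, in the default order of
# preference (NTLM, Digest, Basic), that the server offers.
# $1: space-separated, lowercase list of offered schemes (hypothetical format)
select_scheme() {
    offered=" $1 "
    for scheme in ntlm digest basic; do
        case "$offered" in
            *" $scheme "*) echo "$scheme"; return 0 ;;
        esac
    done
    return 1  # the server offered none of the three schemes
}
```

For example, `select_scheme "basic digest"` prints `digest`, because Digest ranks above Basic in the default order.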
  
+ == Troubleshooting ==
+ If you are having problems with your authentication configuration, step back, 
start with a very basic configuration, and add to it gradually, testing after 
each change, until your desired configuration works. First, check that the 
account your crawler uses is enabled and working on the server(s): try to 
access one of your test URLs with a web browser and, when prompted, enter the 
details of your crawler’s account. If this does not work, the problem is with 
the server and needs to be fixed there.
+ 
+ The configuration below can be used as a starting point. It specifies minimal 
detail, leaving the client and server maximum flexibility.
+ {{{
+ <auth-configuration>
+   <credentials username="crawler-user-name" password="crawler-password">
+     <default realm="domain" />
+   </credentials>
+ </auth-configuration>
+ }}}
+ 
+ To check if your configuration is working, you can use the ParserChecker:
+ {{{
+ ./nutch org.apache.nutch.parse.ParserChecker <your-test-URL>
+ }}}
+ 
+ You can tell whether the page was fetched successfully without looking at 
the logs. On success, ParserChecker displays the proper page title and the 
links extracted from the page. Otherwise, it displays a title like 
“You are not authorized to view this page” and few links, if any.
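
This check can be automated with a small filter, sketched here under stated assumptions: the helper name `looks_unauthorized` is hypothetical, and the phrase it searches for is simply the example title quoted above. Your server's error page may use different wording, so adjust the pattern accordingly.

```shell
# Hypothetical helper: reads ParserChecker output on standard input and
# succeeds (exit status 0) if it contains the "not authorized" phrase.
# The phrase is an assumption taken from the example title above.
looks_unauthorized() {
    grep -qi 'not authorized'
}
```

It could be used as `./nutch org.apache.nutch.parse.ParserChecker <your-test-URL> | looks_unauthorized && echo "authentication probably failed"`.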
+  
+ In the logs/hadoop.log file, search for AuthChallengeProcessor records 
similar to this:
+ 
+ {{{
+ INFO  auth.AuthChallengeProcessor - ntlm authentication scheme selected
+ }}}
+  
+ In case of failure, such a record will be followed by something like this:
+ 
+ {{{
+ INFO  httpclient.HttpMethodDirector - Failure authenticating ...
+ }}}
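
To pull both kinds of record out of the log in one pass, something like the following may help. It is a minimal sketch: the function name `auth_records` is made up for this example, and the patterns are taken only from the two sample records shown above, so real log lines may differ slightly.

```shell
# Hypothetical helper: print scheme-selection and authentication-failure
# records from a Nutch log file.
# $1: path to the log file (defaults to logs/hadoop.log)
auth_records() {
    log_file="${1:-logs/hadoop.log}"
    # -E enables alternation so both record types match in a single pass.
    grep -E 'auth\.AuthChallengeProcessor|httpclient\.HttpMethodDirector - Failure authenticating' "$log_file"
}
```

Run `auth_records` from the Nutch directory after a ParserChecker test; a scheme-selection record followed by a failure record suggests that the credentials were rejected.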
+ 
  == Need Help? ==
  If you need help, please feel free to post your question to the 
[[http://lucene.apache.org/nutch/mailing_lists.html#Users|nutch-user mailing 
list]]. The author of this work, [[http://susam.in/|Susam Pal]], usually 
responds to emails about authentication problems. DEBUG logs may be 
required to troubleshoot the problem, so enable debug logging for 
'protocol-httpclient' and Jakarta Commons !HttpClient before running the 
crawler. To do so, open 'conf/log4j.properties' and add the following lines:
  {{{
