Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "HttpAuthenticationSchemes" page has been changed by ArkadiKosmynin:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes?action=diff&rev1=24&rev2=25

== Underlying HttpClient Library ==
'protocol-httpclient' is based on [[http://hc.apache.org/httpclient-3.x/|Jakarta Commons HttpClient]]. Some servers support multiple schemes for authenticating users. Because only one scheme can be used at a time, !HttpClient must choose which one to use. To do so, it selects the authentication scheme according to an order of preference. By default this order is: NTLM, Digest, Basic. For more information on the behavior during authentication, see the [[http://hc.apache.org/httpclient-3.x/authentication.html|HttpClient Authentication Guide]].

+ == Troubleshooting ==
+ If you are having problems with your authentication configuration, it is a good idea to step back and start with a very basic configuration, then test it and extend it gradually until your desired configuration works. At the very start, check that the account your crawler uses is enabled and working on the server(s). To do this, try to access one of your test URLs with a web browser. When prompted, enter the details of your crawler’s account. If this does not work, the problem is with the server and needs to be fixed there.
+
+ The configuration below can be used as a starting point. It specifies minimal detail, which leaves the client and server maximum flexibility.
+ {{{
+ <auth-configuration>
+   <credentials username="crawler-user-name" password="crawler-password">
+     <default realm="domain" />
+   </credentials>
+ </auth-configuration>
+ }}}
+
+ To check whether your configuration is working, you can use the ParserChecker:
+ {{{
+ ./nutch org.apache.nutch.parse.ParserChecker <your-test-URL>
+ }}}
+
+ It is easy to see whether it has fetched the page successfully, even without looking into the logs. If the fetch is successful, it will display a proper page title and many links extracted from the page. Otherwise, it will display a title such as “You are not authorized to view this page” and few links, if any.
+
+ In the logs/hadoop.log file, search for AuthChallengeProcessor records similar to this:
+
+ {{{
+ INFO auth.AuthChallengeProcessor - ntlm authentication scheme selected
+ }}}
+
+ In case of failure, such a record will be followed by something like this:
+
+ {{{
+ INFO httpclient.HttpMethodDirector - Failure authenticating ...
+ }}}
+
== Need Help? ==
If you need help, please feel free to post your question to the [[http://lucene.apache.org/nutch/mailing_lists.html#Users|nutch-user mailing list]]. The author of this work, [[http://susam.in/|Susam Pal]], usually responds to emails about authentication problems. The DEBUG logs may be required to troubleshoot the problem, so you must enable debug logging for 'protocol-httpclient' and Jakarta Commons !HttpClient before running the crawler. To do so, open 'conf/log4j.properties' and add the following lines:
{{{

