Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
troubleshooting tips and information to be provided while asking for help

------------------------------------------------------------------------------
  == Introduction ==
- 'protocol-httpclient' is a protocol plugin which supports retrieving documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server.
+ 'protocol-httpclient' is a protocol plugin which supports retrieving documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server. This feature can not do POST based authentication that depends on cookies. More information on this can be found at: HttpPostAuthentication

  == Necessity ==
- There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were re-written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used to configure authentication. The author (Susam Pal) of these features has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well.
+ There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes.
'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were re-written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used to configure authentication.

  == JIRA NUTCH-559 ==
  These features were submitted as [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559]. If you have checked out the latest Nutch trunk, you don't need to apply the patches. These features were included in the Nutch subversion repository in [http://svn.apache.org/viewvc?view=rev&revision=608972 revision #608972]

@@ -91, +91 @@

  'protocol-httpclient' is based on [http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons HttpClient]. Some servers support multiple schemes for authenticating users. Given that only one scheme may be used at a time for authenticating, HttpClient must choose which scheme to use. To accomplish this, it uses an order of preference to select the correct authentication scheme. By default this order is: NTLM, Digest, Basic. For more information on the behavior during authentication, you might want to read the [http://jakarta.apache.org/httpcomponents/httpclient-3.x/authentication.html HttpClient Authentication Guide].

  == Need Help? ==
- If you need help, please feel free to post your question to the [http://lucene.apache.org/nutch/mailing_lists.html#Users nutch-user mailing list].
+ If you need help, please feel free to post your question to the [http://lucene.apache.org/nutch/mailing_lists.html#Users nutch-user mailing list]. The author of this work, Susam Pal, usually responds to mails related to authentication problems. The DEBUG logs may be required to troubleshoot the problem. You must enable the debug log for 'protocol-httpclient' before running the crawler.
To enable debug log for 'protocol-httpclient', open 'conf/log4j.properties' and add the following line:
+ {{{
+ log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
+ }}}
+ It would be good to check the following things before asking for help.
+
+  1. Have you overridden the 'plugin.includes' property of 'conf/nutch-default.xml' in 'conf/nutch-site.xml' and replaced 'protocol-http' with 'protocol-httpclient'?
+  1. If you patched Nutch 0.9 source code manually with this patch, did you build the project before running the crawler?
+  1. Have you configured 'conf/httpclient-auth.xml'?
+  1. Do you see Nutch trying to fetch the pages you were expecting in 'logs/hadoop.log'? You should see some logs like "fetching http://www.example.com/expectedpage.html" where the URL is the page you were expecting to be fetched. If you don't see such lines for the pages you were expecting, the error is outside the scope of this feature. This feature comes into action only when the crawler is fetching a page but the page requires authentication.
+  1. With debug logs enabled, check whether there are logs beginning with "Credentials" in 'logs/hadoop.log'. The lines would look like "Credentials - username someuser; set ...". For every entry in 'conf/httpclient-auth.xml' you should find a corresponding log. If they are absent, you probably haven't overridden 'plugin.includes' to use 'protocol-httpclient'. If you manually patched the Nutch 0.9 source code, this can also happen when the project has not been rebuilt.
+  1. Do you see logs like this: "auth.!AuthChallengeProcessor - basic authentication scheme selected"? Instead of the word 'basic', you might see 'digest' or 'NTLM' depending on the scheme supported by the page being fetched. If you do not see it at all, probably the web server or the page being fetched does not require authentication. In that case, the crawler would not try to authenticate.
If you were expecting authentication for the page, something probably needs to be fixed on the server side.
+  1. You should also see some logs that begin with: "Pre-configured credentials with scope". It is very unlikely that these would be missing after you have ensured all the above points. If they are missing, please let us know on the mailing list.
+
+ Once you have checked the items listed above and you are still unable to fix the problem or are confused about any point listed above, please mail the issue with the following information:
+
+  1. Version of Nutch you are running.
+  1. Did you get this feature directly from subversion or did you download the patch separately and apply it?
+  1. Relevant portion of the 'logs/hadoop.log' file. If you are unsure which portion is relevant, send the complete file.
+
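
Part of the checklist above can be automated by scanning 'logs/hadoop.log' for the log fragments it quotes. The following is only a minimal sketch in Python, not part of Nutch: the function name `check_log` and the marker names are hypothetical, and it assumes the log lines contain the substrings quoted in the checklist ("fetching ...", "Credentials - username", "... authentication scheme selected", "Pre-configured credentials with scope"). The exact layout of real log lines may differ between Nutch versions.

```python
import re

# Substrings quoted in the troubleshooting checklist. The surrounding log
# layout (timestamps, logger names) is an assumption and is not matched.
MARKERS = {
    "fetching": re.compile(r"fetching\s+\S+"),
    "credentials": re.compile(r"Credentials - username"),
    "scheme_selected": re.compile(r"\w+ authentication scheme selected"),
    "preconfigured": re.compile(r"Pre-configured credentials with scope"),
}


def check_log(lines):
    """Count how many lines contain each checklist marker.

    `lines` is any iterable of strings, for example an open file object.
    Returns a dict mapping marker name to the number of matching lines.
    """
    counts = {name: 0 for name in MARKERS}
    for line in lines:
        for name, pattern in MARKERS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Hypothetical usage, run from the Nutch directory:
#
#     with open("logs/hadoop.log", encoding="utf-8", errors="replace") as f:
#         for name, count in check_log(f).items():
#             print(name, count)
```

A count of zero for a marker points at the corresponding checklist item. For example, zero "credentials" lines with debug logging enabled suggests revisiting the 'plugin.includes' and 'conf/httpclient-auth.xml' items.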