The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
re-writing document as per latest v0.5 patch

------------------------------------------------------------------------------
  'protocol-httpclient' is a protocol plugin which supports retrieving documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server.

  == Necessity ==
- There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were re-written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used in 'nutch-site.xml' to enable 'protocol-httpclient' and use its authentication features. The author (Susam Pal) of these features has tested it in Infosys Technologies Limited by crawling the corporate intranet requiring NTLM authentication and this has been found to work well.
+ There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication but the NTLM authentication didn't work due to a bug. Some portions of 'protocol-httpclient' were re-written to solve these problems, provide additional features like authentication support for proxy server and better inline documentation for the properties to be used in 'httpclient-auth.xml' to enable 'protocol-httpclient' and use its authentication features.
The author (Susam Pal) of these features has tested them at Infosys Technologies Limited by crawling a corporate intranet that requires NTLM authentication, and they have been found to work well.

  == Download ==
- Currently, these features are present in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559] and apply it to trunk.
+ Currently, these features are present in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559] and apply it to trunk. The latest patch is named [https://issues.apache.org/jira/secure/attachment/12370428/NUTCH-559v0.5.patch NUTCH-559v0.5.patch].

  == Configuration ==
- This is an advanced feature that lets the user specify different credentials for different authentication scopes. This section does not describe the default configuration. Some parts of this section might be outdated. It is better to read the guidelines in 'conf/httpclient-auth.xml' because they are correct. This section will be improved later when time permits.
+ The example and explanation provided as comments in 'conf/httpclient-auth.xml' are very concise, so this section explains them in more detail. The section starts with a few very simple examples, which should suffice for most real-life situations. Complex cases are described later in this article. The root element for all the examples below is <auth-configuration>, which has been omitted for the sake of clarity.
+ 
+ === Crawling an intranet with default authentication scope ===
+ Let's say all pages of an intranet are protected by Basic, Digest or NTLM authentication and there is only one set of credentials to be used for all web pages in the intranet. In that case, a configuration as described below is enough. It is also the simplest possible configuration for authentication schemes.
+ 
+ {{{<credentials username="susam" password="masus">
+   <default/>
+ </credentials>}}}
+ 
+ The credentials specified above would be sent to any page requesting authentication. Although this is extremely simple, the default authentication scope should be used with caution. Because this set of credentials would be sent to any web page that requests authentication, a malicious user could steal the credentials used in the configuration by setting up a web page that requires Basic authentication. Therefore, it is usual to use credentials set apart for crawling only, so that even if someone steals them, they cannot do anything harmful. If you are sure that all pages in the intranet use a particular authentication scheme, say NTLM, then the situation can be improved a little in this manner.
+ 
+ {{{<credentials username="susam" password="masus">
+   <default scheme="ntlm"/>
+ </credentials>}}}
+ 
+ Thus, this set of credentials would be sent only to pages requesting NTLM authentication. Now, nobody can set up a page requiring Basic authentication and steal the credentials.

  === Quick Guide ===
  An example of 'conf/httpclient-auth.xml' configuration is provided below: