Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
re-writing document as per latest v0.5 patch

------------------------------------------------------------------------------
  'protocol-httpclient' is a protocol plugin which supports retrieving 
documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with 
Basic, Digest and NTLM authentication schemes for web servers as well as proxy 
servers.
  
  == Necessity ==
- There were two plugins already present, viz. 'protocol-http' and 
'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, 
HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' 
supported HTTPS and had code for NTLM authentication but the NTLM 
authentication didn't work due to a bug. Some portions of 'protocol-httpclient' 
were re-written to solve these problems, provide additional features like 
authentication support for proxy server and better inline documentation for the 
properties to be used in 'nutch-site.xml' to enable 'protocol-httpclient' and 
use its authentication features. The author (Susam Pal) of these features has 
tested it in Infosys Technologies Limited by crawling the corporate intranet 
requiring NTLM authentication and this has been found to work well.
+ Two protocol plugins were already present, viz. 'protocol-http' and 
'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, 
HTTPS, or the NTLM, Basic and Digest authentication schemes. 
'protocol-httpclient' supported HTTPS and had code for NTLM authentication, 
but the NTLM authentication didn't work due to a bug. Some portions of 
'protocol-httpclient' were rewritten to solve these problems and to provide 
additional features such as authentication support for proxy servers and 
better inline documentation for the properties to be used in 
'httpclient-auth.xml' to enable 'protocol-httpclient' and use its 
authentication features. The author of these features (Susam Pal) has tested 
them at Infosys Technologies Limited by crawling the corporate intranet, which 
requires NTLM authentication, and found them to work well.
  
  == Download ==
- Currently, these features are present in the form of a patch in JIRA. 
Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA 
NUTCH-559] and apply it to trunk.
+ Currently, these features are present in the form of a patch in JIRA. 
Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA 
NUTCH-559] and apply it to trunk. The latest patch is named 
[https://issues.apache.org/jira/secure/attachment/12370428/NUTCH-559v0.5.patch 
NUTCH-559v0.5.patch].
  
  == Configuration ==
- This is an advanced feature that lets the user specify different credentials 
for different authentication scopes. This section does not describe the default 
configuration. Some parts of this section might be outdated. It is better to 
read the guidelines in 'conf/httpclient-auth.xml' because they are correct. 
This section will be improved later when time permits.
+ The example and explanation provided as comments in 
'conf/httpclient-auth.xml' are very brief, so this section explains the 
configuration in more detail. It starts with a few simple examples that 
suffice for most real-life situations. Complex cases are described later in 
this article. The root element for all the examples below is 
<auth-configuration>; it has been omitted for the sake of clarity.
+ 
+ === Crawling an intranet with default authentication scope ===
+ Suppose all pages of an intranet are protected by Basic, Digest or NTLM 
authentication and a single set of credentials is to be used for all web 
pages in the intranet. In that case, the configuration below is enough. It is 
also the simplest possible configuration for authentication schemes.
+ 
+ {{{<credentials username="susam" password="masus">
+  <default/>
+ </credentials>}}}
+ 
+ The credentials specified above are sent to any page requesting 
authentication. Although it is extremely simple, the default authentication 
scope should be used with caution. Because this set of credentials is sent to 
any web page requesting authentication, a malicious user can steal the 
credentials used in the configuration by setting up a web page requiring Basic 
authentication. Therefore, it is usual to set apart credentials for crawling 
only, so that even if someone steals them, they cannot do anything harmful. 
If you are sure that all pages in the intranet use a particular authentication 
scheme, say NTLM, the situation can be improved a little in this manner.
+ 
+ {{{<credentials username="susam" password="masus">
+  <default scheme="ntlm"/>
+ </credentials>}}}
+ 
+ Thus, this set of credentials is sent only to pages requesting NTLM 
authentication. Now, nobody can set up a page requiring Basic authentication 
and steal the credentials.
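+ 
+ For reference, a complete file might look like the sketch below, which wraps 
the NTLM-only example in the <auth-configuration> root element. The element 
and attribute names are taken from the examples in this section; the XML 
declaration on the first line is an assumption and may not be required.
+ 
+ {{{<?xml version="1.0"?>
+ <auth-configuration>
+  <credentials username="susam" password="masus">
+   <default scheme="ntlm"/>
+  </credentials>
+ </auth-configuration>}}}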
  
  === Quick Guide ===
  An example of 'conf/httpclient-auth.xml' configuration is provided below:
