The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
updated as per v0.5 patch

------------------------------------------------------------------------------
  'protocol-httpclient' is a protocol plugin which supports retrieving 
documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with 
Basic, Digest and NTLM authentication schemes for web server as well as proxy 
server.
  
  == Necessity ==
- There were two plugins already present, viz. 'protocol-http' and 
'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, 
HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' 
supported HTTPS and had code for NTLM authentication but the NTLM 
authentication didn't work due to a bug. Some portions of 'protocol-httpclient' 
were re-written to solve these problems, provide additional features like 
authentication support for proxy server and better inline documentation for the 
properties to be used in 'httpclient-auth.xml' to enable 'protocol-httpclient' 
and use its authentication features. The author (Susam Pal) of these features 
has tested it in Infosys Technologies Limited by crawling the corporate 
intranet requiring NTLM authentication and this has been found to work well.
+ There were two plugins already present, viz. 'protocol-http' and 
'protocol-httpclient'. However, 'protocol-http' did not support HTTP 1.1, 
HTTPS, or the NTLM, Basic and Digest authentication schemes. 
'protocol-httpclient' supported HTTPS and had code for NTLM authentication, 
but the NTLM authentication didn't work due to a bug. Some portions of 
'protocol-httpclient' were rewritten to solve these problems and to provide 
additional features such as authentication support for proxy servers and 
better inline documentation for the properties used to configure 
authentication. The author of these features (Susam Pal) has tested them at 
Infosys Technologies Limited by crawling the corporate intranet requiring NTLM 
authentication, and they were found to work well.
  
  == Download ==
  Currently, these features are present in the form of a patch in JIRA. 
Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA 
NUTCH-559] and apply it to trunk. The latest patch is 
[https://issues.apache.org/jira/secure/attachment/12370428/NUTCH-559v0.5.patch 
NUTCH-559v0.5.patch].
  
+ == Introduction to Authentication Scope ==
+ Different credentials for different authentication scopes can be configured 
in 'conf/httpclient-auth.xml'. If a set of credentials is configured for a 
particular authentication scope (i.e. particular host, port number, realm 
and/or scheme), then that set of credentials would be sent only to pages 
falling under the specified authentication scope.
+   
+ When authentication is required to fetch a resource from a web server, the 
authentication scope is determined from the host and port obtained from the URL 
of the page. If it matches any 'authscope' in this configuration file, then the 
'credentials' for that 'authscope' are used for authentication.
+ 
  == Configuration ==
- Since the example and explanation provided as comments in 
'conf/httpclient-auth.xml' is very crisp, therefore this section would explain 
it in more details. The section starts with a few very simple examples which 
would suffice for most real life situations. Complex cases are described later 
in this article. The root element is <auth-configuration> for all the examples 
below which has been omitted for the sake of clarity.
+ Since the examples and explanations provided as comments in 
'conf/httpclient-auth.xml' are very brief, this section explains them in a 
little more detail. The root element for all the examples below is 
<auth-configuration>; it has been omitted for the sake of clarity.
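
For reference, a complete 'conf/httpclient-auth.xml' might look like the sketch 
below once the root element is restored. It only wraps one of the credential 
sets used in this article; the host, port, realm, username and password are the 
article's illustrative values, not defaults shipped with Nutch.

```xml
<!-- Sketch of a full conf/httpclient-auth.xml; all values are illustrative. -->
<auth-configuration>
  <!-- One set of credentials, sent only to the scope listed inside it. -->
  <credentials username="susam" password="masus">
    <authscope host="example" port="8080" realm="blogs"/>
  </credentials>
</auth-configuration>
```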
  
- === Crawling an intranet with default authentication scope ===
+ === Crawling an Intranet with Default Authentication Scope ===
  Suppose all pages of an intranet are protected by Basic, Digest or NTLM 
authentication and there is only one set of credentials to be used for all web 
pages in the intranet. Then the configuration below is enough. This is also 
the simplest possible configuration for authentication schemes.
  
+ {{{
- {{{<credentials username="susam" password="masus">
+ <credentials username="susam" password="masus">
   <default/>
- </credentials>}}}
+ </credentials>
+ }}}
  
- The credentials specified above would be sent to any page requesting 
authentication. Though it is extremely simple, default authentication scope 
should be used with caution. This set of credentials would be sent to any 
web-page requesting for authentication and therefore, a malicious user can 
steal the credentials used in the configuration by setting up a web-page 
requiring Basic authentication. Therefore, we usually use credentials set apart 
for crawling only so that even if a user steals the credentials, he wouldn't be 
able to do anything harmful. If you are sure, that all pages in the intranet 
use a particular authentication scheme, say, NTLM, then this situation can be 
improved a little in this manner.
+ The credentials specified above would be sent to any page requesting 
authentication. Though it is extremely simple, the default authentication scope 
should be used with caution. Because this set of credentials would be sent to 
any web page requesting authentication, a malicious user can steal the 
credentials used in the configuration by setting up a web page requiring Basic 
authentication. Therefore, we usually use credentials set apart for crawling 
only, so that even if a user steals the credentials, they cannot do anything 
harmful. If you are sure that all pages in the intranet use a particular 
authentication scheme, say, NTLM, then this situation can be improved a little 
in this manner.
- 
- {{{<credentials username="susam" password="masus">
-  <default scheme="ntlm"/>
- </credentials>}}}
- 
- Thus, these set of credentials would be sent to pages requesting NTLM 
authentication only. Now, one can not set up a page requiring Basic 
authentication and steal the credentials.
- 
- === Quick Guide ===
- An example of 'conf/httpclient-auth.xml' configuration is provided below:
  
  {{{
- <auth-configuration>
-   <credentials username="susam" password="masus">
+ <credentials username="susam" password="masus">
+  <default scheme="ntlm"/>
-     <authscope host="192.168.101.33" port="80" realm="login"/>
-     <authscope host="example" port="8080" realm="blogs"/>
-     <authscope host="example" port="8080" realm="wiki"/>
-   </credentials>
+ </credentials>
-   <credentials username="admin" password="nimda">
-     <authscope host="example" port="8080"/>
-   </credentials>
- </auth-configuration>
  }}}
  
- If a page from '192.168.101.33:80' requests authentication for 'login' realm, 
the first set of credentials is used for authentication. If a page from 
'example:8080' requests authentication for 'blogs' or 'wiki' realms the first 
set of credentials is used. For all other realms in 'example:8080', the second 
set of credentials is used.
+ Thus, this set of credentials would be sent only to pages requesting NTLM 
authentication. Now, one cannot set up a page requiring Basic authentication 
and steal the credentials. NTLM is safer because the password is not sent in 
clear text or in a form from which the original password can be recovered 
directly.
  
+ === Credentials for Specific Authentication Scopes ===
+ The following is an example that shows how two sets of credentials have been 
defined for different authentication scopes. 
+ For all pages of example:8080 requiring authentication in the 'blogs' or 
'wiki' realm, the first set of credentials would be used. 
- The 'http.auth.host' property must be set in 'conf/nutch-site.xml' because it 
is used for authentication scope specific authentication too. Other 'http.auth' 
properties in 'conf/nutch-site.xml' may be left blank if you do now want to set 
common credentials.
- 
- === Details ===
- Different credentials for different authentication scopes can be configured 
in 'conf/httpclient-auth.xml'. If a set of credentials is configured for a 
particular authentication scope (i.e. particular host, port number and/or 
realm), then that set of credentials would be sent only to servers falling 
under the specified authentication scope.
-   
- When authentication is required to fetch a resource from a web-server, the 
authentication-scope is determined from the host and port obtained from the URL 
of the page. If it matches any 'authscope' in this configuration file, then the 
'credentials' for that 'authscope' is used for authentication. Otherwise, the 
common authentication details mentioned in the Nutch configuration file is used.
- 
- If there are several pages having different authentication realms on the same 
web-server (i.e. same host and port, but different realms), and credentials for 
one or more of the realms is specified in this file, then Nutch would 
completely ignore the common credentials in Nutch configuration file for that 
web-server (i.e. for that host and port). So, credentials to handle all realms 
for that server may be specified in this file.
- 
- Let's assume some credentials are set in 'conf/nutch-site.xml' and 
'conf/httpclient-auth.xml' has only one entry as follows:=
  
  {{{
-   <credentials username="susam" password="masus">
+ <credentials username="susam" password="masus">
-     <authscope host="192.168.101.33" port="80" realm="login"/>
-     <authscope host="example" port="8080" realm="blogs"/>
+   <authscope host="example" port="8080" realm="blogs"/>
-     <authscope host="example" port="8080" realm="wiki"/>
+   <authscope host="example" port="8080" realm="wiki"/>
-   </credentials>
+ </credentials>
+ <credentials username="admin" password="nimda">
+   <default/>
+ </credentials>
  }}}
  
- If a page, say, 'http://192.168.101.33/index.jsp' requires authentication, 
the above credentials would be used.
+ However, an important thing to note here is that if some page of example:8080 
requires authentication in another realm, say, 'mail', authentication would not 
be done even though the second set of credentials is defined as default. Of 
course, this doesn't affect authentication for other web servers; the default 
authscope would still be used for them. This problem occurs only for those web 
servers which have authentication scopes defined for a few selected 
realms/schemes. This is discussed in the next section.
  
- The above credentials would be used if a page, say, 
'http://example:8080/index.jsp' requires authentication for "blogs" realm or 
doesn't have realm information in the HTTP response. However, if it requires 
authentication for "main" realm, authentication would fail since no credentials 
have been defined for this particular scope. The common credentials in 
'conf/nutch-site.xml' would not be used because if there is atleast one 
<authscope> tag for a particular host:port combination, then 
'conf/nutch-site.xml' is not consulted for any requests from the same 
web-server (i.e same host:port).
+ === Catch-all Authentication Scope for a Web Server ===
+ When one or more authentication scopes are defined for a particular web 
server (host:port), the default credentials are ignored for that host:port 
combination. Therefore, a catch-all authentication scope to handle all other 
realms and schemes must be specified explicitly as shown below.
  
- If a page, say, 'http://192.168.101.34/index.jsp' requires authentication, 
then the common credentials would be used since there is no credential defined 
for this scope.
+ {{{
+ <credentials username="susam" password="masus">
+   <authscope host="example" port="8080" realm="blogs"/>
+   <authscope host="example" port="8080" realm="wiki"/>
+ </credentials>
+ <credentials username="admin" password="nimda">
+   <default/>
+   <authscope host="example" port="8080"/>
+ </credentials>
+ }}}
  
+ The last authscope tag for example:8080 acts as the catch-all authentication 
scope. In this section, realms were used to demonstrate the idea. The same 
holds true for schemes. In the following example, the last authscope tag is 
necessary if the second set of credentials must be used for all pages of 
example:8080 not belonging to the authentication scope defined in the first 
tag.
+ 
+ {{{
+ <credentials username="susam" password="masus">
+   <authscope host="example" port="8080" realm="blogs" scheme="DIGEST"/>
+ </credentials>
+ <credentials username="admin" password="nimda">
+   <default/>
+   <authscope host="example" port="8080"/>
+ </credentials>
+ }}}
+ 
+ === Important Points ===
+  1. For the <authscope> tag, the 'host' and 'port' attributes should always be 
specified. The 'realm' and 'scheme' attributes may or may not be specified 
depending on your needs. If you are tempted to omit the 'host' and 'port' 
attributes because you want the credentials to be used for any host and any 
port for that realm/scheme, please use the 'default' tag instead. That's what 
the 'default' tag is meant for.
- The 'realm' attribute is optional in <authscope> tag and it can be omitted if 
you want the credentials to be used for all realms on a particular web-server 
(or all remaining realms as shown in the Quick Guide section above). One 
authentication scope should not be defined twice as different <authscope> tags 
for different <credentials> tag. However, if this is done by mistake, the 
credentials for the last defined <authscope> tag would be used. This is 
because, the XML parsing code, reads the file from top to bottom and sets the 
credentials for authentication-scopes. If the same authentication scope is 
encountered once again, it will be overwritten with the new credentials. 
However, one should not rely on this behavior as this might change with further 
developments.
+  1. One authentication scope should not be defined twice as different 
<authscope> tags under different <credentials> tags. However, if this is done 
by mistake, the credentials for the last defined <authscope> tag would be used. 
This is because the XML parsing code reads the file from top to bottom and 
sets the credentials for authentication scopes. If the same authentication 
scope is encountered once again, it will be overwritten with the new 
credentials. However, one should not rely on this behavior as it might change 
with further development.
+  1. Do not define multiple authscope tags with the same host and port but 
different realms if the server requires NTLM authentication. This means there 
should not be multiple tags with the same host, port and scheme="NTLM" but 
different realms. If you are omitting the scheme attribute and the server 
requires NTLM authentication, then there should not be multiple tags with the 
same host and port but different realms. This is discussed more in the next 
section.
+ 
+ === A note on NTLM domains ===
+ NTLM does not use the concept of realms. Therefore, multiple realms for a 
web server cannot be defined as different authentication scopes for the same 
web server requiring NTLM authentication. There should be exactly one authscope 
tag with the NTLM scheme for a particular web server. The authentication domain 
should be specified as the value of the 'realm' attribute.
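
As an illustration of the rule above, an NTLM scope for a single web server 
could be written roughly as follows. This is only a sketch: the domain name 
'EXAMPLEDOMAIN' is a hypothetical placeholder, and the host, port and 
credentials are the same illustrative values used elsewhere in this article.

```xml
<!-- Exactly one authscope for the NTLM-protected server example:8080. -->
<!-- 'EXAMPLEDOMAIN' is a made-up NTLM domain, passed in the realm attribute. -->
<credentials username="susam" password="masus">
  <authscope host="example" port="8080" realm="EXAMPLEDOMAIN" scheme="NTLM"/>
</credentials>
```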
  
  == Underlying HttpClient Library ==
  'protocol-httpclient' is based on 
[http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons 
HttpClient]. Some servers support multiple schemes for authenticating users. 
Given that only one scheme may be used at a time for authenticating, HttpClient 
must choose which scheme to use. To accomplish this, it uses an order of 
preference to select the authentication scheme. By default this order is: NTLM, 
Digest, Basic. For more information on the behavior during authentication, you 
might want to read the 
[http://jakarta.apache.org/httpcomponents/httpclient-3.x/authentication.html 
HttpClient Authentication Guide].
