Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
role of http.agent.host in NTLM and patch committed

------------------------------------------------------------------------------
  == Necessity ==
  There were two plugins already present, viz. 'protocol-http' and 
'protocol-httpclient'. However, 'protocol-http' did not support HTTP 1.1, 
HTTPS, or the NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' 
supported HTTPS and had code for NTLM authentication, but the NTLM 
authentication didn't work due to a bug. Some portions of 'protocol-httpclient' 
were rewritten to solve these problems and to provide additional features such as 
authentication support for proxy servers and better inline documentation for the 
properties used to configure authentication. The author of these features 
(Susam Pal) has tested them at Infosys Technologies Limited by crawling a 
corporate intranet requiring NTLM authentication, and they have been found to 
work well.
  
- == Download ==
- Currently, these features are present in the form of a patch in JIRA. 
Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA 
NUTCH-559] and apply it to trunk. The latest patch is named as 
[https://issues.apache.org/jira/secure/attachment/12370428/NUTCH-559v0.5.patch 
NUTCH-559v0.5.patch].
+ == JIRA NUTCH-559 ==
+ These features were submitted as 
[https://issues.apache.org/jira/browse/NUTCH-559 JIRA NUTCH-559]. 
If you have checked out the latest Nutch trunk, you don't need to apply the 
patch. These features were committed to the Nutch Subversion repository in 
[http://svn.apache.org/viewvc?view=rev&revision=608972 revision #608972].
  
  == Introduction to Authentication Scope ==
  Different credentials for different authentication scopes can be configured 
in 'conf/httpclient-auth.xml'. If a set of credentials is configured for a 
particular authentication scope (i.e. particular host, port number, realm 
and/or scheme), then that set of credentials would be sent only to pages 
falling under the specified authentication scope.
@@ -82, +82 @@

   1. For the <authscope> tag, the 'host' and 'port' attributes should always be 
specified. The 'realm' and 'scheme' attributes may or may not be specified 
depending on your needs. If you are tempted to omit the 'host' and 'port' 
attributes because you want the credentials to be used for any host and any 
port for that realm/scheme, please use the 'default' tag instead. That's what 
the 'default' tag is meant for.
   1. One authentication scope should not be defined twice as different 
<authscope> tags for different <credentials> tags. However, if this is done by 
mistake, the credentials for the last defined <authscope> tag will be used. 
This is because the XML parsing code reads the file from top to bottom and 
sets the credentials for authentication scopes. If the same authentication 
scope is encountered again, it is overwritten with the new 
credentials. However, one should not rely on this behavior as it might change 
with further development.
   1. Do not define multiple authscope tags with the same host, port but 
different realms if the server requires NTLM authentication. This means 
there should not be multiple tags with same host, port, scheme="NTLM" but 
different realms. If you are omitting the scheme attribute and the server 
requires NTLM authentication, then there should not be multiple tags with same 
host, port but different realms. This is discussed more in the next section.
+  1. If you are using the NTLM scheme, you should also set the 'http.agent.host' 
property in conf/nutch-site.xml.
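
As a minimal sketch of the last item, the property can be set in conf/nutch-site.xml roughly as follows. The value 'crawler01.example.com' is a made-up placeholder, not something taken from this page; substitute the name or IP address of the machine on which the crawler runs.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Hypothetical value: replace with the crawler machine's own host name or IP address -->
  <property>
    <name>http.agent.host</name>
    <value>crawler01.example.com</value>
  </property>
</configuration>
```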
  
  === A note on NTLM domains ===
- NTLM does not use the concept of realms. Therefore, multiple realms for a 
web-server can not be defined as different authentication scopes for the same 
web-server requiring NTLM authentication. There should be exactly one authscope 
tag for NTLM scheme authentication scope for a particular web-server. The 
authentication domain should be specified as the value of the 'realm' attribute.
+ NTLM does not use the concept of realms. Therefore, multiple realms for a 
web-server can not be defined as different authentication scopes for the same 
web-server requiring NTLM authentication. There should be exactly one authscope 
tag for NTLM scheme authentication scope for a particular web-server. The 
authentication domain should be specified as the value of the 'realm' 
attribute. NTLM authentication also requires the name or IP address of the host 
on which the crawler is running. Thus, 'http.agent.host' should be set properly.
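
The following is a sketch of how a single NTLM scope might look in conf/httpclient-auth.xml. The <credentials> and <authscope> tags and the 'host', 'port', 'realm' and 'scheme' attributes are the ones described on this page; the root element name, the 'username' and 'password' attribute names, and every value shown are assumptions for illustration, so check them against the sample file shipped with the plugin.

```xml
<?xml version="1.0"?>
<!-- Root element name and username/password attribute names are assumed, not taken from this page -->
<auth-configuration>
  <credentials username="crawleruser" password="changeme">
    <!-- Exactly one NTLM scope for this web server; 'realm' carries the NTLM domain -->
    <authscope host="intranet.example.com" port="80" realm="EXAMPLEDOMAIN" scheme="NTLM"/>
  </credentials>
</auth-configuration>
```

Keeping one <authscope> per host and port for NTLM avoids the situation described above, where several tags that differ only in realm would compete for the same web server.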
  
  == Underlying HttpClient Library ==
  'protocol-httpclient' is based on 
[http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons 
HttpClient]. Some servers support multiple schemes for authenticating users. 
Given that only one scheme may be used at a time for authenticating, HttpClient must 
choose which scheme to use. To accomplish this, it uses an order of preference 
to select the correct authentication scheme. By default this order is: NTLM, 
Digest, Basic. For more information on the behavior during authentication, you 
might want to read the 
[http://jakarta.apache.org/httpcomponents/httpclient-3.x/authentication.html 
HttpClient Authentication Guide].
