Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
removed conf/nutch-site.xml conf

------------------------------------------------------------------------------
  == Download ==
  Currently, these features are present in the form of a patch in JIRA. 
Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA 
NUTCH-559] and apply it to trunk.
  
+ == Configuration ==
+ This is an advanced feature that lets the user specify different credentials 
for different authentication scopes. This section does not describe the default 
configuration. Some parts of this section might be outdated. It is better to 
read the guidelines in 'conf/httpclient-auth.xml' because they are correct. 
This section will be improved later when time permits.
- == Common Credentials Configuration ==
- This is the simplest possible configuration which involves setting just one 
set of credentials. It is useful in trusted Intranets where all sites are 
trusted and require the same username/password for authentication.
- 
- === Quick Guide ===
-  1. Include 'protocol-httpclient' in 'plugin.includes'.
-  1. For basic or digest authentication in proxy server, set 
'http.proxy.username' and 'http.proxy.password'. Also, set 'http.proxy.realm' 
if you want to specify a realm      as the authentication scope.
-  1. For NTLM authentication in proxy server, set 'http.proxy.username', 
'http.proxy.password', 'http.proxy.realm' and 'http.auth.host'. 
'http.proxy.realm' is the NTLM domain name. 'http.auth.host' is the host where 
the crawler is running.
-  1. For basic or digest authentication in web servers, set 
'http.auth.username' and 'http.auth.password'. Also, set 'http.auth.realm' if 
you want to specify a realm as the authentication scope.
-  1. For NTLM authentication in proxy server, set 'http.auth.username', 
'http.auth.password', 'http.auth.realm' and 'http.auth.host'. 'http.auth.realm' 
is the NTLM domain name. 'http.auth.host' is the host where the crawler is 
running.
- 
- This is explained in details in the following section.
- 
- === Details ===
- To use 'protocol-httpclient', 'conf/nutch-site.xml has to be edited to 
include some properties which is explained in this section. First and foremost, 
to enable the plugin, this plugin must be added in the 'plugin.includes' of 
'nutch-site.xml'. So, this property would typically look like:-
- 
- {{{<property>
-   <name>plugin.includes</name>
-   <value>protocol-httpclient|urlfilter-regex|...</value>
-   <description>...</description>
- </property>}}}
- 
- (... indicates a long line truncated)
- 
- Next, if authentication is required for proxy server, the following 
properties need to be set in 'conf/nutch-site.xml'.
- 
-  * http.proxy.username
-  * http.proxy.password
-  * http.proxy.realm (If a realm needs to be provided. In case of NTLM 
authentication, the domain name should be provided as its value.)
-  * http.auth.host (This is required in case of NTLM authentication only. This 
is the host where the crawler would be running.)
- 
- If the web servers of the intranet are in a particular domain or realm and 
requires authentication, these properties should be set in 
'conf/nutch-site.xml'.
- 
-  * http.auth.username
-  * http.auth.password
-  * http.auth.realm
-  * http.auth.host
- 
- The explanation for these properties are similar to that of the proxy 
authentication properties. As you might have noticed, 'http.auth.host' is used 
for proxy NTLM authentication as well as web server NTLM authentication. Since, 
the host at which the HTTP requests are originating are same for both, so the 
same property is used for both and two different properties were not created.
- 
- Even though, the 'http.auth.host' property is required only for NTLM 
authentication, it is advisable to set this for all cases, because, in case the 
crawler comes across a server which requires NTLM authentication (which you 
might not have anticipated), the crawler can still fetch the page.
- 
- == Authentication Scope Specific Credentials ==
- This is an advanced feature that lets the user specify different credentials 
for different authentication scopes.
  
  === Quick Guide ===
  An example of 'conf/httpclient-auth.xml' configuration is provided below:
@@ -98, +57 @@

  
  The 'realm' attribute is optional in <authscope> tag and it can be omitted if 
you want the credentials to be used for all realms on a particular web-server 
(or all remaining realms as shown in the Quick Guide section above). One 
authentication scope should not be defined twice as different <authscope> tags 
for different <credentials> tag. However, if this is done by mistake, the 
credentials for the last defined <authscope> tag would be used. This is 
because, the XML parsing code, reads the file from top to bottom and sets the 
credentials for authentication-scopes. If the same authentication scope is 
encountered once again, it will be overwritten with the new credentials. 
However, one should not rely on this behavior as this might change with further 
developments.
  
- 
  == Underlying HttpClient Library ==
  'protocol-httpclient' is based on 
[http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons 
HttpClient]. Some servers support multiple schemes for authenticating users. 
Given that only one scheme may be used at a time for authenticating, it must 
choose which scheme to use. To accompish this, it uses an order of preference 
to select the correct authentication scheme. By default this order is: NTLM, 
Digest, Basic. For more information on the behavior during authentication, you 
might want to read the 
[http://jakarta.apache.org/httpcomponents/httpclient-3.x/authentication.html 
HttpClient Authentication Guide].
  

Reply via email to