Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

The comment on the change is:
updated for 'protocol-httpclient'

------------------------------------------------------------------------------
  Susam Pal, Infosys Technologies Limited
  
  == Necessity ==
- There were two plugins already present, viz. 'protocol-http' and 
'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, 
HTTPS and NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' 
supported HTTPS and had code for NTLM authentication but the NTLM 
authentication didn't work due to a bug. Some portions of 'protocol-httpclient' 
were written to solve these problems, provide additional features like 
authentication support for proxy server and better inline documentation for the 
properties to be used in 'nutch-site.xml' to enable 'protocol-http11' and use 
its authentication features. This is an improvement on the previous two 
plugins. The author of the authentication features has tested it in Infosys 
Technologies Limited by crawling the corporate intranet requiring NTLM 
authentication and this has been found to work well.
+ Two plugins were already present: 'protocol-http' and 'protocol-httpclient'. 
However, 'protocol-http' did not support HTTP 1.1, HTTPS, or the NTLM, Basic 
and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and 
had code for NTLM authentication, but NTLM authentication did not work because 
of a bug. Some portions of 'protocol-httpclient' were written to solve these 
problems and to provide additional features, such as authentication support 
for proxy servers and better inline documentation for the properties used in 
'nutch-site.xml' to enable 'protocol-httpclient' and its authentication 
features. The author of the authentication features tested them at Infosys 
Technologies Limited by crawling a corporate intranet that requires NTLM 
authentication, and they were found to work well.
  
  == Download ==
- Currently, this plugin is in the form of patch in JIRA. Download the patch 
from [https://issues.apache.org/jira/browse/NUTCH-557 JIRA NUTCH-557] and apply 
it to trunk.
+ Currently, these features are present in the form of a patch in JIRA. 
Download the patch from [https://issues.apache.org/jira/browse/NUTCH-559 JIRA 
NUTCH-559] and apply it to trunk.
  
  == Quick Guide ==
  This section is a quick guide to configure authentication related properties 
for 'protocol-httpclient'.
@@ -49, +49 @@

   * http.auth.realm
   * http.auth.host
  
- The explanation for these properties are similar to that of the proxy 
authentication properties. As you might have noticed, 'http.auth.host' is used 
both for proxy NTLM authentication and web server NTLM authentication. Since, 
the host at which the HTTP requests are originating are same for both, so the 
same property is used for both and two different properties were not created.
+ The explanation for these properties is similar to that of the proxy 
authentication properties. As you might have noticed, 'http.auth.host' is used 
for proxy NTLM authentication as well as web server NTLM authentication. Since 
the host from which the HTTP requests originate is the same for both, a single 
property is used rather than two separate ones.
  
  Even though the 'http.auth.host' property is required only for NTLM 
authentication, it is advisable to set it in all cases so that, if the crawler 
comes across a server that requires NTLM authentication (which you might not 
have anticipated), it can still fetch the page.
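
  As a rough illustration of the advice above, a 'nutch-site.xml' fragment 
might look like the following. This is only a sketch: 'http.auth.host' is the 
property described on this page, but the host name 'crawler.example.com' is a 
hypothetical placeholder, and the plugin list in 'plugin.includes' is an 
assumed example that should be adapted to whatever your installation already 
uses. The other authentication properties are omitted here.

```xml
<configuration>
  <!-- Assumed example plugin list: the relevant part is that
       'protocol-httpclient' is included instead of 'protocol-http'. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>

  <!-- Host from which the crawler's HTTP requests originate.
       'crawler.example.com' is a placeholder, not a real value. -->
  <property>
    <name>http.auth.host</name>
    <value>crawler.example.com</value>
  </property>
</configuration>
```

  Because the same 'http.auth.host' value serves both proxy and web server 
NTLM authentication, it needs to appear only once.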
  
