Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by susam:
http://wiki.apache.org/nutch/protocol-http11

The comment on the change is:
content moved to HttpAuthenticationSchemes

------------------------------------------------------------------------------
+ protocol-http11 has been converted to a patch for protocol-httpclient as per the discussion held at [https://issues.apache.org/jira/browse/NUTCH-557 JIRA NUTCH-557].
- == Introduction ==
- 'protocol-http11' is a protocol plugin which supports retrieving documents via the HTTP 1.0, HTTP 1.1 and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web servers as well as proxy servers.
+ Therefore, the content of this page has been moved to HttpAuthenticationSchemes.
- == Author ==
- Susam Pal, Infosys Technologies Limited
- == Necessity ==
- There were two plugins already present, viz. 'protocol-http' and 'protocol-httpclient'. However, 'protocol-http' could not support HTTP 1.1, HTTPS, or the NTLM, Basic and Digest authentication schemes. 'protocol-httpclient' supported HTTPS and had code for NTLM authentication, but the NTLM authentication didn't work due to a bug. 'protocol-http11' was written to solve these problems and to provide additional features such as authentication support for proxy servers and better inline documentation for the properties used in 'nutch-site.xml' to enable 'protocol-http11' and its authentication features. This is an improvement on the previous two plugins. The author of this plugin has tested it at Infosys Technologies Limited by crawling the corporate intranet, which requires NTLM authentication, and found it to work well. The name 'protocol-http11' was chosen because 'HTTP 1.1' is a valid protocol name.
-
- == Download ==
- Currently, this plugin is in the form of a patch in JIRA. Download the patch from [https://issues.apache.org/jira/browse/NUTCH-557 JIRA NUTCH-557] and apply it to trunk.
-
- == Quick Guide ==
- This section is a quick guide to configuring the authentication-related properties for 'protocol-http11'.
-
-  1. Include 'protocol-http11' in 'plugin.includes'.
-  1. For Basic or Digest authentication with a proxy server, set 'http.proxy.username' and 'http.proxy.password'. Also set 'http.proxy.realm' if you want to specify a realm as the authentication scope.
-  1. For NTLM authentication with a proxy server, set 'http.proxy.username', 'http.proxy.password', 'http.proxy.realm' and 'http.auth.host'. 'http.proxy.realm' is the NTLM domain name. 'http.auth.host' is the host where the crawler is running.
-  1. For Basic or Digest authentication with web servers, set 'http.auth.username' and 'http.auth.password'. Also set 'http.auth.realm' if you want to specify a realm as the authentication scope.
-  1. For NTLM authentication with web servers, set 'http.auth.username', 'http.auth.password', 'http.auth.realm' and 'http.auth.host'. 'http.auth.realm' is the NTLM domain name. 'http.auth.host' is the host where the crawler is running.
-  1. It is recommended that 'http.useHttp11' be set to true.
-
- This is explained in a little more detail in the next section.
-
- == Nutch Configuration ==
- To use 'protocol-http11', 'conf/nutch-site.xml' has to be edited to include some properties, which are explained in this section. First and foremost, to enable the plugin, it must be added to the 'plugin.includes' property of 'nutch-site.xml'. This property would typically look like:
-
- {{{<property>
-   <name>plugin.includes</name>
-   <value>protocol-http11|urlfilter-regex|...</value>
-   <description>...</description>
- </property>}}}
-
- (... indicates truncation)
-
- It is recommended that HTTP 1.1 be enabled.
-
- {{{<property>
-   <name>http.useHttp11</name>
-   <value>true</value>
-   <description>...</description>
- </property>}}}
-
- Next, if authentication is required for the proxy server, the following properties need to be set in 'conf/nutch-site.xml'.
-
-  * http.proxy.username
-  * http.proxy.password
-  * http.proxy.realm (If a realm needs to be provided. In the case of NTLM authentication, the domain name should be provided as its value.)
-  * http.auth.host (This is required for NTLM authentication only. This is the host where the crawler would be running.)
-
- If the web servers of the intranet are in a particular domain or realm and require authentication, these properties should be set in 'conf/nutch-site.xml'.
-
-  * http.auth.username
-  * http.auth.password
-  * http.auth.realm
-  * http.auth.host
-
- The explanation for these properties is similar to that of the proxy authentication properties. As you might have noticed, 'http.auth.host' is used for both proxy NTLM authentication and web server NTLM authentication. Since the host from which the HTTP requests originate is the same in both cases, a single property is used rather than two separate ones.
-
- Even though the 'http.auth.host' property is required only for NTLM authentication, it is advisable to set it in all cases, so that if the crawler comes across a server which requires NTLM authentication (which you might not have anticipated), it can still fetch the page.
-
- == Underlying HttpClient Library ==
- 'protocol-http11' is based on [http://jakarta.apache.org/httpcomponents/httpclient-3.x/ Jakarta Commons HttpClient]. Some servers support multiple schemes for authenticating users. Given that only one scheme may be used at a time for authenticating, HttpClient must choose which scheme to use. To accomplish this, it uses an order of preference to select the authentication scheme. By default this order is: NTLM, Digest, Basic. For more information on the behavior during authentication, you might want to read the [http://jakarta.apache.org/httpcomponents/httpclient-3.x/authentication.html HttpClient Authentication Guide].
-
- == Need Help? ==
- If you need help, please feel free to post your question to the [http://lucene.apache.org/nutch/mailing_lists.html#Users nutch-user mailing list].
-
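
As an illustration of the removed Quick Guide, the properties for NTLM authentication with a proxy server might be set in 'conf/nutch-site.xml' roughly as follows. This is only a sketch: the property names are those listed on the removed page, but every value shown (crawler, secret, EXAMPLEDOMAIN, crawlhost.example.com) is a placeholder assumed for illustration and does not come from the original page.

{{{<!-- Placeholder values only; replace with your own credentials and hosts. -->
<property>
  <name>http.proxy.username</name>
  <value>crawler</value>
</property>
<property>
  <name>http.proxy.password</name>
  <value>secret</value>
</property>
<property>
  <!-- For NTLM, the realm property holds the NTLM domain name. -->
  <name>http.proxy.realm</name>
  <value>EXAMPLEDOMAIN</value>
</property>
<property>
  <!-- The host where the crawler is running; needed for NTLM only. -->
  <name>http.auth.host</name>
  <value>crawlhost.example.com</value>
</property>}}}

For Basic or Digest authentication, the removed page indicates that only the username and password properties are needed, with the realm property being optional.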