Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "HttpAuthenticationSchemes" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/HttpAuthenticationSchemes?action=diff&rev1=26&rev2=27

  When authentication is required to fetch a resource from a web-server, the 
authentication-scope is determined from the host and port obtained from the URL 
of the page. If it matches any 'authscope' in this configuration file, then the 
'credentials' for that 'authscope' are used for authentication.
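
  The matching described above can be illustrated with a minimal Python sketch. 
It is not Nutch's implementation: the scope list, the function names, the 
credential labels, and the fallback to the scheme's default port when the URL 
names no port are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical in-memory form of the 'authscope' entries (illustrative only).
AUTH_SCOPES = [
    {"host": "intranet.example.com", "port": 8080, "credentials": "alice"},
    {"host": "www.example.org", "port": 443, "credentials": "bob"},
]

# Assumption: a URL without an explicit port uses its scheme's default port.
DEFAULT_PORTS = {"http": 80, "https": 443}


def scope_for(url):
    """Return the (host, port) pair that forms the authentication scope."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port


def credentials_for(url, scopes=AUTH_SCOPES):
    """Return the credentials label of the first matching scope, or None."""
    host, port = scope_for(url)
    for scope in scopes:
        if scope["host"] == host and scope["port"] == port:
            return scope["credentials"]
    return None
```

  In this sketch a URL whose host and port match no entry yields None, which 
corresponds to no 'credentials' being selected for the request.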
  
  == Configuration ==
- Since the example and explanation provided as comments in 
'conf/httpclient-auth.xml' is very brief, therefore this section would explain 
it in a little more detail. In all the examples below, the root element 
<auth-configuration> has been omitted for the sake of clarity.
+ Since the example and explanation provided as comments in 
'conf/httpclient-auth.xml' is very brief, therefore this section would explain 
it in a little more detail. '''In all the examples below, the root element 
<auth-configuration> has been omitted for the sake of clarity'''.
  
  === Prerequisites ===
  In order to use HTTP Authentication, the Nutch crawler must be configured to 
use 'protocol-httpclient' instead of the default 'protocol-http'. To do this, 
copy the 'plugin.includes' property from 'conf/nutch-default.xml' into 
'conf/nutch-site.xml' and replace 'protocol-http' with 'protocol-httpclient' in 
the value of the property. If you have made no other changes, it should look as 
follows:
  {{{
  <property>
    <name>plugin.includes</name>
-   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
+   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
    include.  Any plugin not matching this expression is excluded.
-   In any case you need at least include the nutch-extensionpoints plugin. By
+   In any case you need at least include the nutch-extensionpoints plugin. 
+   In order to use HTTPS please enable 
-   default Nutch includes crawling just HTML and plain text via HTTP,
-   and basic indexing and search plugins. In order to use HTTPS please enable 
    protocol-httpclient, but be aware of possible intermittent problems with the
    underlying commons-httpclient library.
    </description>
@@ -131, +130 @@

  </auth-configuration>
  }}}
  
+ To check if your configuration is working, you can use the 
[[http://wiki.apache.org/nutch/bin/nutch%20parsechecker|ParserChecker]]
+ 
- To check if your configuration is working, you can use the ParserChecker:
- {{{
- ./nutch org.apache.nutch.parse.ParserChecker <your-test-URL>
- }}}
  
  It is easy to see whether it has fetched the page successfully even without 
looking into logs. If it is successful, it will display a proper page title and 
many links extracted from the page. Otherwise, it will display a title like 
“You are not authorized to view this page” and few links, if any.
   
@@ -151, +148 @@

  }}}
  
  == Need Help? ==
- If you need help, please feel free to post your question to the 
[[http://lucene.apache.org/nutch/mailing_lists.html#Users|nutch-user mailing 
list]]. The author of this work, [[http://susam.in/|Susam Pal]], usually 
responds to mails related to authentication problems. The DEBUG logs may be 
required to troubleshoot the problem. You must enable the debug logging for 
'protocol-httpclient' and Jakarta Commons !HttpClient before running the 
crawler. To enable debug logging for 'protocol-httpclient' and !HttpClient, 
open 'conf/log4j.properties' and add the following lines:
+ If you need help, please feel free to post your question to the 
[[http://nutch.apache.org/mailing_lists.html|user@nutch mailing list]]. The 
author of this work, [[http://susam.in/|Susam Pal]], usually responds to mails 
related to authentication problems. The DEBUG logs may be required to 
troubleshoot the problem. You must enable the debug logging for 
'protocol-httpclient' and Jakarta Commons !HttpClient before running the 
crawler. To enable debug logging for 'protocol-httpclient' and !HttpClient, 
open 'conf/log4j.properties' and add the following lines:
  {{{
  log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
  log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout
