Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "HttpAuthenticationSchemes" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/HttpAuthenticationSchemes?action=diff&rev1=26&rev2=27

  When authentication is required to fetch a resource from a web-server, the authentication-scope is determined from the host and port obtained from the URL of the page. If it matches any 'authscope' in this configuration file, then the 'credentials' for that 'authscope' is used for authentication.
  == Configuration ==
- Since the example and explanation provided as comments in 'conf/httpclient-auth.xml' is very brief, therefore this section would explain it in a little more detail. In all the examples below, the root element <auth-configuration> has been omitted for the sake of clarity.
+ Since the example and explanation provided as comments in 'conf/httpclient-auth.xml' is very brief, therefore this section would explain it in a little more detail. '''In all the examples below, the root element <auth-configuration> has been omitted for the sake of clarity'''.
  === Prerequisites ===
  In order to use HTTP Authentication, the Nutch crawler must be configured to use 'protocol-httpclient' instead of the default 'protocol-http'. To do this copy 'plugin.includes' property from 'conf/nutch-default.xml' into 'conf/nutch-site.xml'. Replace 'protocol-http' with 'protocol-httpclient' in the value of the property. If you have made no other changes it should look as follows:
  {{{
  <property>
  <name>plugin.includes</name>
- <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
+ <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded.
- In any case you need at least include the nutch-extensionpoints plugin. By
+ In any case you need at least include the nutch-extensionpoints plugin.
+ In order to use HTTPS please enable
- default Nutch includes crawling just HTML and plain text via HTTP,
- and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.
  </description>
@@ -131, +130 @@
  </auth-configuration>
  }}}
+ To check if your configuration is working, you can use the [[http://wiki.apache.org/nutch/bin/nutch%20parsechecker|ParserChecker]]
+
- To check if your configuration is working, you can use the ParserChecker:
- {{{
- ./nutch org.apache.nutch.parse.ParserChecker <your-test-URL>
- }}}
  It is easy to see whether it has fetched the page successfully even without looking into logs. If it is successful, it will display a proper page title and many links extracted from the page. Otherwise, it will display a title like “You are not authorized to view this page” and few links, if any.
@@ -151, +148 @@
  }}}
  == Need Help? ==
- If you need help, please feel free to post your question to the [[http://lucene.apache.org/nutch/mailing_lists.html#Users|nutch-user mailing list]]. The author of this work, [[http://susam.in/|Susam Pal]], usually responds to mails related to authentication problems. The DEBUG logs may be required to troubleshoot the problem. You must enable the debug logging for 'protocol-httpclient' and Jakarta Commons !HttpClient before running the crawler. To enable debug logging for 'protocol-httpclient' and !HttpClient, open 'conf/log4j.properties' and add the following lines:
+ If you need help, please feel free to post your question to the [[http://nutch.apache.org/mailing_lists.html|user@nutch mailing list]]. The author of this work, [[http://susam.in/|Susam Pal]], usually responds to mails related to authentication problems.
The DEBUG logs may be required to troubleshoot the problem. You must enable the debug logging for 'protocol-httpclient' and Jakarta Commons !HttpClient before running the crawler. To enable debug logging for 'protocol-httpclient' and !HttpClient, open 'conf/log4j.properties' and add the following lines:
  {{{
  log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
  log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout

