Oh, and one more thing: copy the 'plugin.includes' property from nutch-default.xml to your nutch-site.xml, and replace 'protocol-http' with 'protocol-httpclient' in its value.
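
For illustration, the overridden property in nutch-site.xml might look roughly like the sketch below. The plugin list is an assumption modelled on a 0.9-era default; your nutch-default.xml may list different plugins, so copy the value from your own file and change only the protocol entry. The property name here is 'plugin.includes', which is how it appears in the nutch-default.xml files I know of.

<property>
  <name>plugin.includes</name>
  <!-- Illustrative value only: take the real list from your own
       nutch-default.xml; the one required change is
       protocol-http -> protocol-httpclient -->
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
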
On Wednesday, 08.08.2007, at 14:40 -0400, Clarence Donath wrote:
> Ravi, you are a Nutch god. Thank you very much for your patch.
>
> In case anyone else happens across this thread in the future, I'd like
> to record my notes for enabling BASIC authentication.
>
> With Ravi's patch, you must replace the first argument to the new
> AuthScope with your host.domain.
>
> Build nutch with 'ant'.
>
> Replace plugins/protocol-httpclient/protocol-httpclient.jar with
> build/protocol-httpclient/protocol-httpclient.jar
>
> Use the following in your nutch-site.xml...
>
> <property>
>   <name>http.auth.basic.username</name>
>   <value>your_spider_user</value>
>   <description>HTTP Basic Authentication</description>
> </property>
>
> <property>
>   <name>http.auth.basic.password</name>
>   <value>your_spider_password</value>
>   <description>HTTP Basic Authentication</description>
> </property>
>
> Note in the above that 'http.auth.basic.password' is different from
> other examples you might find in the mailing list archive as
> 'http.auth.basic.pass'. Also, the 'realm' part of the property is
> irrelevant because of his use of ANY_REALM.
>
> The beautiful thing is you don't need to explicitly add protected URLs
> to your urls/nutch file because the crawler will only attempt to index
> any protected pages it finds. So you only have to 'require
> your_spider_user' in the .htaccess files where you want the spider to
> crawl. Then users will have to authenticate for protected areas from
> the search results.
>
> Of course, you'll want to consider eliminating caching and the summary
> information from search.jsp to protect sensitive information, but that's
> left as another exercise. :)
>
> And finally, return here and thank Ravi for his help!
>
> Regards,
> Clarence Donath
