Oh, and one more thing: copy the 'plugin.includes' property from
nutch-default.xml to your nutch-site.xml, and replace 'protocol-http'
with 'protocol-httpclient'.
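
As a rough sketch, the property in nutch-site.xml could then look like
the following (take the actual plugin list from your own
nutch-default.xml; the value below is only illustrative, and the one
change that matters is swapping protocol-http for protocol-httpclient):

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Copied from nutch-default.xml, with protocol-http
  replaced by protocol-httpclient.</description>
</property>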


On Wednesday, 08.08.2007, at 14:40 -0400, Clarence Donath wrote:
> Ravi, you are a Nutch god.  Thank you very much for your patch.
> 
> In case anyone else happens across this thread in the future, I'd like
> to record my notes for enabling BASIC authentication.
> 
> With Ravi's patch, you must replace the first argument to the new
> AuthScope with your host.domain.
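> 
> In case it helps, here is roughly what that part boils down to (a
> sketch only, against commons-httpclient 3.x; the class and method
> names below are mine, not Ravi's, and 'your.host.domain' is the first
> AuthScope argument you replace):
> 
> import org.apache.commons.httpclient.HttpClient;
> import org.apache.commons.httpclient.UsernamePasswordCredentials;
> import org.apache.commons.httpclient.auth.AuthScope;
> 
> public class BasicAuthSketch {
>   // Register BASIC credentials for one host, any port, any realm
>   // (ANY_REALM is why the realm part of the property doesn't matter).
>   public static void setBasicAuth(HttpClient client, String user, String pass) {
>     client.getState().setCredentials(
>         new AuthScope("your.host.domain", AuthScope.ANY_PORT, AuthScope.ANY_REALM),
>         new UsernamePasswordCredentials(user, pass));
>   }
> }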
> 
> Build nutch with 'ant'.
> 
> Replace plugins/protocol-httpclient/protocol-httpclient.jar with
> build/protocol-httpclient/protocol-httpclient.jar
> 
> Use the following in your nutch-site.xml...
> 
> <property>
>   <name>http.auth.basic.username</name>
>   <value>your_spider_user</value>
>   <description>HTTP Basic Authentication</description>
> </property>
> 
> <property>
>   <name>http.auth.basic.password</name>
>   <value>your_spider_password</value>
>   <description>HTTP Basic Authentication</description>
> </property>
> 
> Note in the above that the property is 'http.auth.basic.password',
> not 'http.auth.basic.pass' as in some other examples you might find
> in the mailing list archive.  Also, the 'realm' part of the property
> is irrelevant because the patch uses ANY_REALM.
> 
> The beautiful thing is you don't need to explicitly add protected URLs
> to your urls/nutch file, because the crawler will simply attempt to
> index any protected pages it finds.  So you only have to 'require
> your_spider_user' in the .htaccess files where you want the spider to
> crawl (a sketch follows below).  Then users will have to authenticate
> for protected areas they reach from the search results.
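> 
> For reference, a minimal .htaccess sketch for such a protected area
> might look like this (the AuthName and htpasswd path are placeholders;
> the important part is that the required user matches
> http.auth.basic.username):
> 
> AuthType Basic
> AuthName "Protected area"
> AuthUserFile /path/to/.htpasswd
> Require user your_spider_user
> # ...plus whatever Require lines your real users need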
> 
> Of course, you'll want to consider eliminating caching and the summary
> information from search.jsp to protect sensitive information, but that's
> left as another exercise.  :)
> 
> And finally, return here and thank Ravi for his help!
> 
> Regards,
> Clarence Donath
