Ravi, you are a Nutch god.  Thank you very much for your patch.

In case anyone else happens across this thread in the future, I'd like
to record my notes for enabling BASIC authentication.

With Ravi's patch, you must replace the first argument to the new
AuthScope (hard-coded as "linuxlink.timesys.com" in the patch quoted
below) with your own host.domain.
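
To illustrate what "host.domain" means here: as far as I can tell, the
first AuthScope argument is a bare host name, with no scheme, port, or
path. The following is only a sketch using the plain JDK; the class and
method names (AuthScopeHost, hostFor) and the example path are made up
for illustration and are not part of Nutch or of Ravi's patch.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of Nutch or of the patch.
class AuthScopeHost {

    // Returns the bare host name of a URL, i.e. the kind of value that
    // goes into the first argument of new AuthScope(host, ...).
    static String hostFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in URL: " + url);
        }
        return host.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The path is a placeholder; only the host part matters.
        System.out.println(hostFor("http://linuxlink.timesys.com/some/protected/page"));
    }
}
```

Running it against one of your own seed URLs prints the string to paste
into Http.java before rebuilding.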

Build nutch with 'ant'.

Replace plugins/protocol-httpclient/protocol-httpclient.jar with
build/protocol-httpclient/protocol-httpclient.jar

Use the following in your nutch-site.xml...

<property>
  <name>http.auth.basic.username</name>
  <value>your_spider_user</value>
  <description>HTTP Basic Authentication</description>
</property>

<property>
  <name>http.auth.basic.password</name>
  <value>your_spider_password</value>
  <description>HTTP Basic Authentication</description>
</property>

Note that the property name above, 'http.auth.basic.password', differs
from the 'http.auth.basic.pass' used in other examples you might find in
the mailing list archive.  Also, the 'realm' part of the property name
is irrelevant because Ravi's patch uses AuthScope.ANY_REALM.
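
Since a misspelled property name fails silently, a quick check can help.
The sketch below uses only the JDK's XML parser and reports which of the
two property names read by the patched Http.java are absent from a
configuration string. The class name BasicAuthConfigCheck is made up for
illustration, and the sample XML is a stand-in for your real
nutch-site.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical checker, not part of Nutch.
class BasicAuthConfigCheck {

    // The two property names the patch reads via conf.get(...).
    static final String[] REQUIRED = {
        "http.auth.basic.username",
        "http.auth.basic.password"
    };

    // Returns the required property names that do not appear in the XML.
    static List<String> missing(String siteXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(siteXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
        // Collect the text of every <name> element.
        NodeList names = doc.getElementsByTagName("name");
        List<String> found = new ArrayList<>();
        for (int i = 0; i < names.getLength(); i++) {
            found.add(names.item(i).getTextContent().trim());
        }
        List<String> result = new ArrayList<>();
        for (String key : REQUIRED) {
            if (!found.contains(key)) {
                result.add(key);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample using the older '.pass' spelling, which the patch does not read.
        String xml = "<configuration>"
                + "<property><name>http.auth.basic.username</name>"
                + "<value>your_spider_user</value></property>"
                + "<property><name>http.auth.basic.pass</name>"
                + "<value>your_spider_password</value></property>"
                + "</configuration>";
        System.out.println("Missing: " + missing(xml));
    }
}
```

With the sample above, the check flags 'http.auth.basic.password' as
missing, which is exactly the mistake the older examples lead to.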

The beautiful thing is that you don't need to explicitly add protected
URLs to your urls/nutch file, because the crawler will simply attempt to
index any protected pages it finds.  So you only have to 'require user
your_spider_user' in the .htaccess files for the areas you want the
spider to crawl.  Users will then have to authenticate to reach
protected areas from the search results.
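
For reference, this is roughly what the crawler presents to the server
for a protected page.  With Basic authentication the client sends an
Authorization header built from the user name and password, and because
the patch turns on preemptive authentication, HttpClient should send it
to the scoped host without waiting for a challenge.  The sketch below
only shows how such a header value is formed; the class name
BasicAuthHeader is made up for illustration and nothing here is Nutch
code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical illustration, not part of Nutch or HttpClient.
class BasicAuthHeader {

    // Builds the Authorization header value for HTTP Basic auth:
    // "Basic " followed by base64("username:password").
    static String build(String username, String password) {
        String pair = username + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Placeholder credentials from the nutch-site.xml example.
        System.out.println("Authorization: "
                + build("your_spider_user", "your_spider_password"));
    }
}
```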

Of course, you'll want to consider eliminating caching and the summary
information from search.jsp to protect sensitive information, but that's
left as another exercise.  :)

And finally, return here and thank Ravi for his help!

Regards,
Clarence Donath


On Wednesday, 08.08.2007, at 10:16 -0400, Ravi Chintakunta wrote:
> Hi Clarence,
> 
> The properties entered in nutch-site.xml do not seem to be used
> in HttpClient. Please apply the below patch to
> nutch/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
> and that should help.
> 
> - Ravi Chintakunta
> 
> 
> 
> @@ -31,6 +31,7 @@
>  import org.apache.commons.httpclient.HttpClient;
>  import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
>  import org.apache.commons.httpclient.NTCredentials;
> +import org.apache.commons.httpclient.UsernamePasswordCredentials;
>  import org.apache.commons.httpclient.auth.AuthScope;
>  import org.apache.commons.httpclient.params.HttpConnectionManagerParams;
>  import org.apache.commons.httpclient.protocol.Protocol;
> @@ -65,6 +66,8 @@
>    String ntlmPassword = "";
>    String ntlmDomain = "";
>    String ntlmHost = "";
> +  String basicUsername = "";
> +  String basicPassword = "";
> 
>    public Http() {
>      super(LOG);
> @@ -77,6 +80,8 @@
>      this.ntlmPassword = conf.get("http.auth.ntlm.password", "");
>      this.ntlmDomain = conf.get("http.auth.ntlm.domain", "");
>      this.ntlmHost = conf.get("http.auth.ntlm.host", "");
> +    basicUsername = conf.get("http.auth.basic.username");
> +    basicPassword = conf.get("http.auth.basic.password");
>      //Level logLevel = Level.WARNING;
>      //if (conf.getBoolean("http.verbose", false)) {
>      //  logLevel = Level.FINE;
> @@ -131,6 +136,7 @@
>      if (useProxy) {
>        hostConf.setProxy(proxyHost, proxyPort);
>      }
> +    /*
>      if (ntlmUsername.length() > 0) {
>        Credentials ntCreds = new NTCredentials(ntlmUsername, ntlmPassword, ntlmHost, ntlmDomain);
>        client.getState().setCredentials(new AuthScope(ntlmHost, AuthScope.ANY_PORT), ntCreds);
> @@ -139,6 +145,11 @@
>          LOG.info("Added NTLM credentials for " + ntlmUsername);
>        }
>      }
> +    */
> +
> +    client.getParams().setAuthenticationPreemptive(true);
> +    if (LOG.isInfoEnabled()) { LOG.info("**** setting basic auth credentials ****"); }
> +    client.getState().setCredentials(new AuthScope("linuxlink.timesys.com", AuthScope.ANY_PORT, AuthScope.ANY_REALM), new UsernamePasswordCredentials(basicUsername, basicPassword));
>      if (LOG.isInfoEnabled()) { LOG.info("Configured Client"); }
>    }
>  }
