Switching from protocol-http to protocol-httpclient will help in
crawling secured sites (https).

If your site supports HTTP Basic authentication, then you can modify
the HTTP class in the protocol-httpclient plugin.

These are minor changes in the configureClient method:

    // Required if your site does not issue an authentication challenge.
    client.getParams().setAuthenticationPreemptive(true);

    client.getState().setCredentials(
        new AuthScope("site.com", AuthScope.ANY_PORT, AuthScope.ANY_REALM),
        new UsernamePasswordCredentials(username, password));

Replace "site.com" with your site's host name (without the http:// or
https:// prefix), and supply your login credentials as username and password.
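
In case it is unclear what the preemptive flag changes: with Basic
authentication a client normally waits for a 401 challenge before sending
credentials, whereas in preemptive mode the Authorization header goes out
with the very first request. The sketch below only shows what that header
looks like. It uses the plain JDK (Java 8 or later) rather than the
HttpClient classes above, and the class name, method name, and credentials
are placeholders of mine, not anything from Nutch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class PreemptiveBasicAuthSketch {

    // Builds the value of the Authorization header for HTTP Basic
    // authentication: "Basic " + base64(username + ":" + password).
    static String basicAuthHeader(String username, String password) {
        String pair = username + ":" + password;
        byte[] bytes = pair.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        // Placeholder credentials (hypothetical, for illustration only).
        String header = basicAuthHeader("Aladdin", "open sesame");
        System.out.println("Authorization: " + header);
    }
}
```

Since the header is only base64-encoded, not encrypted, this is best
combined with https.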

You can also store the login credentials in the Nutch conf file and read them from there.
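
A rough sketch of that idea, assuming the credentials are stored as two
properties in a Nutch-style conf file. The property names
http.auth.basic.username and http.auth.basic.password are hypothetical
(I made them up for this example), and the parsing here uses only the JDK's
XML classes so that it runs on its own. Inside the plugin you would
presumably read the values from the configuration object the plugin already
receives, instead of parsing the XML yourself.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class ConfCredentialsSketch {

    // Returns the trimmed text of the first <tag> child, or null if absent.
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    // Parses <property><name>..</name><value>..</value></property> entries
    // from a conf-style XML string into a name -> value map.
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> result = new HashMap<>();
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                String name = text(prop, "name");
                String value = text(prop, "value");
                if (name != null && value != null) {
                    result.put(name, value);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse conf XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical property names and values, for illustration only.
        String xml = "<configuration>"
                + "<property><name>http.auth.basic.username</name><value>crawler</value></property>"
                + "<property><name>http.auth.basic.password</name><value>secret</value></property>"
                + "</configuration>";
        Map<String, String> conf = readProperties(xml);
        String username = conf.get("http.auth.basic.username");
        String password = conf.get("http.auth.basic.password");
        System.out.println("username=" + username + ", password set=" + (password != null));
    }
}
```

The two values would then be passed to UsernamePasswordCredentials in
configureClient in place of hard-coded strings.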

Hope this helps.

- Ravi Chintakunta


On 10/12/06, Tomi NA <[EMAIL PROTECTED]> wrote:
> 2006/10/12, Guruprasad Iyer <[EMAIL PROTECTED]>:
> > Hi,
> >
> > I need to know how to crawl (intranet) sites which require authentication.
> > One suggestion was that I replace protocol-http with protocol-httpclient in
> > the value field of plugin.includes tag in the nutch-default.xml file.
> > However, this did not solve the problem.
> > Can you help me out on this? Thanks.
>
> I don't know what kind of authentication scheme you're up against, but
> recently I had to work with NTLM authentication in an intranet and
> worked around it using an ntlmaps proxy. You tell nutch to use the
> proxy and you provide the proxy with adequate access privileges. As
> simple as that and works like a charm. I imagine the nutch proxy
> support could be extended so that e.g. it selects a proxy based on
> regexp matching of urls. That way it would be possible to provide all
> the login/password pairs needed to crawl all of the sites you're
> interested in.
>
> t.n.a.
>

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
