2006/10/12, Guruprasad Iyer <[EMAIL PROTECTED]>:
> Hi,
> I need to know how to crawl (intranet) sites which require authentication.
> One suggestion was that I replace protocol-http with protocol-httpclient in
> the value field of plugin.includes tag in the nutch-default.xml file.
> However, this did not solve the problem.
> Can you help me out on this? Thanks.

I don't know what kind of authentication scheme you're up against, but
recently I had to work with NTLM authentication on an intranet and
worked around it using an ntlmaps proxy. You tell Nutch to use the
proxy and you give the proxy credentials with adequate access
privileges. It's as simple as that, and it works like a charm. I
imagine Nutch's proxy support could be extended so that, e.g., it
selects a proxy based on regexp matching of URLs. That way it would be
possible to provide all the login/password pairs needed to crawl all of
the sites you're interested in.
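
For reference, a minimal sketch of the Nutch side of such a setup. It
assumes ntlmaps runs on the same machine as the crawler and listens on
port 5865 (its usual default, as far as I know); both values are
assumptions, so adjust them to wherever your proxy actually runs. The
overrides would go in conf/nutch-site.xml rather than in
nutch-default.xml:

```xml
<!-- conf/nutch-site.xml: route Nutch's HTTP fetches through the local proxy. -->
<!-- Host and port below are assumed values for a local ntlmaps instance. -->
<configuration>
  <property>
    <name>http.proxy.host</name>
    <value>localhost</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>5865</value>
  </property>
</configuration>
```

The NTLM domain, user and password are configured on the ntlmaps side,
so Nutch itself never sees the credentials.
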
t.n.a.