Indeed, the answer is negative, and many people have asked about this
on this list. Martin has explained the problems and a possible solution
very nicely. I'll just add what I have thought of. I have often
wondered what it would take to create a nice, configurable cookie-based
authentication feature. The following file would be needed:
- A configuration file with a list of domains for which authentication
should be done along with the login URL and POST data.
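A minimal sketch of what such a file and its parser might look like.
The one-line-per-domain format, the names LoginRule and
parse_login_config, and the sample values are all assumptions made for
illustration (and it is written in Python rather than Nutch's Java just
to keep it short); nothing like this exists in Nutch.

```python
from dataclasses import dataclass
from urllib.parse import parse_qsl


@dataclass
class LoginRule:
    """One entry of the hypothetical login configuration file."""
    domain: str
    login_url: str
    post_data: dict


def parse_login_config(text):
    """Parse lines of the assumed form:
    <domain> <login URL> <URL-encoded POST data>
    Blank lines and lines starting with '#' are ignored.
    """
    rules = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        domain, login_url, post = line.split(None, 2)
        rules[domain.lower()] = LoginRule(
            domain.lower(), login_url, dict(parse_qsl(post))
        )
    return rules


# Hypothetical sample configuration.
SAMPLE = """\
# domain       login URL                      POST data
example.com    http://example.com/login.do    user=alice&pass=secret
"""

rules = parse_login_config(SAMPLE)
```

Keying the rules by domain keeps the "is this URL from a configured
domain?" check to a single dictionary lookup per URL.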
The behaviour of the fetcher should be as follows:
1 - If the URL to be fetched is from a domain in the config file AND
this is the first time the fetcher is going to hit the domain, it
should first send the POST data to the login URL given in the
configuration file for that domain and obtain the session.
2 - Save the cookies. (protocol-httpclient does this for a single fetch cycle).
3 - Request the actual URL that was supposed to be fetched.
4 - For further URLs from the same domain, authentication need not be
done for the same fetch cycle.
5.1 - However, if the URL redirects the page to the login URL (which
can be checked from the configuration file) authentication should be
done again, OR,
5.2 - if the URL returns an error page, authentication should be done again.
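The steps above can be sketched roughly as follows. This is not the
Nutch Fetcher API: Response, AuthenticatingFetcher, the transport
callable and FakeSite are hypothetical names, and the "site" is an
in-memory stand-in so the sketch runs without a network. Step 5.2 is
approximated here by an HTTP status of 400 or above, which does not
cover an error page served as 200 OK.

```python
from urllib.parse import urlsplit


class Response:
    def __init__(self, status, url, body="", cookies=None):
        self.status = status          # HTTP status code
        self.url = url                # final URL after any redirect
        self.body = body
        self.cookies = cookies or {}  # cookies set by the server


class AuthenticatingFetcher:
    def __init__(self, rules, transport):
        self.rules = rules          # domain -> (login_url, post_data)
        self.transport = transport  # callable(method, url, data, cookies)
        self.cookies = {}           # domain -> cookies for this fetch cycle

    def _login(self, domain):
        login_url, post_data = self.rules[domain]
        resp = self.transport("POST", login_url, post_data, {})
        self.cookies[domain] = dict(resp.cookies)  # step 2: save cookies

    def fetch(self, url):
        domain = urlsplit(url).hostname
        if domain not in self.rules:
            return self.transport("GET", url, None, {})
        if domain not in self.cookies:  # step 1: first hit on this domain
            self._login(domain)
        # Steps 3 and 4: request the page with the saved cookies.
        resp = self.transport("GET", url, None, self.cookies[domain])
        login_url = self.rules[domain][0]
        # Step 5.1: redirected to the login URL; step 5.2: error status.
        if resp.url == login_url or resp.status >= 400:
            self._login(domain)
            resp = self.transport("GET", url, None, self.cookies[domain])
        return resp


class FakeSite:
    """Stand-in site whose session expires after two page views."""
    LOGIN = "http://example.com/login"

    def __init__(self):
        self.logins = 0
        self.views_left = 0

    def __call__(self, method, url, data, cookies):
        if method == "POST" and url == self.LOGIN:
            self.logins += 1
            self.views_left = 2
            return Response(200, url, cookies={"sid": str(self.logins)})
        if cookies.get("sid") == str(self.logins) and self.views_left > 0:
            self.views_left -= 1
            return Response(200, url, body="content")
        # Session missing or expired: behave like a redirect to login.
        return Response(200, self.LOGIN, body="please log in")


site = FakeSite()
fetcher = AuthenticatingFetcher(
    {"example.com": (FakeSite.LOGIN, {"user": "alice", "pass": "secret"})},
    site,
)
pages = [fetcher.fetch("http://example.com/page%d" % i) for i in range(3)]
```

In this run the third page hits the expired session, so the fetcher
logs in a second time and retries once. Retrying only once avoids
looping forever when the credentials themselves are wrong.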
The cookie management part is already taken care of by the HttpClient
library. I find point 5.2 difficult to crack. The situations in 5.1
and 5.2 can occur only when the session times out within the same fetch
cycle. The authentication failure page may be returned with an HTTP
200 OK status, which makes it harder to detect. I can think of three
ways to solve this, and I don't like the first two:
1. We use a regex match to find out whether the contents of the page
indicate that it is an authentication failure page. But this is an
unnecessary waste of time, because in most cases the page wouldn't be
an error page.
2. We authenticate by sending the POST data to the login URL every
time we fetch a page from that domain. This almost doubles the
bandwidth required to crawl that website.
3. For sites where the authentication failure page comes from a known
URL, we can list the URLs that mean authentication failure alongside
the login URL and POST data in the configuration file. There wouldn't
be too many such URLs for a particular domain, so a regex match or a
complete string match on the URL after every response from that domain
shouldn't consume much time.
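A minimal sketch of option 3, assuming the failure URLs are kept per
domain as regular expressions. AUTH_FAILURE_URLS, is_auth_failure and
the example patterns are hypothetical, not part of Nutch or of any
existing configuration format.

```python
import re

# Hypothetical per-domain list of URLs that mean "authentication failed".
AUTH_FAILURE_URLS = {
    "example.com": [
        re.compile(r"^http://example\.com/login(\?.*)?$"),
        re.compile(r"^http://example\.com/error/session-expired$"),
    ],
}


def is_auth_failure(domain, final_url, patterns=AUTH_FAILURE_URLS):
    """Return True if the response URL matches a configured failure URL.

    Only the URL is inspected, never the page body, so the cost per
    response is a handful of regex matches.
    """
    return any(p.search(final_url) for p in patterns.get(domain, ()))
```

For example, is_auth_failure("example.com",
"http://example.com/login?next=/a") is true, while an ordinary page URL
on the same domain, or any URL on an unconfigured domain, is not.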
However, even without taking care of 5.1 and 5.2, we can have a
solution that may be useful to many. The solution to this problem is
very different from the HTTP authentication schemes that I submitted as
NUTCH-559, because in NUTCH-559 the whole job of authentication can be
done within the protocol-httpclient plugin. Here, some of the work
also has to be done in the fetcher, outside the plugin.
If I get some free time, I'll try to work on this.
On Jan 6, 2008 12:11 AM, Martin Kuen <[EMAIL PROTECTED]> wrote:
> On Jan 5, 2008 6:50 PM, <[EMAIL PROTECTED]> wrote:
> > Hi,
> > I'm pretty sure the answer is negative, but I've got to ask - is support
> > for form-based authentication available somewhere within Nutch?
> > I believe Nutch does not support form-based auth, so the next question to
> > ask is - is there a suitable place to plug this in?
> You're right - not available . . . You can read/see what is available
> under http://wiki.apache.org/nutch/HttpAuthenticationSchemes
> > I have not looked into this closely yet, but maybe some of you already
> > went through this in your own Nutch-based projects. I am imagining a
> > system where one would have a file with a bunch of username + password
> > pairs + a form submission URL (see P.S. below), Nutch would read that
> > and, before GETing a page from a matching site, it would POST the
> > username+password via the form submission URL, get the cookie for the
> > session, keep it stored somewhere and keep sending it back on
> > subsequent GET requests.
> I think you are talking about container managed security.
> Unfortunately, this approach will not work . . . read on below
> > I imagine this gets messy pretty quickly (e.g. session expires even though
> > cookie is still valid, so how is Nutch to detect this? How to patch URLs
> > to possible multiple sub-sections of a site that require one or more
> > form-based auth mechanisms? etc.), but again, maybe others have already
> > done some thinking in this space.
> > Thanks,
> > Otis
> > P.S.
> > Example file with info for form-based auth might look something like this:
> > u=username1 p=password1 http://site1.com/cgi-bin/login.cgi?
> > username=userX password=passX http://www.example.com/foo/bar.do?button=q&
> The following is true for Container Managed Authentication:
> - You cannot directly access the login page. There will be an error
> message saying "Invalid direct reference to login page".
> - If you access a protected resource and you're not authenticated,
> Tomcat will issue a 302 redirect to the login page. The original request
> will be saved inside the session. The browser follows the redirect and
> displays the login page. Then you can type in your uname + pwd and if
> everything is right you'll be redirected to the previously accessed
> resource (the previous/original request which is stored in the
> session).
> For me this boils down to:
> - Access a (probably) protected resource.
> - Intercept the redirection and find out if it's one of the configured
> login pages. Alternatively one could try to find the "j_security_check"
> (,j_username, j_password) string. I am not sure about the exact
> - Do a POST to the login form using uname and pwd from config file
> - Follow the redirection and/or detect a redirect to the error-page
> (therefore, I think you need login AND error page in your config file)
> I am not sure if you can achieve these things by "just" writing a plugin.
> Have a look at the Fetcher class (I think
> org.apache.nutch.fetcher.Fetcher). There you find the code which deals
> with http redirections.
> Have a look at the http-protocol plugin.
> However, you also may want to have a look at the authentication code
> in trunk by Susam Pal
> These are the three places where I imagine you'll find what you need.
> Hope this helps and good luck,