Hi,

This would be a great feature to have; it's just what I need! Has any progress been made? I'd be more than happy to test out a beta version if I can.
Cheers,
Iwan

Susam Pal wrote:
>
> Hi,
>
> Indeed the answer is negative, and many people have asked this on
> this list. Martin has very nicely explained the problems and a
> possible solution. I'll just add what I have thought of. I have often
> wondered what it would take to create a nice configurable cookie-based
> authentication feature. The following file would be needed:-
>
> - A configuration file with a list of domains for which authentication
>   should be done, along with the login URL and POST data.
>
> The behaviour of the fetcher should be as follows:-
>
> 1 - If the URL to be fetched is from a domain in the config file AND
>     this is the first time the fetcher is going to hit the domain, it
>     should first send the POST data to the login URL mentioned in the
>     configuration file for that domain and obtain the session
>     cookies.
> 2 - Save the cookies. (protocol-httpclient does this for a single fetch
>     cycle.)
> 3 - Request the actual URL that was supposed to be fetched.
> 4 - For further URLs from the same domain, authentication need not be
>     done again in the same fetch cycle.
> 5.1 - However, if the URL redirects to the login URL (which
>     can be checked from the configuration file), authentication should be
>     done again, OR,
> 5.2 - if the URL returns an error page, authentication should be done
>     again.
>
> The cookie management part is already taken care of by the HttpClient
> library. I find point no. 5.2 difficult to crack. The situations in 5.1
> and 5.2 may occur only when the session times out within the same fetch
> cycle. The authentication failure page may be returned with an HTTP 200 OK
> status, which makes it more difficult. I could think of three ways to
> solve this, and I don't like the first two:-
>
> 1. We use a regex match to find out whether the contents of the page
>    indicate an authentication failure page or not. But it is an
>    unnecessary waste of time, because in most cases the page wouldn't be
>    an error page.
> 2. We perform an authentication by sending POST data to the login URL
>    every time we fetch a page from that domain. By this we are almost
>    doubling the bandwidth requirement to crawl that website.
> 3. For those sites where the authentication failure page comes from a
>    known URL, we can add which URLs mean authentication failure along
>    with the login URL and POST data in the configuration file. There
>    wouldn't be too many such URLs for a particular domain, so a regex
>    match or a complete string match on the URLs after every response
>    from that domain shouldn't consume much time.
>
> However, even without taking care of 5.1 and 5.2 we can have a
> solution that may be useful to many. The solution for this problem is
> very different from the HTTP authentication schemes that I submitted as
> NUTCH-559, because in NUTCH-559 the whole job of authentication can be
> done within the protocol-httpclient plugin. Here, however, some of the
> work has to be done in the fetcher, outside the plugin.
>
> If I get some free time, I'll try to work on this.
>
> Regards,
> Susam Pal
>
> On Jan 6, 2008 12:11 AM, Martin Kuen <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I
>>
>> On Jan 5, 2008 6:50 PM, <[EMAIL PROTECTED]> wrote:
>> > Hi,
>> >
>> > I'm pretty sure the answer is negative, but I've got to ask - is
>> > support for form-based authentication available somewhere within Nutch?
>> > I believe Nutch does not support form-based auth, so the next question
>> > to ask is - is there a suitable place to plug this in?
>>
>> You're right - not available . . . You can read/see what is available
>> under http://wiki.apache.org/nutch/HttpAuthenticationSchemes
>>
>> > I have not looked into this closely yet, but maybe some of you already
>> > went through this in your own Nutch-based projects. I am imagining a
>> > system where one would have a file with a bunch of username + password
>> > pairs + a form submission URL (see P.S.
>> > below), Nutch would read that
>> > and, before GETing a page from a matching site, it would POST the
>> > username+password via the form submission URL, get the cookie for the
>> > session, keep it stored somewhere and keep sending it back on
>> > subsequent GET requests.
>> >
>>
>> I think you are talking about container managed security.
>> Unfortunately, this approach will not work . . . read on below
>>
>> > I imagine this gets messy pretty quickly (e.g. session expires even
>> > though cookie is still valid, so how is Nutch to detect this? How to
>> > patch URLs to possibly multiple sub-sections of a site that require one
>> > or more form-based auth mechanisms? etc.), but again, maybe others have
>> > already done some thinking in this space.
>> >
>> > Thanks,
>> > Otis
>> > P.S.
>> > Example file with info for form-based auth might look something like
>> > this:
>> > u=username1 p=password1 http://site1.com/cgi-bin/login.cgi?
>> > username=userX password=passX http://www.example.com/foo/bar.do?button=q&
>> >
>>
>> The following is true for Container Managed Authentication:
>> - You cannot directly access the login page - There will be an error
>>   message saying "Invalid direct reference to login page"
>> - If you access a protected resource and you're not authenticated,
>>   Tomcat will issue a 302 redirect to the login page. The original request
>>   will be saved inside the session. The browser follows the redirect and
>>   displays the login page. Then you can type in your uname + pwd and, if
>>   everything is right, you'll be redirected to the previously accessed
>>   resource (the previous/original request, which is stored in the
>>   session).
>>
>> For me this boils down to:
>> - Access a (probably) protected resource.
>> - Intercept the redirection and find out if it's one of the configured
>>   login pages. Alternatively one could try to find the "j_security_check"
>>   (,j_username, j_password) string. I am not sure about the exact
>>   spelling.
>> - Do a POST to the login form using uname and pwd from the config file
>> - Follow the redirection and/or detect a redirect to the error page
>>   (therefore, I think you need the login AND error page in your config file)
>>
>> I am not sure if you can achieve these things by "just" writing a plugin.
>> Have a look at the Fetcher class (I think
>> org.apache.nutch.fetcher.Fetcher). There you find the code which deals
>> with HTTP redirections.
>> Have a look at the http-protocol plugin.
>> However, you also may want to have a look at the authentication code
>> in trunk by Susam Pal.
>>
>> These are the three places where I imagine you'll find what you need.
>>
>>
>> Hope this helps and good luck,
>>
>> Martin
>>
>

--
View this message in context: http://www.nabble.com/form-based-authentication--tp14636603p14857152.html
Sent from the Nutch - User mailing list archive at Nabble.com.
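
For readers following the thread, the fetcher behaviour Susam describes in steps 1 to 5.1 could be sketched roughly as below. This is an illustration in Python, not Nutch code and not anyone's actual patch: `LoginConfig`, `Response`, `FormAuthFetcher` and the `transport` callable are hypothetical names made up for this sketch, and the HTTP layer is passed in as a function so the logic can be exercised without a network. Step 5.2 (an error page served with 200 OK) is deliberately left out, since the thread itself leaves that question open.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional
from urllib.parse import urlsplit


@dataclass
class LoginConfig:
    """Hypothetical per-domain config entry: login URL plus form POST data."""
    login_url: str
    post_data: Dict[str, str]


@dataclass
class Response:
    """Minimal stand-in for an HTTP response (assumed shape, not a real API)."""
    status: int
    body: str = ""
    location: Optional[str] = None              # redirect target, if any
    cookies: Dict[str, str] = field(default_factory=dict)


# transport(method, url, form_data, cookies) -> Response
Transport = Callable[[str, str, Optional[Dict[str, str]], Dict[str, str]], Response]

REDIRECT_CODES = (301, 302, 303, 307)


class FormAuthFetcher:
    def __init__(self, config: Dict[str, LoginConfig], transport: Transport):
        self.config = config                    # host -> LoginConfig
        self.transport = transport
        self.cookies: Dict[str, Dict[str, str]] = {}  # host -> session cookies

    def _login(self, host: str) -> None:
        cfg = self.config[host]
        resp = self.transport("POST", cfg.login_url, cfg.post_data, {})
        self.cookies[host] = dict(resp.cookies)  # step 2: save the cookies

    def fetch(self, url: str) -> Response:
        host = urlsplit(url).hostname or ""
        cfg = self.config.get(host)
        if cfg is None:
            # Domain is not in the config file: plain fetch, no authentication.
            return self.transport("GET", url, None, {})
        if host not in self.cookies:
            self._login(host)                    # step 1: first hit on this domain
        # Steps 3 and 4: request the URL, reusing the stored session cookies.
        resp = self.transport("GET", url, None, self.cookies[host])
        # Step 5.1: a redirect to the configured login URL means the session
        # expired, so log in again and retry the original URL once.
        if resp.status in REDIRECT_CODES and resp.location == cfg.login_url:
            self._login(host)
            resp = self.transport("GET", url, None, self.cookies[host])
        return resp
```

Retrying only once after a re-login keeps a misconfigured entry (wrong credentials, for example) from looping forever; a real fetcher would presumably also want to record the failure.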