Hi, 

This would be a great feature to have; it's just what I need. Has any
progress been made? I'd be more than happy to test a beta version if I
can.

Cheers, 
Iwan




Susam Pal wrote:
> 
> Hi,
> 
> Indeed, the answer is negative, and many people have asked this on
> this list. Martin has very nicely explained the problems and a
> possible solution. I'll just add what I have thought of. I have often
> wondered what it would take to create a nice configurable cookie-based
> authentication feature. The following file would be needed:
> 
> - A configuration file with a list of domains for which authentication
> should be done along with the login URL and POST data.
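A minimal sketch of what such a configuration could look like once loaded, written in Python for brevity rather than as Nutch code. The LoginRule class, the rule_for helper, the domain and the field names are all hypothetical and only illustrate the "domain, login URL, POST data" idea described above.

```python
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class LoginRule:
    """One hypothetical config entry: where and what to POST for a domain."""
    login_url: str
    post_data: dict = field(default_factory=dict)


# Hypothetical entries; the domain, URL and form field names are made up.
LOGIN_RULES = {
    "site1.example": LoginRule(
        login_url="http://site1.example/login",
        post_data={"username": "user1", "password": "secret1"},
    ),
}


def rule_for(url: str, rules=LOGIN_RULES) -> Optional[LoginRule]:
    """Return the login rule for the URL's host, or None if none is configured."""
    host = (urlsplit(url).hostname or "").lower()
    return rules.get(host)
```

Keying the table by host keeps the per-URL lookup cheap, since it would run once for every URL the fetcher handles.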
> 
> The behaviour of the fetcher should be as follows:-
> 
> 1 - If the URL to be fetched is from a domain in the config file AND
> this is the first time the fetcher is going to hit the domain, it
> should first send the POST data to the login URL mentioned in the POST
> data configuration file for that domain and obtain the session
> cookies.
> 2 - Save the cookies. (protocol-httpclient does this for a single fetch
> cycle).
> 3 - Request the actual URL that was supposed to be fetched.
> 4 - For further URLs from the same domain, authentication need not be
> done for the same fetch cycle.
> 5.1 - However, if the URL redirects the page to the login URL (which
> can be checked from the configuration file) authentication should be
> done again, OR,
> 5.2 - if the URL returns an error page, authentication should be done
> again.
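The steps above can be sketched roughly as follows. This is Python, not Nutch code, and it covers only steps 1 to 5.1. The FormAuthFetcher class and the http callable are hypothetical; http(method, url, data) is assumed to return (status, location) and to store cookies itself, much as the HttpClient library would.

```python
from urllib.parse import urlsplit


class FormAuthFetcher:
    """Sketch of steps 1 to 5.1 for a single fetch cycle."""

    REDIRECTS = (301, 302, 303, 307)

    def __init__(self, http, rules):
        self.http = http        # hypothetical: http(method, url, data) -> (status, location)
        self.rules = rules      # host -> (login_url, post_data)
        self.logged_in = set()  # hosts already authenticated in this cycle

    def _login(self, host):
        login_url, post_data = self.rules[host]
        self.http("POST", login_url, post_data)  # steps 1 and 2: log in, cookies are kept by http
        self.logged_in.add(host)

    def fetch(self, url):
        host = urlsplit(url).hostname
        rule = self.rules.get(host)
        if rule and host not in self.logged_in:
            self._login(host)                    # first hit on this domain
        status, location = self.http("GET", url, None)  # step 3
        # Step 5.1: a redirect to the login URL suggests the session expired.
        if rule and status in self.REDIRECTS and location == rule[0]:
            self._login(host)
            status, location = self.http("GET", url, None)
        return status
```

Step 5.2 is deliberately left out, because an error page returned with a 200 status cannot be recognised from the status and Location header alone.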
> 
> The cookie management part is already taken care of by the HttpClient
> library. I find point no. 5.2 difficult to crack. The situations in 5.1
> and 5.2 can occur only when the session times out within the same fetch
> cycle. The authentication failure page may be returned with an HTTP 200 OK
> status, which makes it harder to detect. I could think of three ways to
> solve this, and I don't like the first two:
> 
> 1. We use a regex match to find out whether the contents of the page
> indicate that it is an authentication failure page. But this is an
> unnecessary waste of time, because in most cases the page wouldn't be
> an error page.
> 2. We perform an authentication by sending POST data to login URL
> every time we fetch a page from that domain. By this we are almost
> doubling the bandwidth requirement to crawl that website.
> 3. For those sites, where authentication failure page comes from a
> known URL, we can add which URLs mean authentication failure along
> with the login URL and POST data in the configuration file. There
> wouldn't be too many such URLs for a particular domain and so a regex
> match or a complete string match for the URLs after every response
> from that domain shouldn't consume much time.
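The third option could look something like the sketch below (Python, for illustration only). The pattern table, the domain and the is_auth_failure helper are hypothetical. The check is applied to the URL the response finally came from, not to the page contents.

```python
import re

# Hypothetical per-domain patterns for URLs that signal an authentication failure.
AUTH_FAILURE_PATTERNS = {
    "site1.example": [
        re.compile(r"^http://site1\.example/login(\?.*)?$"),
        re.compile(r"^http://site1\.example/error/session-expired$"),
    ],
}


def is_auth_failure(host, final_url, patterns=AUTH_FAILURE_PATTERNS):
    """True if the URL the response came from matches a configured failure URL."""
    return any(p.match(final_url) for p in patterns.get(host, ()))
```

Because only a handful of short patterns are tried per response, the cost stays small compared with scanning every page body as in the first option.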
> 
> However, even without taking care of 5.1 and 5.2 we can have a
> solution that may be useful to many. The solution for this problem is
> very different from the HTTP authentication schemes that I submitted as
> NUTCH-559, because in NUTCH-559 the whole job of authentication can be
> done within the protocol-httpclient plugin. Here, however, some of the
> work has to be done in the fetcher, outside the plugin.
> 
> If I get some free time, I'll try to work on this.
> 
> Regards,
> Susam Pal
> 
> On Jan 6, 2008 12:11 AM, Martin Kuen <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I
>>
>> On Jan 5, 2008 6:50 PM,  <[EMAIL PROTECTED]> wrote:
>> > Hi,
>> >
>> > I'm pretty sure the answer is negative, but I've got to ask - is
>> > support for form-based authentication available somewhere within Nutch?
>> > I believe Nutch does not support form-based auth, so the next question
>> > to ask is - is there a suitable place to plug this in?
>>
>> You're right - not available . . . You can read/see what is available
>> under http://wiki.apache.org/nutch/HttpAuthenticationSchemes
>>
>> > I have not looked into this closely yet, but maybe some of you already
>> > went through this in your own Nutch-based projects.  I am imagining a
>> > system where one would have a file with a bunch of username + password
>> > pairs + a form submission URL (see P.S. below), Nutch would read that
>> > and, before GETing a page from a matching site, it would POST the
>> > username+password via the form submission URL, get the cookie for the
>> > session, keep it stored somewhere and keep sending it back on
>> > subsequent GET requests.
>> >
>>
>> I think you are talking about container-managed security.
>> Unfortunately, this approach will not work . . . read on below
>>
>> > I imagine this gets messy pretty quickly (e.g. session expires even
>> > though cookie is still valid, so how is Nutch to detect this?  How to
>> > patch URLs to possible multiple sub-sections of a site that require one
>> > or more form-based auth mechanisms? etc.), but again, maybe others have
>> > already done some thinking in this space.
>> >
>> > Thanks,
>> > Otis
>> > P.S.
>> > Example file with info for form-based auth might look something like
>> > this:
>> > u=username1    p=password1    http://site1.com/cgi-bin/login.cgi?
>> > username=userX    password=passX    http://www.example.com/foo/bar.do?button=q&;
>> >
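Reading a file in the format of the example above could be sketched as below (Python, for illustration only). The parse_auth_line helper is hypothetical. It assumes each entry sits on one line, with whitespace-separated key=value tokens followed by the form submission URL as the last token.

```python
def parse_auth_line(line):
    """Split one line into (form_fields, submit_url).

    Assumes whitespace-separated key=value tokens with the URL as the last token.
    Raises ValueError on an empty line or on a token without '='.
    """
    *pairs, url = line.split()
    fields = dict(p.split("=", 1) for p in pairs)
    return fields, url


fields, url = parse_auth_line(
    "u=username1    p=password1    http://site1.com/cgi-bin/login.cgi?"
)
```

The keys (u and p, or username and password) are kept exactly as written, on the assumption that they are the form field names the site expects in the POST.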
>>
>> The following is true for Container Managed Authentication:
>> - You cannot directly access the login page - There will be an error
>> message saying "Invalid direct reference to login page"
>> - If you access a protected resource and you're not authenticated,
>> Tomcat will issue a 302 redirect to the login page. The original request
>> will be saved inside the session. The browser follows the redirect and
>> displays the login page. Then you can type in your uname + pwd and if
>> everything is right you'll be redirected to the previously accessed
>> resource (the previous/original request which is stored in the
>> session).
>>
>> For me this boils down to:
>> - Access a (probably) protected resource.
>> - Intercept the redirection and find out if it's one of the configured
>> login pages. Alternatively one could try to find the "j_security_check"
>> (or j_username, j_password) string. I am not sure about the exact
>> spelling.
>> - Do a POST to the login form using uname and pwd from config file
>> - Follow the redirection and/or detect a redirect to the error-page
>> (therefore, I think you need login AND error page in your config file)
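The redirect handling in the steps above could be sketched as a small classifier (Python, for illustration only). The classify_response function and the page paths are hypothetical. login_page and error_page are assumed to be paths taken from the config file.

```python
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def classify_response(request_url, status, location, login_page, error_page):
    """Return 'login', 'login_failed', 'redirect' or 'ok' for one HTTP response."""
    if status not in REDIRECT_CODES or not location:
        return "ok"
    # Location may be relative, so resolve it against the request URL first.
    target_path = urlsplit(urljoin(request_url, location)).path
    if target_path == login_page:
        return "login"         # protected resource: POST the credentials next
    if target_path == error_page:
        return "login_failed"  # the credentials were rejected
    return "redirect"          # ordinary redirect: just follow it
```

This sketch compares paths exactly, so it does not handle anything extra that a container might append to the path, such as a session id.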
>>
>> I am not sure if you can achieve these things by "just" writing a plugin.
>> Have a look at the Fetcher class (I think
>> org.apache.nutch.fetcher.Fetcher). There you find the code which deals
>> with http redirections.
>> Have a look at the http-protocol plugin.
>> However, you also may want to have a look at the authentication code
>> in trunk by Susam Pal.
>>
>> These are the three places where I imagine you'll find what you need.
>>
>>
>> Hope this helps and good luck,
>>
>> Martin
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/form-based-authentication--tp14636603p14857152.html
Sent from the Nutch - User mailing list archive at Nabble.com.
