Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpPostAuthentication

New page:
== Introduction ==
Often, Nutch has to crawl websites with pages protected by authentication. To 
crawl such pages, Nutch must first authenticate itself to the website and then 
proceed with fetching the pages from it. Currently, the development version of 
Nutch can do Basic, Digest and NTLM based authentication, as documented in 
HttpAuthenticationSchemes. In this project, we would be adding HTTP POST based 
authentication, which is the most common form of authentication on websites. 
It should be possible to configure different credentials for different 
websites.

== Configuration ==
A configuration file should list the domains for which authentication is to be 
done, along with the login URL and POST data for each domain. If possible, the 
configuration should also allow the user to specify a session timeout value 
for a website as an optional parameter. This would be helpful if a website is 
known to time out very quickly, or when the duration of the fetch cycle would 
be long compared to the session's lifetime.
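
The exact format of this file is not fixed yet. Below is a minimal sketch of 
one possible shape, written as XML only because Nutch's other configuration 
files are XML; the element and attribute names (auth-post, domain, loginUrl, 
postData, sessionTimeout) are purely illustrative and not part of any existing 
Nutch schema:

{{{
<!-- Hypothetical POST-authentication configuration; names are illustrative only. -->
<auth-configuration>
  <auth-post domain="example.com"
             loginUrl="http://example.com/login.do"
             postData="username=crawler&amp;password=secret"
             sessionTimeout="1800"/>  <!-- optional, in seconds -->
  <auth-post domain="another.example.org"
             loginUrl="http://another.example.org/account/signin"
             postData="user=nutch&amp;pass=secret"/>
</auth-configuration>
}}}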

== Behavior of the fetcher ==
 1. If the URL to be fetched belongs to a domain listed in the configuration file AND this is the first time the fetcher is going to hit that domain, it should first send the configured POST data to the login URL for that domain and obtain the session cookies (see the sketch after this list).
 1. Save the cookies. (protocol-httpclient already maintains cookies for a single fetch cycle, so this should be handled automatically.)
 1. Request the actual URL that was supposed to be fetched.
 1. For further URLs from the same domain, authentication need not be done again within the same fetch cycle.
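
To make the first two steps concrete, here is a rough sketch of the login 
request, using the Commons HttpClient 3.x API on which protocol-httpclient is 
based. The class name PostAuthenticator and the assumed 
"name1=value1&name2=value2" shape of the POST data are illustrative only, not 
existing Nutch code:

{{{
import org.apache.commons.httpclient.Cookie;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;

public class PostAuthenticator {

  /** Sends the configured POST data to the login URL so that the shared
   *  HttpClient picks up the session cookies for later fetches. */
  public static Cookie[] login(HttpClient client, String loginUrl, String postData)
      throws Exception {
    PostMethod post = new PostMethod(loginUrl);
    // postData is assumed to be of the form "name1=value1&name2=value2"
    for (String pair : postData.split("&")) {
      String[] nv = pair.split("=", 2);
      post.addParameter(nv[0], nv.length > 1 ? nv[1] : "");
    }
    try {
      client.executeMethod(post);   // perform the login request
    } finally {
      post.releaseConnection();
    }
    // Cookies are kept in the client's HttpState, so subsequent requests made
    // with the same client carry the session automatically.
    return client.getState().getCookies();
  }
}
}}}

Since the same HttpClient instance is reused for the rest of the fetch cycle, 
its HttpState carries the session cookies to every further request made to 
that domain.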

However, there should be some exceptions to the last behavior, since the 
server may expire the session before our fetch cycle is complete. This would 
be a problem if the crawler hits the website again later in the fetch cycle, 
after the session has expired. Therefore, an exception to the last rule should 
be made if one of the following conditions is met (a combined check is 
sketched after the list):

 1. If the response redirects to the login URL (which can be checked against the configuration file), authentication should be done again.
 1. If the URL returns an error page, authentication should be done again.
 1. If the time elapsed since the last fetch from the website is more than the session timeout specified for it, authentication should be done again.
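
The three checks could be combined roughly as follows, assuming the redirect 
target, the HTTP status code and the time of the last fetch from the domain 
are available to the fetcher; the method and parameter names are illustrative:

{{{
public class ReauthCheck {

  /** Returns true if the fetcher should log in to this domain again. */
  public static boolean needsReauth(String redirectUrl, int statusCode,
                                    long lastFetchTimeMillis,
                                    String loginUrl, long sessionTimeoutSeconds) {
    // 1. The response redirected us to the login URL from the config file.
    if (redirectUrl != null && redirectUrl.equals(loginUrl)) {
      return true;
    }
    // 2. The server returned an error page.
    if (statusCode >= 400) {
      return true;
    }
    // 3. The optional session timeout for this site has elapsed.
    long elapsedSeconds = (System.currentTimeMillis() - lastFetchTimeMillis) / 1000L;
    if (sessionTimeoutSeconds > 0 && elapsedSeconds > sessionTimeoutSeconds) {
      return true;
    }
    return false;
  }
}
}}}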

== Some challenges discussed in the mailing list ==
The authentication failure page may be returned with an HTTP 200 OK status, 
which makes detecting the failure more difficult. Three possible ways to solve 
this:

 1. We use pattern matching on the contents of each page to determine whether it is an authentication failure page for the website. But this is an unnecessary waste of time, because in most cases the page wouldn't be a failure page.
 1. We perform authentication by sending the POST data to the login URL every time we fetch a page from that domain. But this almost doubles the bandwidth required to crawl that website.
 1. For sites where the authentication failure page comes from a known URL, we can list the URLs that indicate authentication failure along with the login URL and POST data in the configuration file. There wouldn't be too many such URLs for a particular domain, so a regex match or a complete string match on the URL after every response from that domain shouldn't consume much time (see the sketch after this list).
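
A small sketch of the third approach, matching the final response URL against 
the failure URLs configured for the domain; keeping one compiled regex pattern 
per configured failure URL is an assumption here, not something the 
configuration file currently defines:

{{{
import java.util.List;
import java.util.regex.Pattern;

public class FailureUrlMatcher {

  /** Returns true if the URL we ended up at matches a known failure URL. */
  public static boolean isAuthFailure(String responseUrl, List<Pattern> failurePatterns) {
    for (Pattern p : failurePatterns) {
      if (p.matcher(responseUrl).matches()) {
        return true;
      }
    }
    return false;
  }
}
}}}

For example, a domain could be configured with a single failure pattern such 
as http://example\.com/login\.do.*, and every response URL from that domain 
would be checked against it before the content is parsed.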

However, even without taking care of these points, simply getting the fetcher 
behavior right as discussed in the previous section will give us a solution 
that may be useful to many.

== Original discussion in the mailing list ==
http://www.mail-archive.com/[EMAIL PROTECTED]/msg10248.html
