The problem is the assumption that there is some sort of information that
can be retrieved by submitting information into a form. Many times this is
the case, but there are instances ('Contact Me', 'Subscribe to Mailing
List', etc.) where the results of the submission are actually processed on
the backend and the only 'information' the crawler would see is a success or
failure message. The nutch-agent mailing list was getting many complaints about Nutch bots submitting empty fields, and this prompted the change in Nutch's behavior.

On 6/22/06, bruce <[EMAIL PROTECTED]> wrote:
hi...

suppose i simply want to capture 'all' the information/text/html from a site if i want to mirror the site at this exact moment.. then i'd want to capture the forms (both GET/POST)

are you saying that nutch wouldn't/shouldn't do this?

-bruce

-----Original Message-----
From: Fuad Efendi [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 21, 2006 9:41 PM
To: [email protected]
Subject: RE: following forms using nutch...

According to HTTP/1.1 specs, POST method, page 55, RFC 2616:
"Responses to this method are not cacheable, unless the response includes appropriate Cache-Control or Expires header fields."
http://www.ietf.org/rfc/rfc2616.txt

So, Nutch _should_not_ store anywhere information retrieved via POST...
Web-Developers _expect_ that such pages won't be cached...

Suppose we have a form on a forum (or simple 'Contact Me' form), will Nutch post dummy messages? It was fixed as a bug, and Nutch does not follow 'post' anymore (I believe...)

Thanks

-----Original Message-----
From: Honda-Search Administrator

Bruce,

There is no reason you shouldn't be able to use POST, especially if you use the opensearch method to display your results.

Matt

----- Original Message -----
From: "bruce"

> hi...
>
> not sure whether this should be a dev/user question...
>
> some of the archives seem to indicate that nutch doesn't/can't/perhaps
> shouldn't follow a form that uses POST... is this correct, and if it is,
> can someone tell me why?
>
> can nutch hand forms that use GET??
>
> i'm looking to extract some information off of public college sites, and
> some of the sites use POST, while others use GET with their forms...
>
> thanks
>
> -bruce
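
To make the GET/POST distinction discussed in this thread concrete, here is a minimal Python sketch of a crawler-side filter that collects the forms on a page and keeps only the ones submitted via GET. It is not Nutch code (Nutch is written in Java), and the names `FormCollector` and `followable_form_actions` are hypothetical, chosen only for this illustration. It assumes the usual HTML rule that a form without a method attribute defaults to GET.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collects a (method, action) pair for every <form> tag in a page."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag != "form":
            return
        attr_map = dict(attrs)
        # Assumption: a missing or empty method attribute means GET.
        method = (attr_map.get("method") or "get").strip().upper()
        action = attr_map.get("action") or ""
        self.forms.append((method, action))


def followable_form_actions(html):
    """Return the actions of GET forms only; POST forms are skipped."""
    collector = FormCollector()
    collector.feed(html)
    return [action for method, action in collector.forms if method == "GET"]


page = """
<form method="post" action="/contact"><input name="msg"></form>
<form method="get" action="/search"><input name="q"></form>
<form action="/courses"><input name="dept"></form>
"""
print(followable_form_actions(page))  # ['/search', '/courses']
```

A crawler built along these lines would never send the 'Contact Me' style POST submissions that prompted the complaints, while the GET forms remain available as ordinary URLs to fetch.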
