The problem is the assumption that submitting a form always retrieves some
sort of information. Many times this is the case, but there are instances
('Contact Me', 'Subscribe to Mailing List', etc.) where the submission is
processed on the backend and the only 'information' the crawler would see
is a success or failure message.
The nutch-agent mailing list was getting many complaints about nutch bots
submitting empty fields, and this prompted the change in nutch's behavior.
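
To make the concern concrete, here is a minimal sketch of the kind of policy a
crawler could apply: follow a form only when it uses GET and has at least one
pre-filled field, and never submit POST forms or empty fields. This is not
nutch's actual code; the names FormCollector and followable_urls are made up
for illustration, and it only looks at <input> elements.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin, urlsplit


class FormCollector(HTMLParser):
    """Collects each form's action, method, and pre-filled <input> values."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                # HTML forms default to GET when no method is given.
                "method": (attrs.get("method") or "get").lower(),
                "fields": {},
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            name = attrs.get("name")
            if name:
                self._current["fields"][name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def followable_urls(base_url, html):
    """Return URLs for GET forms only (hypothetical crawler policy)."""
    parser = FormCollector()
    parser.feed(html)
    urls = []
    for form in parser.forms:
        if form["method"] != "get":
            continue  # POST may trigger side effects (mail, signup, ...)
        filled = {k: v for k, v in form["fields"].items() if v}
        if not filled:
            continue  # never submit a form with only empty fields
        target = urlsplit(urljoin(base_url, form["action"]))
        # A GET submission replaces the query part of the action URL.
        urls.append(target._replace(query=urlencode(filled)).geturl())
    return urls
```

With a policy like this, a 'Contact Me' form that uses POST is skipped
entirely, while a search form with a default query value is still crawled.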
On 6/22/06, bruce <[EMAIL PROTECTED]> wrote:
hi...
suppose i simply want to capture 'all' the information/text/html from a
site. if i want to mirror the site at this exact moment, then i'd want to
capture the forms (both GET/POST actions). are you saying that nutch
wouldn't/shouldn't do this?
-bruce
-----Original Message-----
From: Fuad Efendi [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 21, 2006 9:41 PM
To: [email protected]
Subject: RE: following forms using nutch...
According to HTTP/1.1 specs,
POST method, page 55, RFC 2616:
"Responses to this method are not cacheable, unless the response
includes appropriate Cache-Control or Expires header fields."
http://www.ietf.org/rfc/rfc2616.txt
So, Nutch _should_not_ store anywhere information retrieved via POST...
Web-Developers _expect_ that such pages won't be cached...
Suppose we have a form on a forum (or a simple 'Contact Me' form), will
Nutch post dummy messages? It was fixed as a bug, and Nutch does not follow
'post' anymore (I believe...)
Thanks
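
A rough sketch of how a crawler could apply the RFC 2616 sentence quoted
above when deciding whether to keep a response. The function name
may_store_response is hypothetical, the handling of non-POST methods is a
deliberate simplification, and treating no-store/no-cache/private as "not
appropriate" is my reading of the spec, not nutch behavior.

```python
def may_store_response(method, headers):
    """Return True if the response may be stored, per the quoted POST rule.

    `headers` is a plain dict of response header names to values.
    """
    normalized = {name.lower(): value.lower() for name, value in headers.items()}
    if method.upper() != "POST":
        # Simplification: other methods have their own caching rules,
        # which this sketch does not try to model.
        return True
    cache_control = normalized.get("cache-control", "")
    directives = {
        part.strip().split("=")[0]
        for part in cache_control.split(",")
        if part.strip()
    }
    # Assumption: these directives mean the server does not want the
    # response kept by a shared store such as a crawler's index.
    if directives & {"no-store", "no-cache", "private"}:
        return False
    # POST responses are storable only with explicit freshness information.
    return bool(directives) or "expires" in normalized
```

For a plain POST response with neither header, this returns False, which
matches what web developers expect.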
-----Original Message-----
From: Honda-Search Administrator
Bruce,
There is no reason you shouldn't be able to use POST, especially if you use
the opensearch method to display your results.
Matt
----- Original Message -----
From: "bruce"
> hi...
>
> not sure whether this should be a dev/user question...
>
> some of the archives seem to indicate that nutch doesn't/can't/perhaps
> shouldn't follow a form that uses POST... is this correct, and if it is,
> can someone tell me why?
>
> can nutch handle forms that use GET??
>
> i'm looking to extract some information off of public college sites, and
> some of the sites use POST, while others use GET with their forms...
>
> thanks
>
> -bruce
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general