And please note the mail from Doug on Nov 23.

---------------------------------------------------------------------------------------------
Title: [Fwd: Spider Causing Contact Form Submissions]
Body:
It looks as though Nutch is inadvertently submitting forms.
At DOMContentUtils.java:58 we specify that the "action" parameter of an HTML form should be extracted as a link. Yet we ignore the "method" parameter of the form. I think we should only follow these when the method is "get", not when it is "post". Do others agree?

Doug
-------------------------------------------------------------------------------------------

I think the source code in svn ignores the POST URL now.

/Jack

On 12/14/05, Jack Tang <[EMAIL PROTECTED]> wrote:
> Hi
>
> You can read the article about Stanford's HiWE search engine on www10.org.
> And it is easy to extend Nutch if you are using the http-client protocol.
>
> http://www10.org/cdrom/posters/p1049/
>
> Good luck:)
>
> /Jack
>
> On 12/14/05, Andy Read <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I'm using nutch to create a site search facility for a couple of sites.
> >
> > I upgraded from 0.6 to 0.7.1 a few days ago and have just noticed that blank
> > users are being registered on my site at the exact times the cron job runs
> > the crawl tool to re-index the site. This means that the crawler is now
> > submitting a post request from the registration form! Is this a new
> > 'feature' of 0.7 or 0.7.1? I can't find any mention in changes.txt and I
> > can't find any config option referring to it. Surely the crawler should
> > never submit form input?
> >
> > Any help appreciated.
> >
> > Thanks,
> >
> > Andy Read
> >
> > www.azurite.co.uk
> >
>
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
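
For reference, the rule Doug proposes in the Nov 23 mail (treat a form's "action" as a link only when the method is "get") could look roughly like the sketch below. This is not the code in DOMContentUtils.java or in svn; the class name FormActionFilter and the method shouldFollowFormAction are made up for illustration, and it uses only the org.w3c.dom classes that ship with the JDK. It also assumes that a form with no "method" attribute counts as "get", since that is the HTML default.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of Nutch: decides whether a <form> element's
// "action" URL may be treated as an ordinary outlink by a crawler.
class FormActionFilter {

    // Returns true only for forms submitted with GET.
    // Element.getAttribute returns "" when the attribute is absent, and HTML
    // forms default to GET, so a missing method is treated as "get" here.
    public static boolean shouldFollowFormAction(Element form) {
        String method = form.getAttribute("method").trim();
        return method.isEmpty() || method.equalsIgnoreCase("get");
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();

        // A search form: fetching its action URL has no side effects.
        Element search = doc.createElement("form");
        search.setAttribute("action", "/search");
        search.setAttribute("method", "GET");

        // A registration form: a crawler should not submit this one.
        Element register = doc.createElement("form");
        register.setAttribute("action", "/register");
        register.setAttribute("method", "post");

        System.out.println(shouldFollowFormAction(search));   // prints true
        System.out.println(shouldFollowFormAction(register)); // prints false
    }
}
```

With a check like this in front of the link extraction, a registration form such as the one Andy describes would be skipped, while GET forms (search boxes, for instance) would still contribute their action URL.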