Ken Krugler wrote:
Jeremy Bensley wrote:
There are posts every three or four days to the nutch-agent
regarding bots
submitting empty forms to websites. I don't think I've seen any
regular devs
reply in-list to these issues, and am just wondering if these cases are
being analyzed.
1. Is there a known (resolved or current) bug regarding Nutch
submitting
forms? I could find no bug listings in JIRA for this. If it is
known and
resolved, what versions of the bot exhibit this behavior?
Yes, there was a discussion on the list about this - I'm afraid this
behavior is present in both 0.7.x and 0.8. I'm going to remove the
offending code (or make it an option that is turned off by default).
I think the biggest issue is following links for a form POST. This
definitely seems wrong to me, and thus should never be done.
I don't think this is happening anymore; there is an explicit check for
the POST method in DOMContentUtils that should prevent it. However, some
horribly broken HTML may be fooling Neko or TagSoup, so that they lose
the 'method' attribute (in which case it defaults to GET).
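To illustrate the failure mode described above, here's a minimal sketch of that kind of guard - the class and method names are hypothetical, not the actual DOMContentUtils code. The point is that HTML defines GET as the default form method, so markup mangled enough to lose the 'method' attribute looks like a GET form and slips past a POST-only check:

```java
// Hypothetical sketch, not actual Nutch code: decide whether a form's
// action URL may be followed, based on the form's 'method' attribute.
public class FormLinkFilter {

    // Returns true if the crawler may follow the form's action URL.
    // A missing or empty 'method' attribute defaults to GET in HTML,
    // which is why broken markup that drops the attribute gets through
    // a check that only rejects POST forms.
    public static boolean shouldFollowForm(String methodAttr) {
        if (methodAttr == null || methodAttr.trim().isEmpty()) {
            return true; // no attribute: HTML defaults to GET
        }
        return methodAttr.trim().equalsIgnoreCase("GET");
    }
}
```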
There's a separate issue re whether it's OK to follow form links that
do a GET, since that's what the guy complained to us about recently.
He agreed that his form should be doing a POST, since it triggers a
massive build process, but he also said that no other crawl besides
Nutch was following these links.
I could see making that a configurable option that is false by
default. But we'd probably need to make the setting
domain-specific, i.e. some sites we crawl require us to follow these
types of links to get at content, but in general we wouldn't want to
follow them.
For now I modified the code to skip form action URLs, depending on a
boolean option. I'll commit this in a moment.
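The change described above amounts to filtering form-action URLs out of a page's outlinks behind one boolean flag. A rough sketch of that shape (class name and flag are made up here, not the committed code; the committed version likely reads the flag from the Nutch configuration):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the boolean option described above: when the
// flag is false (the proposed default), form action URLs are dropped
// from the set of outlinks and never fetched.
public class OutlinkFilter {

    private final boolean followFormActions;

    public OutlinkFilter(boolean followFormActions) {
        this.followFormActions = followFormActions;
    }

    // Combines ordinary page links with form-action links, keeping the
    // latter only if the option is enabled.
    public List<String> filter(List<String> pageLinks,
                               List<String> formActionLinks) {
        List<String> out = new ArrayList<>(pageLinks);
        if (followFormActions) {
            out.addAll(formActionLinks);
        }
        return out;
    }
}
```

A domain-specific version, as suggested above, would consult a per-host override before falling back to this global default.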
This brings up an issue I've been thinking about. It might make sense
to require that everybody set the user-agent string, rather than
shipping default values that point to Nutch.
The first time you run Nutch, it would display an error re the
user-agent string not being set, but if the instructions for how to do
this were explicit, this wouldn't be much of a hardship for anybody
trying it out.
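The proposed check is a simple fail-fast validation at startup. A sketch of the idea (the method and the exact placeholder value it rejects are assumptions here, not the real Nutch behavior):

```java
// Hypothetical sketch of the proposed startup check: refuse to crawl
// until the user-agent name has been customized away from the
// Nutch default, and tell the user exactly how to fix it.
public class AgentNameCheck {

    // True iff the configured agent name looks customized, i.e. it is
    // non-empty and not the shipped "Nutch" placeholder (assumed here).
    public static boolean isAgentNameSet(String agentName) {
        return agentName != null
            && !agentName.trim().isEmpty()
            && !agentName.trim().equalsIgnoreCase("Nutch");
    }

    // Fails fast with explicit instructions, as described above.
    public static void requireAgentName(String agentName) {
        if (!isAgentNameSet(agentName)) {
            throw new IllegalStateException(
                "User-agent string is not set. Please configure it "
                + "before crawling; see the wiki for what a good "
                + "user-agent string should contain.");
        }
    }
}
```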
I could write up some quick text for the Wiki re what a good user
agent string should contain, and what should be on the web page that
it refers to, since we also went through that same process not too
long ago.
I like this idea. I know that I've been guilty of this in the past, out
of pure laziness ...
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com