My guess is that you don't really want to filter out bots, specifically,
but anyone who's attempting to hit every link Fossil generates--that
is to say, it's the behavior we're trying to stop here, not the actor.

I suppose what I'd do is set up a mechanism to detect when the remote
user is pulling down pages too quickly to plausibly be a non-abusive
human, and when Fossil detects that, send back a blank "Whoa, nellie!
Slow down, human!" page for a minute or five.

I'd allow the user to configure two thresholds: the number of pages per
second that counts as a strike, and the number of strikes within a
five-minute window before the lockout kicks in.  I'd give them defaults
of "3 pages per second" and "3 strikes in five minutes".  So, for
example, if a user hits 3 links in one second--which can happen if you
know exactly where you're going and the repository loads quickly--it's
ok the first time, even the second, but the third time, it locks you
out of the web interface for a little while.
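
In rough C, the bookkeeping might look like the sketch below.  To be
clear, this is only an illustration of the idea--every name in it is
invented, and I've elided how Fossil would map an incoming request to
its per-client record (presumably keyed by remote IP address):

    #include <time.h>

    #define BURST_LIMIT     3    /* pages in one second that count as a strike */
    #define STRIKE_LIMIT    3    /* strikes tolerated within the window        */
    #define STRIKE_WINDOW 300    /* five minutes, in seconds                   */
    #define LOCKOUT_TIME  300    /* how long the "slow down" page persists     */

    typedef struct Client Client;
    struct Client {
      time_t curSecond;              /* second the current burst began  */
      int nHit;                      /* page hits seen during curSecond */
      time_t aStrike[STRIKE_LIMIT];  /* times of recent strikes         */
      int nStrike;                   /* entries used in aStrike[]       */
      time_t lockedUntil;            /* 0, or when the lockout ends     */
    };

    /* Return 1 if this request should get the "Whoa, nellie!" page
    ** instead of real content, or 0 to serve it normally. */
    static int throttle(Client *p, time_t now){
      int i, n;
      if( now < p->lockedUntil ) return 1;
      if( now != p->curSecond ){
        p->curSecond = now;
        p->nHit = 0;
      }
      /* A strike is recorded at the moment the burst limit is reached */
      if( ++p->nHit != BURST_LIMIT ) return 0;
      /* Age out strikes older than the five-minute window */
      for(i=n=0; i<p->nStrike; i++){
        if( now - p->aStrike[i] < STRIKE_WINDOW ) p->aStrike[n++] = p->aStrike[i];
      }
      p->nStrike = n;
      p->aStrike[p->nStrike++] = now;
      if( p->nStrike >= STRIKE_LIMIT ){
        p->nStrike = 0;
        p->lockedUntil = now + LOCKOUT_TIME;
        return 1;
      }
      return 0;
    }

With those defaults, a bot pulling 5 to 10 pages per second--the rate
Richard reported--would collect a strike every second and be locked out
within about three seconds.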

Command-line operations--clone, push, pull--ought to remain accessible
under all circumstances, regardless of the activity on the web UI.

What do you think?



On 10/30/2012 03:17 AM, Richard Hipp wrote:
> A Fossil website for a project with a few thousand check-ins can have
> a lot of hyperlinks.  If a spider or bot starts to walk that site, it
> will visit literally hundreds of thousands, or perhaps millions, of
> pages, many of which are things like "vdiff" and "annotate", which are
> computationally expensive to generate, or like "zip" and "tarball",
> which give multi-megabyte replies.  If you get a lot of bots walking a
> Fossil site, it can really load down the CPU and run up bandwidth charges.
>
> To prevent this, Fossil uses bot-exclusion techniques.  First it
> looks at the USER_AGENT string in the HTTP header and uses that to
> distinguish bots from humans.  Of course, a USER_AGENT string is
> easily forged, but most bots are honest about who they are so this is
> a good initial filter.  (The undocumented "fossil test-ishuman"
> command can be used to experiment with this bot discriminator.)
>
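
(Interjecting here: I haven't read that code, but I'd guess the filter
amounts to little more than a substring scan over the USER_AGENT, along
these lines--a hypothetical sketch, with every name mine rather than
Fossil's:

    #include <string.h>

    /* Guess whether a USER_AGENT string belongs to a human's browser. */
    static int is_probably_human(const char *zAgent){
      if( zAgent==0 ) return 0;
      if( strstr(zAgent, "bot")!=0 ) return 0;     /* Googlebot, msnbot, ... */
      if( strstr(zAgent, "spider")!=0 ) return 0;
      if( strstr(zAgent, "crawl")!=0 ) return 0;
      return strncmp(zAgent, "Mozilla/", 8)==0     /* the major browsers */
          || strncmp(zAgent, "Opera/", 6)==0;
    }

Presumably anything that fails the guess is simply served pages without
hyperlinks.)
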
> The second line of defense is that hyperlinks are disabled in the
> transmitted HTML.  There is no href= attribute on the <a> tags.  The
> href= attributes are added by JavaScript code that runs after the page
> has been loaded.  The idea here is that a bot can easily forge a
> USER_AGENT string, but running JavaScript code is a bit more work and
> even malicious bots don't normally go to that kind of trouble.
>
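
(Again interjecting with a sketch of my own, not Fossil's actual
output: the server emits the <a> element with no href= at all, plus a
little script that fills it in once the page has loaded.  The id and
URL here are made up:

    #include <stdio.h>

    /* Emit an href-less anchor and the script that repairs it. */
    static void emit_guarded_link(void){
      printf("<a id=\"ln1\">check-in abc123</a>\n");
      printf("<script>\n");
      printf("window.onload = function(){\n");
      printf("  document.getElementById(\"ln1\").href = \"/info/abc123\";\n");
      printf("};\n");
      printf("</script>\n");
    }

A bot that never runs the script sees only inert <a> tags.)
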
> So, then, to walk a Fossil website, an agent has to (1) present a
> USER_AGENT string from a known friendly web browser and (2) interpret
> JavaScript.
>
> This two-phase defense against bots is usually effective.  But last
> night, a couple of bots got through on the SQLite website.  No great
> damage was done as we have ample bandwidth and CPU reserves to handle
> this sort of thing.  Even so, I'd like to understand how they got
> through so that I might improve Fossil's defenses.
>
> The first run on the SQLite website originated in Chantilly, VA and
> gave a USER_AGENT string as follows:
>
>     Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64;
> Trident/5.0; SLCC2; .NET_CLR 2.0.50727; .NET_CLR 3.5.30729; .NET_CLR
> 3.0.30729; Media_Center_PC 6.0; .NET4.0C; WebMoney_Advisor; MS-RTC_LM_8)
>
> The second run came from Berlin and gives this USER_AGENT:
>
>     Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
>
> Both sessions started out innocently.  The logs suggest that there
> really was a human operator initially.  But then after about 3 minutes
> of "normal" browsing, each session starts downloading every hyperlink
> in sight at a rate of about 5 to 10 pages per second.  It is as if the
> user had pressed a "Download Entire Website" button on their browser. 
> Question:  Is there such a button in IE?
>
> Another question:  Are significant numbers of people still using IE6
> and IE7?  Could we simply change Fossil to consider IE prior to
> version 8 to be a bot, and hence not display any hyperlinks until the
> user has logged in?
>
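
(On the IE question: pulling the version out of the USER_AGENT is only
a few lines, so it would be cheap to try.  Hypothetical code, not
anything in Fossil today:

    #include <string.h>
    #include <stdlib.h>

    /* True if the agent claims to be MSIE with major version below 8. */
    static int is_old_msie(const char *zAgent){
      const char *z = zAgent ? strstr(zAgent, "MSIE ") : 0;
      return z!=0 && atoi(z+5)<8;
    }

Both of last night's agents--"MSIE 7.0" and "MSIE 6.0"--would match.)
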
> Yet another question:  Is there any other software on Windows that I
> am not aware of that might be causing the above behaviors?  Are there
> plug-ins or other tools for IE that will walk a website and download
> all its content?
>
> Finally: Do you have any further ideas on how to defend a Fossil
> website against runs such as the two we observed on SQLite last night?
>
> Tnx for the feedback....
> -- 
> D. Richard Hipp
> d...@sqlite.org
