Re: [fossil-users] Help improve bot exclusion

2012-10-30 Thread Arjen Markus

On Tue, 30 Oct 2012 06:17:05 -0400, Richard Hipp d...@sqlite.org wrote:



 This two-phase defense against bots is usually effective.  But last night,
 a couple of bots got through on the SQLite website.  No great damage was
 done as we have ample bandwidth and CPU reserves to handle this sort of
 thing.  Even so, I'd like to understand how they got through so that I
 might improve Fossil's defenses.

 The first run on the SQLite website originated in Chantilly, VA and gave a
 USER_AGENT string as follows:

    Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0;
    SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729;
    Media Center PC 6.0; .NET4.0C; WebMoney Advisor; MS-RTC LM 8)

 The second run came from Berlin and gives this USER_AGENT:

    Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

 Both sessions started out innocently.  The logs suggest that there really
 was a human operator initially.  But then after about 3 minutes of normal
 browsing, each session starts downloading every hyperlink in sight at a
 rate of about 5 to 10 pages per second.  It is as if the user had pressed
 a "Download Entire Website" button on their browser.  Question: Is there
 such a button in IE?


I just tried it: you can save a URL as a single web page or as a web archive
(extension .mht, whatever that means). So it seems quite possible - and it
appears to be the default when using "save as".

This was with IE 8.

Regards,

Arjen









Re: [fossil-users] Help improve bot exclusion

2012-10-30 Thread Lluís Batlle i Rossell
On Tue, Oct 30, 2012 at 06:17:05AM -0400, Richard Hipp wrote:
 Finally: Do you have any further ideas on how to defend a Fossil website
 against runs such as the two we observed on SQLite last night?

This problem affects almost any web software, and I think that job is delegated
to robots.txt. Isn't this approach good enough? And in the particular case of
the fossil standalone server, it could serve a robots.txt.
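
For the standalone server, that could be as simple as answering /robots.txt
with a policy that keeps well-behaved crawlers away from the expensive pages.
A minimal sketch (the paths are illustrative, taken from the page names
mentioned elsewhere in this thread):

    User-agent: *
    Disallow: /vdiff
    Disallow: /annotate
    Disallow: /zip
    Disallow: /tarball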

How do programs like 'viewcvs' or 'viewsvn' deal with that?

Regards,
Lluís.


Re: [fossil-users] Help improve bot exclusion

2012-10-30 Thread Kees Nuyt
On Tue, 30 Oct 2012 06:17:05 -0400, Richard Hipp d...@sqlite.org wrote:

[...]

 Both sessions started out innocently.  The logs suggest that there really
 was a human operator initially.  But then after about 3 minutes of normal
 browsing, each session starts downloading every hyperlink in sight at a
 rate of about 5 to 10 pages per second.  It is as if the user had pressed
 a "Download Entire Website" button on their browser.  Question:  Is there
 such a button in IE?

No, just "save page as...".  It will not follow hyperlinks, only save the
HTML and embedded resources, like images.

 Another question:  Are significant numbers of people still using IE6 and
 IE7?  Could we simply change Fossil to consider IE prior to version 8 to be
 a bot, and hence not display any hyperlinks until the user has logged in?

I don't think it would help much. Newer versions will potentially run
the same add-ons.

By the way, over 5% of the population still use these older versions.
http://stats.wikimedia.org/archive/squid_reports/2012-09/SquidReportClients.htm

 Yet another question:  Is there any other software on Windows that I am not
 aware of that might be causing the above behaviors?  Are there plug-ins or
 other tools for IE that will walk a website and download all its content?

There are several browser add-ons that will try to walk complete
websites, e.g.:
http://www.winappslist.com/download_managers.htm
http://www.unixdaemon.net/ie-plugins.html

One can also think of validator tools.

Standalone programs usually will not run javascript.


 Finally: Do you have any further ideas on how to defend a Fossil website
 against runs such as the two we observed on SQLite last night?

Perhaps the href javascript should run onfocus, rather than onload?
(untested)

Other defenses could use DoS-mitigation techniques, like not honouring (or
aggressively delaying responses to) more than a certain number of requests
within a certain time.  That is not nice, though, because the server would
have to maintain (more) session state.

Sidenote:
As far as I can tell, several modern browsers have a read-ahead option
that will try to load more pages of the site before a link is clicked:
https://developers.google.com/chrome/whitepapers/prerender
Those will not walk a whole site, though.

-- 
Groet, Cordialement, Pozdrawiam, Regards,

Kees Nuyt



Re: [fossil-users] Help improve bot exclusion

2012-10-30 Thread Richard Hipp
On Tue, Oct 30, 2012 at 6:23 AM, Lluís Batlle i Rossell vi...@viric.name wrote:

 On Tue, Oct 30, 2012 at 06:17:05AM -0400, Richard Hipp wrote:
  Finally: Do you have any further ideas on how to defend a Fossil website
  against runs such as the two we observed on SQLite last night?

 This problem affects almost any web software, and I think that job is
 delegated
 to robots.txt. Isn't this approach good enough?


Robots.txt only works over an entire domain.  If your Fossil server is
running as CGI within that domain, you can manually modify your robots.txt
file to exclude all or part of the fossil URI space.  But as that file is
not under control of Fossil, you have to make this configuration yourself -
Fossil cannot help you.  This burden can become acute when you are managing
many dozens or even hundreds of Fossil repositories.  An automatic system
is better.
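
For the CGI case, the manual configuration would be a stanza like the
following in the domain's top-level robots.txt (the CGI path here is a
made-up example):

    # Keep crawlers out of one Fossil repository's URI space
    User-agent: *
    Disallow: /cgi-bin/myproject.cgi/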



 And in the particular case of the fossil standalone server, it could
 serve a robots.txt.

 How do programs like 'viewcvs' or 'viewsvn' deal with that?

 Regards,
 Lluís.




-- 
D. Richard Hipp
d...@sqlite.org


Re: [fossil-users] Help improve bot exclusion

2012-10-30 Thread Bernd Paysan
On Tuesday, 30 October 2012, 08:20:14, Richard Hipp wrote:
 On Tue, Oct 30, 2012 at 6:23 AM, Lluís Batlle i Rossell
 vi...@viric.name wrote:
  On Tue, Oct 30, 2012 at 06:17:05AM -0400, Richard Hipp wrote:
   Finally: Do you have any further ideas on how to defend a Fossil website
   against runs such as the two we observed on SQLite last night?
 
  This problem affects almost any web software, and I think that job is
  delegated
  to robots.txt. Isn't this approach good enough?

 Robots.txt only works over an entire domain.  If your Fossil server is
 running as CGI within that domain, you can manually modify your robots.txt
 file to exclude all or part of the fossil URI space.  But as that file is
 not under control of Fossil, you have to make this configuration yourself -
 Fossil cannot help you.  This burden can become acute when you are managing
 many dozens or even hundreds of Fossil repositories.  An automatic system
 is better.

The search engine crawlers do honor the robots meta-tag:

http://www.robotstxt.org/meta.html

Adding this is a piece of cake (just change the page template), but it doesn't
help against malware.
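
The tag itself, as documented on the page above, is a one-line addition
to the header template:

    <meta name="robots" content="noindex,nofollow">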

--
Bernd Paysan
If you want it done right, you have to do it yourself
http://bernd-paysan.de/




Re: [fossil-users] Help improve bot exclusion

2012-10-30 Thread Kees Nuyt
On Tue, 30 Oct 2012 06:17:05 -0400, Richard Hipp d...@sqlite.org wrote:

 Finally: Do you have any further ideas on how to defend a Fossil website
 against runs such as the two we observed on SQLite last night?

Another suggestion:
Include a (mostly invisible, perhaps hard to recognize) logout hyperlink
on every page that immediately invalidates the session if it is
followed. Users will not see it and not be bothered by it, bots will
stumble upon it.
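
A sketch of such a trap link (the /trap-logout path and the
session-invalidating handler behind it are hypothetical, not an existing
Fossil feature):

    <!-- Humans never see this; a link-walking bot will follow it
         and invalidate its own session. -->
    <a href="/trap-logout" style="display:none" rel="nofollow">log out</a>

Since the link is display:none it never receives keyboard focus, so it
should not disturb normal navigation.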

-- 
Groet, Cordialement, Pozdrawiam, Regards,

Kees Nuyt



Re: [fossil-users] Help improve bot exclusion

2012-10-30 Thread Kees Nuyt
On Tue, 30 Oct 2012 10:13:47 -0500, Nolan Darilek no...@thewordnerd.info wrote:

 And, most importantly, don't sacrifice accessibility in the name of 
 excluding bots. Mouseover links are notoriously inaccessible. Same with 
 only adding href on focus via JS rather than on page load. If I tab 
 through a page, that would seem to break keyboard navigation.

I agree.
I should have been more explicit: run the script when body gets focus,
not per hyperlink.
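
A hypothetical sketch of that idea (not Fossil's actual script; the
data-href attribute name is assumed): fill in every href the first time
the window gains focus, rather than per-link, so keyboard navigation
still works once the page is active:

    // Run once, when the window first gains focus; a scripted walker
    // that never focuses the page never gets working hrefs.
    function enableLinks() {
      var links = document.getElementsByTagName("a");
      for (var i = 0; i < links.length; i++) {
        var h = links[i].getAttribute("data-href");  // assumed attribute
        if (h) links[i].href = h;
      }
      window.onfocus = null;  // don't run again
    }
    window.onfocus = enableLinks;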

-- 
Groet, Cordialement, Pozdrawiam, Regards,

Kees Nuyt



Re: [fossil-users] Help improve bot exclusion

2012-10-30 Thread Steve Havelka
My guess is that you don't really want to filter out bots, specifically,
but really anyone who's attempting to hit every link Fossil makes--that
is to say, it's the behavior that we're trying to stop here, not the actor.

I suppose what I'd do is set up a mechanism to detect when the remote
user is pulling down data too quickly to be a non-abusive human, and
when Fossil detects that, send back a blank "Whoa, nellie!  Slow down,
human!" page for a minute or five.

I'd allow the user to configure two thresholds: the number of pages per
second that triggers this, and the number of times within a five-minute
window that the pages-per-second threshold may be exceeded.  I'd give
them defaults of 3 pages per second and 3 times in five minutes.  So,
for example, if a user hits 3 links in one second, which can happen if
you know exactly where you're going and the repository loads quickly,
it's OK the first time, even the second, but the third time, it locks
you out of the web interface for a little while.
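
A minimal sketch of that two-threshold check, in JavaScript for
illustration only (names, defaults, and structure are hypothetical;
Fossil itself is written in C):

    // Hypothetical two-threshold limiter: pagesPerSec hits within one
    // second counts as a burst; the maxBursts-th burst inside a
    // five-minute window locks the client out.
    function makeRateLimiter(pagesPerSec, maxBursts, windowMs) {
      pagesPerSec = pagesPerSec || 3;
      maxBursts = maxBursts || 3;
      windowMs = windowMs || 5 * 60 * 1000;
      var clients = {};  // remote address -> {hits: [...], bursts: [...]}
      return function allow(addr, now) {
        now = now || Date.now();
        var c = clients[addr] || (clients[addr] = { hits: [], bursts: [] });
        // Keep only hits from the last second, bursts from the last window.
        c.hits = c.hits.filter(function (t) { return now - t < 1000; });
        c.bursts = c.bursts.filter(function (t) { return now - t < windowMs; });
        c.hits.push(now);
        // Record one burst each time the per-second threshold is crossed.
        if (c.hits.length === pagesPerSec) c.bursts.push(now);
        return c.bursts.length < maxBursts;  // false => serve slow-down page
      };
    }

Usage would be something like: var allow = makeRateLimiter(); and on each
request, if (!allow(remoteAddress)) serve the slow-down page instead.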

Command-line stuff, like cloning/push/pull actions, ought to remain
accessible under all circumstances, regardless of the activity on the
web UI.

What do you think?



On 10/30/2012 03:17 AM, Richard Hipp wrote:
 A Fossil website for a project with a few thousand check-ins can have
 a lot of hyperlinks.  If a spider or bot starts to walk that site, it
 will visit literally hundreds of thousands or perhaps millions of
 pages, many of which, like vdiff and annotate, are computationally
 expensive to generate, or, like zip and tarball, give multi-megabyte
 replies.  If you get a lot of bots walking a Fossil site, it can
 really load down the CPU and run up bandwidth charges.

 To prevent this, Fossil uses bot-exclusion techniques.  First it
 looks at the USER_AGENT string in the HTTP header and uses that to
 distinguish bots from humans.  Of course, a USER_AGENT string is
 easily forged, but most bots are honest about who they are, so this is
 a good initial filter.  (The undocumented "fossil test-ishuman"
 command can be used to experiment with this bot discriminator.)

 The second line of defense is that hyperlinks are disabled in the
 transmitted HTML.  There is no href= attribute on the <a> tags.  The
 href= attributes are added by javascript code that runs after the page
 has been loaded.  The idea here is that a bot can easily forge a
 USER_AGENT string, but running javascript code is a bit more work, and
 even malicious bots don't normally go to that kind of trouble.
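
A minimal sketch of that scheme (illustrative only; Fossil's actual
markup and script differ, and the data-href attribute name is an
assumption):

    <!-- Shipped without href=; the script fills them in after load. -->
    <a data-href="/timeline">Timeline</a>
    <script>
    window.onload = function () {
      var links = document.getElementsByTagName("a");
      for (var i = 0; i < links.length; i++) {
        var h = links[i].getAttribute("data-href");
        if (h) links[i].href = h;  // bots that skip javascript see no links
      }
    };
    </script>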

 So, then, to walk a Fossil website, an agent has to (1) present a
 USER_AGENT string from a known friendly web browser and (2) interpret
 Javascript.

 This two-phase defense against bots is usually effective.  But last
 night, a couple of bots got through on the SQLite website.  No great
 damage was done as we have ample bandwidth and CPU reserves to handle
 this sort of thing.  Even so, I'd like to understand how they got
 through so that I might improve Fossil's defenses.

 The first run on the SQLite website originated in Chantilly, VA and
 gave a USER_AGENT string as follows:

     Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64;
 Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR
 3.0.30729; Media Center PC 6.0; .NET4.0C; WebMoney Advisor; MS-RTC LM 8)

 The second run came from Berlin and gives this USER_AGENT:

 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

 Both sessions started out innocently.  The logs suggest that there
 really was a human operator initially.  But then after about 3 minutes
 of normal browsing, each session starts downloading every hyperlink
 in sight at a rate of about 5 to 10 pages per second.  It is as if the
 user had pressed a "Download Entire Website" button on their browser.
 Question:  Is there such a button in IE?

 Another question:  Are significant numbers of people still using IE6
 and IE7?  Could we simply change Fossil to consider IE prior to
 version 8 to be a bot, and hence not display any hyperlinks until the
 user has logged in?

 Yet another question:  Is there any other software on Windows that I
 am not aware of that might be causing the above behaviors?  Are there
 plug-ins or other tools for IE that will walk a website and download
 all its content?

 Finally: Do you have any further ideas on how to defend a Fossil
 website against runs such as the two we observed on SQLite last night?

 Tnx for the feedback
 -- 
 D. Richard Hipp
 d...@sqlite.org


