Re: [fossil-users] Help improve bot exclusion
On Tue, 30 Oct 2012 06:17:05 -0400, Richard Hipp d...@sqlite.org wrote:

> This two-phase defense against bots is usually effective. But last
> night, a couple of bots got through on the SQLite website. No great
> damage was done, as we have ample bandwidth and CPU reserves to handle
> this sort of thing. Even so, I'd like to understand how they got
> through so that I might improve Fossil's defenses.
>
> The first run on the SQLite website originated in Chantilly, VA and
> gave a USER_AGENT string as follows:
>
>   Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64;
>   Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729;
>   .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C;
>   WebMoney Advisor; MS-RTC LM 8)
>
> The second run came from Berlin and gives this USER_AGENT:
>
>   Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
>
> Both sessions started out innocently. The logs suggest that there
> really was a human operator initially. But then, after about 3 minutes
> of normal browsing, each session starts downloading every hyperlink in
> sight at a rate of about 5 to 10 pages per second. It is as if the
> user had pressed a "Download Entire Website" button on their browser.
> Question: Is there such a button in IE?

I just tried it: you can save a URL as a single web page or as a web
archive (extension .mht, an MHTML file). So it seems quite possible,
and it appears to be the default when using "Save As". This was with
IE 8.

Regards,
Arjen
Re: [fossil-users] Help improve bot exclusion
On Tue, Oct 30, 2012 at 06:17:05AM -0400, Richard Hipp wrote:

> Finally: Do you have any further ideas on how to defend a Fossil
> website against runs such as the two we observed on SQLite last night?

This problem affects almost any web software, and I think that job is
delegated to robots.txt. Isn't this approach good enough?

And in the particular case of the fossil standalone server, it could
serve a robots.txt itself. How do programs like 'viewcvs' or 'viewsvn'
deal with that?

Regards,
Lluís.
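For concreteness, a sketch of what such a built-in robots.txt might
look like, steering well-behaved crawlers away from the computationally
expensive pages mentioned in this thread. The page names are Fossil's,
but the policy shown is only an illustration:

    # Hypothetical robots.txt emitted by a standalone "fossil server":
    User-agent: *
    Disallow: /vdiff
    Disallow: /annotate
    Disallow: /zip
    Disallow: /tarball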
Re: [fossil-users] Help improve bot exclusion
On Tue, 30 Oct 2012 06:17:05 -0400, Richard Hipp d...@sqlite.org wrote:

[...]
> Both sessions started out innocently. The logs suggest that there
> really was a human operator initially. But then after about 3 minutes
> of normal browsing, each session starts downloading every hyperlink in
> sight at a rate of about 5 to 10 pages per second. It is as if the
> user had pressed a "Download Entire Website" button on their browser.
> Question: Is there such a button in IE?

No, just "Save Page As". It will not follow hyperlinks; it only saves
the HTML and embedded resources, like images.

> Another question: Are significant numbers of people still using IE6
> and IE7? Could we simply change Fossil to consider IE prior to
> version 8 to be a bot, and hence not display any hyperlinks until the
> user has logged in?

I don't think it would help much. Newer versions will potentially run
the same add-ons. By the way, over 5% of users are still on these older
versions:
http://stats.wikimedia.org/archive/squid_reports/2012-09/SquidReportClients.htm

> Yet another question: Is there any other software on Windows that I am
> not aware of that might be causing the above behaviors? Are there
> plug-ins or other tools for IE that will walk a website and download
> all its content?

There are several browser add-ons that will try to walk complete
websites, e.g.:
http://www.winappslist.com/download_managers.htm
http://www.unixdaemon.net/ie-plugins.html
One can also think of validator tools. Standalone programs usually will
not run javascript.

> Finally: Do you have any further ideas on how to defend a Fossil
> website against runs such as the two we observed on SQLite last night?

Perhaps the href javascript should run onfocus, rather than onload?
(untested)

Other defenses could use DoS-defense techniques, like not honouring (or
aggressively delaying responses to) more than a certain number of
requests within a certain time. That is not so nice, because the server
would have to maintain (more) session state.

Sidenote: as far as I can tell, several modern browsers have a
read-ahead option that will try to load more pages of the site before a
link is clicked:
https://developers.google.com/chrome/whitepapers/prerender
Those will not walk a whole site, though.

--
Groet, Cordialement, Pozdrawiam, Regards,
Kees Nuyt
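To make the delaying idea concrete, a minimal sketch, assuming a
Node-style standalone HTTP server; the one-minute budget and all names
are invented for illustration, and Fossil itself (a C program) does not
work this way:

    // Hypothetical tarpit: count each client's requests over the last
    // minute and delay the reply by one extra second for every request
    // over budget. Humans stay under budget; crawlers slow to a crawl.
    // (Note the per-client state the server must keep, as noted above.)
    const http = require("http");

    const REQUESTS_PER_MINUTE = 60;   // assumed per-client budget
    const hits = new Map();           // client IP -> recent hit times

    http.createServer((req, res) => {
      const ip = req.socket.remoteAddress;
      const now = Date.now();
      const recent = (hits.get(ip) || []).filter(t => now - t < 60000);
      recent.push(now);
      hits.set(ip, recent);

      const excess = Math.max(0, recent.length - REQUESTS_PER_MINUTE);
      setTimeout(() => {
        res.end("page body here\n");
      }, excess * 1000);
    }).listen(8080);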
Re: [fossil-users] Help improve bot exclusion
On Tue, Oct 30, 2012 at 6:23 AM, Lluís Batlle i Rossell
vi...@viric.name wrote:

> On Tue, Oct 30, 2012 at 06:17:05AM -0400, Richard Hipp wrote:
> > Finally: Do you have any further ideas on how to defend a Fossil
> > website against runs such as the two we observed on SQLite last
> > night?
>
> This problem affects almost any web software, and I think that job is
> delegated to robots.txt. Isn't this approach good enough?

Robots.txt only works over an entire domain. If your Fossil server is
running as CGI within that domain, you can manually modify your
robots.txt file to exclude all or part of the Fossil URI space. But as
that file is not under the control of Fossil, you have to make this
configuration yourself; Fossil cannot help you. This burden can become
acute when you are managing many dozens or even hundreds of Fossil
repositories. An automatic system is better.

> And in the particular case of the fossil standalone server, it could
> serve a robots.txt. How do programs like 'viewcvs' or 'viewsvn' deal
> with that?
>
> Regards,
> Lluís.

--
D. Richard Hipp
d...@sqlite.org
Re: [fossil-users] Help improve bot exclusion
On Tuesday, 30 October 2012, at 08:20:14, Richard Hipp wrote:

> On Tue, Oct 30, 2012 at 6:23 AM, Lluís Batlle i Rossell
> vi...@viric.name wrote:
> > On Tue, Oct 30, 2012 at 06:17:05AM -0400, Richard Hipp wrote:
> > > Finally: Do you have any further ideas on how to defend a Fossil
> > > website against runs such as the two we observed on SQLite last
> > > night?
> >
> > This problem affects almost any web software, and I think that job
> > is delegated to robots.txt. Isn't this approach good enough?
>
> Robots.txt only works over an entire domain. If your Fossil server is
> running as CGI within that domain, you can manually modify your
> robots.txt file to exclude all or part of the Fossil URI space. But
> as that file is not under the control of Fossil, you have to make
> this configuration yourself; Fossil cannot help you. This burden can
> become acute when you are managing many dozens or even hundreds of
> Fossil repositories. An automatic system is better.

The search-engine crawlers do honor the robots meta-tag:
http://www.robotstxt.org/meta.html

Adding this is a piece of cake (just change the page template), but it
doesn't help against malware.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://bernd-paysan.de/
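For reference, the tag that page describes; a line like the following
in the <head> of Fossil's page template would do it, though as noted it
only restrains well-behaved crawlers:

    <!-- Ask compliant crawlers not to index this page or follow
         its hyperlinks: -->
    <meta name="robots" content="noindex, nofollow">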
Re: [fossil-users] Help improve bot exclusion
On Tue, 30 Oct 2012 06:17:05 -0400, Richard Hipp d...@sqlite.org wrote:

> Finally: Do you have any further ideas on how to defend a Fossil
> website against runs such as the two we observed on SQLite last night?

Another suggestion: include a (mostly invisible, perhaps hard to
recognize) logout hyperlink on every page that immediately invalidates
the session if it is followed. Users will not see it and not be
bothered by it; bots will stumble upon it.

--
Groet, Cordialement, Pozdrawiam, Regards,
Kees Nuyt
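A sketch of such a trap link, with an invented URL (Fossil has no such
page). It is hidden from human readers and, via aria-hidden, from
screen readers, but a crawler harvesting every href from the raw HTML
will still follow it:

    <!-- Hypothetical honeypot: a GET of /bot-trap immediately
         invalidates the current session server-side. display:none and
         tabindex=-1 keep it away from sighted and keyboard users;
         dumb crawlers never apply the CSS and follow it anyway. -->
    <a href="/bot-trap" style="display:none" tabindex="-1"
       rel="nofollow" aria-hidden="true">do not follow this link</a>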
Re: [fossil-users] Help improve bot exclusion
On Tue, 30 Oct 2012 10:13:47 -0500, Nolan Darilek
no...@thewordnerd.info wrote:

> And, most importantly, don't sacrifice accessibility in the name of
> excluding bots. Mouseover links are notoriously inaccessible. Same
> with only adding href on focus via JS rather than on page load. If I
> tab through a page, that would seem to break keyboard navigation.

I agree. I should have been more explicit: run the script when the body
gets focus, not per hyperlink.

--
Groet, Cordialement, Pozdrawiam, Regards,
Kees Nuyt
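A minimal, untested sketch of that refinement; the data-href attribute
and the function name are inventions for illustration, not Fossil's
actual markup:

    // Anchors are served with the target stashed in data-href instead
    // of href. One script promotes them all, the first time the window
    // gains focus or sees keyboard/mouse activity, i.e. once a human
    // is demonstrably interacting with the page.
    function enableHyperlinks() {
      var anchors = document.getElementsByTagName("a");
      for (var i = 0; i < anchors.length; i++) {
        var target = anchors[i].getAttribute("data-href");
        if (target) anchors[i].setAttribute("href", target);
      }
      // Run once, then detach.
      window.onfocus = document.onkeydown = document.onmousemove = null;
    }
    window.onfocus = document.onkeydown = document.onmousemove =
        enableHyperlinks;

Because the first keystroke or mouse movement enables every link at
once, tabbing through the page still works, which answers the
keyboard-navigation concern above.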
Re: [fossil-users] Help improve bot exclusion
My guess is that you don't really want to filter out bots specifically,
but rather anyone who's attempting to hit every link Fossil makes; that
is to say, it's the behavior we're trying to stop here, not the actor.

I suppose what I'd do is set up a mechanism to detect when the remote
user is pulling down data too quickly to be a non-abusive human, and
when Fossil detects that, send back a blank "Whoa, nellie! Slow down,
human!" page for a minute or five. I'd allow the user to configure two
thresholds: the number of pages per second that triggers this, and the
number of times within a five-minute window that the pages-per-second
threshold may be exceeded. I'd give them defaults of 3 pages per second
and 3 times in five minutes. So, for example, if a user hits 3 links in
one second, which can happen if you know exactly where you're going and
the repository loads quickly, it's OK the first time, even the second,
but the third time it locks you out of the web interface for a little
while. (A sketch of this logic follows at the end of this message.)
Command-line stuff, like clone/push/pull actions, ought to remain
accessible under all circumstances, regardless of the activity on the
web UI.

What do you think?

On 10/30/2012 03:17 AM, Richard Hipp wrote:

> A Fossil website for a project with a few thousand check-ins can have
> a lot of hyperlinks. If a spider or bot starts to walk that site, it
> will visit literally hundreds of thousands or perhaps millions of
> pages, many of which are things like vdiff and annotate, which are
> computationally expensive to generate, or like zip and tarball, which
> give multi-megabyte replies. If you get a lot of bots walking a
> Fossil site, it can really load down the CPU and run up bandwidth
> charges.
>
> To prevent this, Fossil uses bot-exclusion techniques. First it looks
> at the USER_AGENT string in the HTTP header and uses that to
> distinguish bots from humans. Of course, a USER_AGENT string is
> easily forged, but most bots are honest about who they are, so this
> is a good initial filter. (The undocumented "fossil test-ishuman"
> command can be used to experiment with this bot discriminator.)
>
> The second line of defense is that hyperlinks are disabled in the
> transmitted HTML. There is no href= attribute on the <a> tags. The
> href= attributes are added by javascript code that runs after the
> page has been loaded. The idea here is that a bot can easily forge a
> USER_AGENT string, but running javascript code is a bit more work,
> and even malicious bots don't normally go to that kind of trouble.
> So, then, to walk a Fossil website, an agent has to (1) present a
> USER_AGENT string from a known friendly web browser and (2) interpret
> Javascript.
>
> This two-phase defense against bots is usually effective. But last
> night, a couple of bots got through on the SQLite website. No great
> damage was done, as we have ample bandwidth and CPU reserves to
> handle this sort of thing. Even so, I'd like to understand how they
> got through so that I might improve Fossil's defenses.
>
> The first run on the SQLite website originated in Chantilly, VA and
> gave a USER_AGENT string as follows:
>
>   Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64;
>   Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729;
>   .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C;
>   WebMoney Advisor; MS-RTC LM 8)
>
> The second run came from Berlin and gives this USER_AGENT:
>
>   Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
>
> Both sessions started out innocently. The logs suggest that there
> really was a human operator initially. But then after about 3 minutes
> of normal browsing, each session starts downloading every hyperlink
> in sight at a rate of about 5 to 10 pages per second. It is as if the
> user had pressed a "Download Entire Website" button on their browser.
> Question: Is there such a button in IE?
>
> Another question: Are significant numbers of people still using IE6
> and IE7? Could we simply change Fossil to consider IE prior to
> version 8 to be a bot, and hence not display any hyperlinks until the
> user has logged in?
>
> Yet another question: Is there any other software on Windows that I
> am not aware of that might be causing the above behaviors? Are there
> plug-ins or other tools for IE that will walk a website and download
> all its content?
>
> Finally: Do you have any further ideas on how to defend a Fossil
> website against runs such as the two we observed on SQLite last
> night?
>
> Tnx for the feedback
>
> --
> D. Richard Hipp
> d...@sqlite.org
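The two-threshold throttle proposed above is straightforward to
prototype. A purely illustrative sketch in JavaScript (Fossil itself is
written in C, and every name here is invented):

    // A client earns a strike for each second in which it requests
    // PAGES_PER_SECOND or more pages; the third strike within a
    // five-minute window locks it out of the web UI until the strikes
    // age out. Defaults follow the message above.
    const PAGES_PER_SECOND = 3;
    const STRIKES_ALLOWED  = 3;
    const WINDOW_MS = 5 * 60 * 1000;

    const clients = new Map();  // client IP -> { second, hits, strikes }

    function isLockedOut(ip, now = Date.now()) {
      let c = clients.get(ip);
      if (!c) clients.set(ip, c = { second: -1, hits: 0, strikes: [] });

      // Count hits in the current one-second bucket.
      const second = Math.floor(now / 1000);
      if (c.second !== second) { c.second = second; c.hits = 0; }
      c.hits += 1;

      // Reaching the per-second threshold earns one strike per bucket.
      if (c.hits === PAGES_PER_SECOND) c.strikes.push(now);

      // Strikes expire after five minutes.
      c.strikes = c.strikes.filter(t => now - t < WINDOW_MS);
      return c.strikes.length >= STRIKES_ALLOWED;
    }

    // Usage: in the page handler, skip the expensive work and serve a
    // blank "Whoa, nellie! Slow down, human!" page while isLockedOut()
    // returns true; clone/push/pull endpoints bypass the check.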