Dear Apache developers, This is a suggestion relative to the code of the Apache httpd webserver, and a possible new default option in the standard distribution of Apache httpd. It also touches on WWW security, which is why I felt that it belongs on this list rather than on the general users' list. Please correct me if I am mistaken.
According to Netcraft, there are currently some 600 Million webservers on the WWW, with more than 60% of those identified as "Apache". I currently administer about 25 of these webservers (Apache httpd/Tomcat), not remarkable in any way (business applications for medium-sized companies). In the logs of these servers, every day, there are episodes like the following:

209.212.145.91 - - [03/Apr/2013:00:52:32 +0200] "GET /muieblackcat HTTP/1.1" 404 362 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:36 +0200] "GET //admin/index.php HTTP/1.1" 404 365 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:36 +0200] "GET //admin/pma/index.php HTTP/1.1" 404 369 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:36 +0200] "GET //admin/phpmyadmin/index.php HTTP/1.1" 404 376 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:37 +0200] "GET //db/index.php HTTP/1.1" 404 362 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:37 +0200] "GET //dbadmin/index.php HTTP/1.1" 404 367 "-" "-"
... etc.

Such lines are the telltale trace of a "URL-scanning bot", or of the "URL-scanning" part of a bot, and I am sure that you are all familiar with them. Obviously, these bots are trying to find webservers which exhibit poorly-designed or poorly-configured applications, with the aim of identifying hosts which can be subjected to various kinds of attacks, for various purposes. As far as I can tell from my own unremarkable servers, I would surmise that many or most webservers facing the Internet are subjected to this type of scan every day. Hopefully, most webservers are not really vulnerable to this type of scan. But the fact is that *these scans are happening, every day, on millions of webservers*. And they are at least a nuisance, and at worst a serious security problem when, as a result of poorly configured webservers or applications, they lead to break-ins and compromised systems.
It is basically a numbers game, like malicious email: it costs very little to do, and if even a tiny proportion of webservers exhibit one of these vulnerabilities, then because of the numbers involved it is worth doing. If there are 600 Million webservers, and 50% of them are scanned every day, and 0.01% of these webservers are vulnerable because of one of these URLs, then every day 30,000 (600,000,000 x 0.5 x 0.0001) vulnerable servers will be identified.

About the "cost" aspect: from the data in my own logs, such bots seem to be scanning about 20-30 URLs per pass, at a rate of about 3-4 URLs per second. Since it takes my Apache httpd servers approximately 10 ms on average to respond (with a 404 Not Found) to one of these requests, and they only request 1 URL per 250 ms, I would imagine that these bots have some built-in rate-limiting mechanism, to avoid being "caught" by various webserver-protection tools. Maybe they are also smart, and scan several servers in parallel, so as to limit the rate at which they "burden" any server in particular. (In this rough calculation, I am ignoring network latency for now.)

So if we imagine a smart bot which is scanning 10 servers in parallel, issuing 4 requests per second to each of them, for a total of 20 URLs per server, and we assume that all these requests result in 404 responses with an average response time of 10 ms, then it "costs" this bot only about 2 seconds of aggregate response time (200 requests x 10 ms) to complete the scan of 10 servers. If there are 300 Million servers to scan, then the total cost of scanning all of them, by any number of such bots working cooperatively, is an aggregated 60 Million seconds. And if one such "botnet" has 10,000 bots, that boils down to only 6,000 seconds per bot. It is scary that 50% of all Internet webservers can be scanned for vulnerabilities in less than 2 hours, and that such a scan may result in "harvesting" several thousand hosts as candidates for takeover.
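For what it is worth, the arithmetic above can be checked with a few lines of Python. As in the text, the "cost" counted here is only the aggregate time spent waiting for 404 responses, ignoring network latency and the bots' own request pacing:

```python
# Back-of-the-envelope model of the scan cost described above.
# "Cost" = aggregate seconds spent waiting for 404 responses,
# ignoring network latency and the bots' request pacing.

def scan_cost_seconds(servers, urls_per_server, response_ms):
    return servers * urls_per_server * response_ms / 1000.0

batch = scan_cost_seconds(10, 20, 10)            # one bot, 10 servers: 2.0 s
total = scan_cost_seconds(300_000_000, 20, 10)   # half of 600M servers: 60,000,000 s
per_bot = total / 10_000                         # 10,000 cooperating bots: 6,000 s each

print(batch, total, per_bot)
```

which reproduces the 2 seconds, 60 Million seconds, and 6,000 seconds per bot (under 2 hours) quoted above.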
Now, how about making it so that, without any special configuration or add-on software or skills on the part of webserver administrators, it would cost these same bots *about 100 times as long (several days)* to do their scan? The only cost would be a relatively small change to the Apache webservers, which is what my suggestion consists of: adding a variable delay (say, between 100 ms and 2000 ms) to any 404 response.

The suggestion is based on the observation that there is a dichotomy between this kind of access by bots and the kind of access made by legitimate HTTP users/clients: legitimate users/clients (including the "good bots") mostly access links "which work", so they rarely get "404 Not Found" responses. Malicious URL-scanning bots on the other hand, by the very nature of what they are scanning for, get many "404 Not Found" responses. As a general idea, then, anything which increases the delay to obtain a 404 response should impact these bots much more than it impacts legitimate users/clients.

How much? Let us imagine for a moment that this suggestion is implemented in the Apache webservers, and is enabled in the default configuration. And let's imagine that after a while, 20% of the Apache webservers deployed on the Internet have this feature enabled, and are now delaying any 404 response by an average of 1000 ms. Now let's re-use the numbers above and redo the calculation. The same "botnet" of 10,000 bots is thus still scanning 300 Million webservers, each bot scanning 10 servers at a time for 20 URLs per server. Previously, this took about 6,000 seconds per bot. Now, however, instead of an average delay of 10 ms to obtain a 404 response, in 20% of the cases (60 Million webservers) the bots will experience an average 1000 ms additional delay per URL scanned. Since the delays on the 10 servers scanned in parallel overlap, this adds (60,000,000 / 10 x 20 URLs x 1000 ms) 120,000,000 seconds to the scan. Divided by 10,000 bots, this is 12,000 additional seconds per bot (roughly 3 1/3 hours).
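The same little model, re-run with the proposed delay, again following the assumptions in the text (20% adoption, 1000 ms average delay, delays overlapping across the 10 servers scanned in parallel):

```python
# Extra cost added by the proposed 404 delay, per the scenario above.
delayed_servers = 60_000_000   # 20% of the 300 Million servers scanned
urls_per_server = 20
delay_seconds = 1.0            # average added delay per 404 response
parallel = 10                  # servers scanned in parallel; their delays overlap
bots = 10_000

extra_total = delayed_servers / parallel * urls_per_server * delay_seconds
extra_per_bot = extra_total / bots

print(extra_total, extra_per_bot)  # 120,000,000 s in total, 12,000 s per bot
```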
So with a small change to the code, no add-ons, no special configuration skills on the part of the webserver administrator, no firewalls, no filtering, no need for updates to any list of URLs or bot characteristics, little inconvenience to legitimate users/clients, and only very partial adoption over time, it seems that this scheme could more than double the cost for bots to acquire the same number of targets. Or, seen another way, it could more than halve the number of webservers being scanned every day.

I know that this is a hard sell. The basic idea sounds a bit too simple to be effective. It will not kill the bots, and it will not stop the bots from scanning Internet servers in the other ways that they use. It does not miraculously protect any single server against such scans, and the benefit to any one server of implementing this is diluted over all webservers on the Internet. But it is also not meant as an absolute weapon. It is targeted specifically at a particular type of scan done by a particular type of bot for a particular purpose, and it is just a scheme to make this more expensive for them. It may or may not discourage these bots from continuing with this type of scan (if it does, that would be a very big result). But at the same time, compared to any other kind of tool that can be used against these scans, this one seems really cheap to implement, it does not seem easy to circumvent, and it seems to have at least the potential of bringing big benefits to the WWW at large.

If there are reasonable objections to it, I am quite prepared to accept that, and drop it. I have already floated the idea in a couple of other places, and gotten what could be described as "tepid" responses. But it seems to me that most of the negative-leaning responses which I have received so far were more of the a-priori "it will never work" kind, rather than real objections based on real facts.
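To make the proposed behavior completely concrete, here is a toy sketch in Python. This is emphatically not Apache code (in httpd itself the change would live in the core response path or in a small module), and the path set and delay bounds are just placeholders; it only illustrates the logic:

```python
import random
import time

# Placeholder for the server's real URL space.
KNOWN_PATHS = {"/", "/index.html"}

def handle(path, known_paths=KNOWN_PATHS):
    """Return an HTTP status code, delaying every 404 by a random 100-2000 ms."""
    if path in known_paths:
        return 200                            # existing resource: answer immediately
    time.sleep(random.uniform(0.1, 2.0))      # the proposed variable delay on 404s
    return 404
```

A legitimate client fetching pages that exist never pays the delay; a bot probing //admin/index.php and friends pays it on every miss.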
So my hope here is that someone has the patience to read through this, and would have the additional patience to examine the idea "professionally".