Dear Apache developers, This is a suggestion relative to the code of the Apache httpd webserver, and a possible new default option in the standard distribution of Apache httpd. It also touches on WWW security, which is why I felt that it belongs on this list rather than on the general users' list. Please correct me if I am mistaken.
According to Netcraft, there are currently some 600 Million webservers on the WWW, with more than 60% of those identified as "Apache". I currently administer about 25 of these webservers (Apache httpd/Tomcat), not remarkable in any way (business applications for medium-sized companies). In the logs of these servers, every day, there are episodes like the following:

209.212.145.91 - - [03/Apr/2013:00:52:32 +0200] "GET /muieblackcat HTTP/1.1" 404 362 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:36 +0200] "GET //admin/index.php HTTP/1.1" 404 365 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:36 +0200] "GET //admin/pma/index.php HTTP/1.1" 404 369 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:36 +0200] "GET //admin/phpmyadmin/index.php HTTP/1.1" 404 376 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:37 +0200] "GET //db/index.php HTTP/1.1" 404 362 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:37 +0200] "GET //dbadmin/index.php HTTP/1.1" 404 367 "-" "-"
... etc.

Such lines are the telltale trace of a "URL-scanning bot", or of the "URL-scanning" part of a bot, and I am sure that you are all familiar with them. Obviously, these bots are trying to find webservers which exhibit poorly-designed or poorly-configured applications, with the aim of identifying hosts which can be subjected to various kinds of attacks, for various purposes. As far as I can tell from my own unremarkable servers, I would surmise that many or most webservers facing the Internet are subjected to this type of scan every day. Hopefully, most webservers are not really vulnerable to this type of scan. But the fact is that *these scans are happening, every day, on millions of webservers*. And they are at least a nuisance, and at worst a serious security problem when, as a result of poorly configured webservers or applications, they lead to break-ins and compromised systems.
It is basically a numbers game, like malicious email: it costs very little to do, and if even a tiny proportion of webservers exhibit one of these vulnerabilities, then because of the numbers involved it is worth doing. If there are 600 Million webservers, and 50% of them are scanned every day, and 0.01% of these webservers are vulnerable because of one of these URLs, then every day 30,000 (600,000,000 x 0.5 x 0.0001) vulnerable servers will be identified.

About the "cost" aspect: from the data in my own logs, such bots seem to be scanning about 20-30 URLs per pass, at a rate of about 3-4 URLs per second. Since it takes my Apache httpd servers approximately 10 ms on average to respond (with a 404 Not Found) to one of these requests, and they only request 1 URL per 250 ms, I would imagine that these bots have some built-in rate-limiting mechanism, to avoid being "caught" by various webserver-protection tools. Maybe they are also smart, and scan several servers in parallel, so as to limit the rate at which they "burden" any server in particular. (In this rough calculation, I am ignoring network latency for now.)

So if we imagine a smart bot which is scanning 10 servers in parallel, issuing 4 requests per second to each of them, for a total of 20 URLs per server, and we assume that all these requests result in 404 responses with an average response time of 10 ms, then it "costs" this bot only about 2 seconds of aggregate response time (200 requests x 10 ms) to complete the scan of 10 servers. If there are 300 Million servers to scan, then the total cost of scanning all of them, by any number of such bots working cooperatively, is an aggregated 60 Million seconds. And if one such "botnet" has 10,000 bots, that boils down to only 6,000 seconds per bot. It is scary that 50% of all Internet webservers can be scanned for vulnerabilities in less than 2 hours, and that such a scan may result in "harvesting" several thousand hosts as candidates for takeover.
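For what it is worth, the arithmetic above can be checked with a few lines of Python. As in the text, the "cost" counted here is only the aggregate time spent waiting for 404 responses, ignoring network latency and the bots' own request pacing:

```python
# Back-of-the-envelope model of the scan cost described above.
# "Cost" = aggregate seconds spent waiting for 404 responses,
# ignoring network latency and the bots' request pacing.

def scan_cost_seconds(servers, urls_per_server, response_ms):
    return servers * urls_per_server * response_ms / 1000.0

batch = scan_cost_seconds(10, 20, 10)            # one bot, 10 servers: 2.0 s
total = scan_cost_seconds(300_000_000, 20, 10)   # half of 600M servers: 60,000,000 s
per_bot = total / 10_000                         # 10,000 cooperating bots: 6,000 s each

print(batch, total, per_bot)
```

which reproduces the 2 seconds, 60 Million seconds, and 6,000 seconds per bot (under 2 hours) quoted above.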
Now, how about making it so that, without any special configuration or add-on software or skills on the part of webserver administrators, it would cost these same bots *about 100 times as long (several days)* to do their scan? The only cost would be a relatively small change to the Apache webservers, which is what my suggestion consists of: adding a variable delay (say, between 100 ms and 2000 ms) to any 404 response.

The suggestion is based on the observation that there is a dichotomy between this kind of access by bots and the kind of access made by legitimate HTTP users/clients: legitimate users/clients (including the "good bots") mostly access links "which work", so they rarely get "404 Not Found" responses. Malicious URL-scanning bots on the other hand, by the very nature of what they are scanning for, get many "404 Not Found" responses. As a general idea, then, anything which increases the delay to obtain a 404 response should impact these bots much more than it impacts legitimate users/clients.

How much? Let us imagine for a moment that this suggestion is implemented in the Apache webservers, and is enabled in the default configuration. And let's imagine that after a while, 20% of the Apache webservers deployed on the Internet have this feature enabled, and are now delaying any 404 response by an average of 1000 ms. Now let's re-use the numbers above and redo the calculation. The same "botnet" of 10,000 bots is thus still scanning 300 Million webservers, each bot scanning 10 servers at a time for 20 URLs per server. Previously, this took about 6,000 seconds per bot. Now, however, instead of an average delay of 10 ms to obtain a 404 response, in 20% of the cases (60 Million webservers) the bots will experience an average 1000 ms additional delay per URL scanned. Since the delays on the 10 servers scanned in parallel overlap, this adds (60,000,000 / 10 x 20 URLs x 1000 ms) 120,000,000 seconds to the scan. Divided by 10,000 bots, this is 12,000 additional seconds per bot (roughly 3 1/3 hours).
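The same little model, re-run with the proposed delay, again following the assumptions in the text (20% adoption, 1000 ms average delay, delays overlapping across the 10 servers scanned in parallel):

```python
# Extra cost added by the proposed 404 delay, per the scenario above.
delayed_servers = 60_000_000   # 20% of the 300 Million servers scanned
urls_per_server = 20
delay_seconds = 1.0            # average added delay per 404 response
parallel = 10                  # servers scanned in parallel; their delays overlap
bots = 10_000

extra_total = delayed_servers / parallel * urls_per_server * delay_seconds
extra_per_bot = extra_total / bots

print(extra_total, extra_per_bot)  # 120,000,000 s in total, 12,000 s per bot
```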
So with a small change to the code, no add-ons, no special configuration skills on the part of the webserver administrator, no firewalls, no filtering, no need for updates to any list of URLs or bot characteristics, little inconvenience to legitimate users/clients, and only very partial adoption over time, it seems that this scheme could more than double the cost for bots to acquire the same number of targets. Or, seen another way, it could more than halve the number of webservers being scanned every day.

I know that this is a hard sell. The basic idea sounds a bit too simple to be effective. It will not kill the bots, and it will not stop the bots from scanning Internet servers in the other ways that they use. It does not miraculously protect any single server against such scans, and the benefit to any one server of implementing this is diluted over all webservers on the Internet. But it is also not meant as an absolute weapon. It is targeted specifically at a particular type of scan done by a particular type of bot for a particular purpose, and it is just a scheme to make this more expensive for them. It may or may not discourage these bots from continuing with this type of scan (if it does, that would be a very big result). But at the same time, compared to any other kind of tool that can be used against these scans, this one seems really cheap to implement, it does not seem easy to circumvent, and it seems to have at least the potential of bringing big benefits to the WWW at large.

If there are reasonable objections to it, I am quite prepared to accept that, and drop it. I have already floated the idea in a couple of other places, and gotten what could be described as "tepid" responses. But it seems to me that most of the negative-leaning responses which I have received so far were more of the a-priori "it will never work" kind, rather than real objections based on real facts.
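To make the proposed behavior completely concrete, here is a toy sketch in Python. This is emphatically not Apache code (in httpd itself the change would live in the core response path or in a small module), and the path set and delay bounds are just placeholders; it only illustrates the logic:

```python
import random
import time

# Placeholder for the server's real URL space.
KNOWN_PATHS = {"/", "/index.html"}

def handle(path, known_paths=KNOWN_PATHS):
    """Return an HTTP status code, delaying every 404 by a random 100-2000 ms."""
    if path in known_paths:
        return 200                            # existing resource: answer immediately
    time.sleep(random.uniform(0.1, 2.0))      # the proposed variable delay on 404s
    return 404
```

A legitimate client fetching pages that exist never pays the delay; a bot probing //admin/index.php and friends pays it on every miss.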
So my hope here is that someone has the patience to read through this, and would have the additional patience to examine the idea "professionally".