Ben Reser wrote:
On Tue, Apr 30, 2013 at 3:03 AM, André Warnier <a...@ice-sa.com> wrote:
Let us imagine for a moment that this suggestion is implemented in the Apache webservers, and is enabled in the default configuration. And let's imagine that after a while, 20% of the Apache webservers deployed on the Internet have this feature enabled, and are now delaying any 404 response by an average of 1000 ms.

And let's re-use the numbers above, and redo the calculation.

The same "botnet" of 10,000 bots is thus still scanning 300 Million webservers, each bot scanning 10 servers at a time for 20 URLs per server. Previously, this took about 6000 seconds. However now, instead of an average delay of 10 ms to obtain a 404 response, in 20% of the cases (60 Million webservers) they will experience an average 1000 ms additional delay per URL scanned.

This adds (60,000,000 / 10 * 20 URLs * 1000 ms) 120,000,000 seconds to the scan. Divided by 10,000 bots, this is 12,000 additional seconds per bot (roughly 3 hours 20 minutes).
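(For reference, the arithmetic above can be reproduced with a short Python sketch; the figures are the ones assumed in the calculation, not measurements.)

    # Reproduces the back-of-the-envelope calculation above.
    servers_total = 300_000_000       # webservers being scanned
    affected_fraction = 0.20          # 20% of them delay their 404 responses
    urls_per_server = 20
    parallel_servers_per_bot = 10     # each bot scans 10 servers at a time
    extra_delay_s = 1.0               # average additional delay per 404
    bots = 10_000

    affected_servers = servers_total * affected_fraction
    # Extra time added to the whole scan by the delayed 404 responses:
    extra_total_s = affected_servers / parallel_servers_per_bot * urls_per_server * extra_delay_s
    extra_per_bot_s = extra_total_s / bots

    print(extra_total_s)    # 120,000,000 seconds in total
    print(extra_per_bot_s)  # 12,000 seconds (about 3 h 20 min) per bot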

Let's assume that such a feature gets added; however, it's not likely to be enabled by default. There are quite a few places that serve a lot of legitimate soft 404s, for reasons that I'm not going to bother to get into here.

Could you actually give an example of such a "legitimate" use-case?
(I am not saying that you are wrong; it's just that I genuinely cannot think of such a case.)

One comment apart from that: if there are indeed such sites, I would imagine that they are the kind that is professionally managed, and in that case it would not be difficult for the administrator to disable (or tune) the feature.


Any site that goes to the trouble of enabling such a feature is probably not going to be a site that is vulnerable to what these scanners are looking for. So if I were a bot writer, I'd wait for some amount of time and, if I didn't have a response, I'd move on. I'd also not just move along to the next scan on your web server; I'd probably just move on to a different host. If nothing else, a server that responds to requests slowly is not likely to be interesting to me.

As a result I'd say your suggestion, if wildly practiced, actually helps the scanners rather than hurting them, because they can identify hosts that are unlikely to be worth their time scanning with a single request.


Assuming that you meant "widely"...

Allow me to reply to that (worthy) objection:

In the simple calculations which I gave initially, I omitted the impact of network latency, and I used a single figure of 10 ms to estimate the average response time of a server (for a 404 response).

According to my own experiments, the average network latency to reach Internet servers (even with standard pings) is on the order of at least 50 ms, and that is for well-connected servers. So from the bot client's point of view, you would have to add at least 50 ms on average to the basic server response time for each request.

On the other hand, let me digress a bit to introduce the rest of the answer.

My professional specialty is information management, and many of my customers have databases containing URL links to reference pages on the WWW, which they maintain and provide to their own internal users. From time to time we need to go through their databases and verify that the links which they have stored are still current. So for these customers we regularly run programs of the "URL checker" type. These are in a way similar to URL-scanning bots, except that they target a longer list of URLs (usually several hundred or several thousand), usually distributed over many servers, and these are real URLs that work (or worked at some point in time).

So anyway, these programs try a long list of WWW URLs and check the type of response that they get: if they get a 200, then the link is OK; if they get almost anything else, then the link is flagged as "dubious" in the database, for further manual inspection. Since the program needs to scan many URLs in a reasonable time, it has to use a timeout for each URL that it is trying to check. For example, it will issue a request to a server, and if it does not receive a response within (say) 5 seconds, it gives up and flags the link as dubious.

Over many runs of these programs, I have noticed that if I set this timeout much below 5 seconds (say 2 seconds), then I get on the order of 30% or more "false dubious" links. In reality most of these are working links; it just so happens that many servers occasionally do not respond faster than 2 seconds. (And if I re-run the same program with the same parameters immediately afterward, I will again get about 30% slow links, and many of them will be different from the previous run.) Obviously I cannot use such a short timeout, because it would mean that my customer has to check hundreds of URLs by hand afterward. So on average the timeout is set at 5 seconds, a value obtained empirically after many, many runs.
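To make this more concrete, here is a minimal sketch (Python, standard library only) of the kind of "URL checker" logic I am describing; the URL list is hypothetical, and the real programs are of course more elaborate:

    # Try each URL with a per-request timeout; treat 200 as "ok" and
    # anything else (error status, timeout, connection failure) as "dubious".
    import urllib.request
    import urllib.error

    TIMEOUT_S = 5  # empirically, much below 5 s yields ~30% false "dubious" links

    def check_url(url):
        try:
            with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
                return "ok" if resp.status == 200 else "dubious"
        except (urllib.error.URLError, OSError):
            # HTTP errors (4xx/5xx), timeouts and DNS failures all end up here
            return "dubious"

    if __name__ == "__main__":
        # hypothetical list; in practice this comes from the customer's database
        links = ["http://example.com/", "http://example.com/some/old/reference"]
        for url in links:
            print(url, check_url(url))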

What I am leading to is this: if the time by which each 404 response is delayed is randomly variable, for example between 50 ms and 2 seconds, then it is very difficult for a bot to determine whether this is a "normal" delay just due to the load on the webserver at that particular time, or whether it is deliberate, or whether this server is just slow in general. And if the bot gets a first response which is fast (or slow), that doesn't really say anything about how fast or slow the next response will be.
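As an illustration of the idea (this is not the proposed Apache implementation, just a toy standalone server), the randomized 404 delay could look roughly like this:

    # Toy HTTP server that delays every 404 response by a random amount
    # between 50 ms and 2 s, so response time alone tells a scanner very little.
    import random
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    KNOWN_PATHS = {"/": b"hello\n"}  # hypothetical "real" content

    class DelayedNotFoundHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = KNOWN_PATHS.get(self.path)
            if body is None:
                time.sleep(random.uniform(0.05, 2.0))  # the randomized 404 delay
                self.send_error(404)
                return
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), DelayedNotFoundHandler).serve_forever()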

That's what I meant when I stated that this scheme would be hard for a bot to circumvent. I am not saying that it is impossible, but any way around it would need at least a certain level of sophistication, which again raises the cost.


And now, facetiously again: if what you wrote above about bots detecting this anyway and consequently avoiding my websites were correct, then I would be very happy too, since I would have found a very simple way to make the bots avoid my servers.
