Ben Reser wrote:
> On Tue, Apr 30, 2013 at 5:23 PM, André Warnier <a...@ice-sa.com> wrote:
>> Alternatives:
>> 1) if you were running such a site (which I would still suppose is a
>> minority of the 600 million websites which exist), you could easily disable
>> the feature.
>> 2) you could instead return a redirect response, to a page saying "that one
>> was sold, but look at these".
>> That may be even more friendly to search engines, and to customers.
>
> My point isn't that there aren't alternatives, but that 404's are
> legitimate responses that legitimate users can be expected to receive.

I agree that 404's are legitimate responses.
And I agree that legitimate clients/users can expect to receive them.
But if they receive them when appropriate, merely slower than other kinds of responses, that is not really "breaking the rules". Nothing in the HTTP RFCs promises a response within a certain timeframe for any kind of HTTP request, and being slow to respond - to any kind of request - is a perfectly normal occurrence on the WWW.

To push the reasoning a bit further: imagine that hardware and software were still as they were 10 years ago (with CPUs running at 200-300 MHz, say). Back then, no webserver would ever have returned a 404 faster than the kind of delay I am talking about, yet that did not invalidate any well-built web application. Any legitimate web application which relied on a particular response time for a 404 response would seem to me very badly designed. Wouldn't it to you?

> As such you'll find it nearly impossible in my opinion to convince
> people to degrade performance for them as a default.

Of course, I'm having some difficulties convincing people. What do you think I'm doing right now? ;-)

But here also, I would like to state things in a more nuanced way. First you'd have to define "degrading performance". For whom? Part of my argument is that this could be implemented so as not to degrade the performance of the webserver very much, or at all (see (*) below). Another part is that the client-side delay would only really impact clients which, by the nature of what they are doing, can indeed be expected to receive a lot of 404 responses. I believe that this includes most URL-scanning bots, but extremely few legitimate clients/users. I cannot prove that, but it seems to me a reasonable assumption. (**)

Then, a more general comment on "degrading performance": the current situation is that URL-scanning is degrading performance for everyone on the whole WWW. These URL-scanning bots serve no useful purpose, except to the criminals who run them. Yet today they use up a significant portion of webserver resources and of general WWW bandwidth, which results in degraded performance for everyone, every day. You may not realise it, but every day you are paying a tax because of these URL-scans: whether you run a server or a client, you are paying for CPU cycles and bandwidth which you are not using yourself, or you are paying for a filter to block them out. Is this something that you feel you just have to accept, without trying to do anything about it?


> If it isn't a
> default you're hardly any better off than you are today since it will
> not be widely deployed.

I agree entirely. Having this deployed as the default option is a vital part of the plan. Otherwise, it would suffer from the same ills as all the very good methods of webserver protection which already exist: they require special resources to deploy, and in consequence they are not widely deployed, despite being available and effective.

For a more philosophical response, see (***) below.


> If you want to see a case where server behavior has been tweaked in
> order to combat miscreants go take a look at SMTP.  SMTP is no longer
> simple, largely because of the various schemes people have undertaken
> to stop spam.  Despite all these schemes, spam still exists and the
> only effective counters have been:
> 1) Securing open-relays.
> 2) Removing the bot-nets that are sending the spam.
> 3) Ultimately improving the security of the vulnerable systems that
> are sending the spam.
>
> All the effort towards black lists, SPF, domainkeys, etc... has been
> IMHO a waste of time.  At best it has been a temporary road block.


In a way, you are providing more arguments in favor of this 404-delay scheme.
What are the main reasons why many attempts at blocking spam were not very successful?
I would argue that it was because
a) they were complicated and costly to implement
b) they did not tackle the problem close enough to the source
c) they relied on "a posteriori" information, such as black-lists (before you can black-list some IP, you first have to have received spam from it, and then you can start broadcasting this IP, and then it takes a while for the recipients to receive and implement the information, by which time it is already obsolete)

The scheme which I propose would avoid some of these pitfalls, by
- being very easy to implement (in fact, apart from the original work needed to incorporate the scheme in the webserver code once, it would not require any additional effort by anyone)
- attacking the botnets which do the URL-scanning (admittedly a very small part of their activities, but that is the target of this proposal) as close to the source as possible, by making that very activity unprofitable in a global sense
- not relying on any prior knowledge of the bots, nor requiring any information to be either maintained or accessed. The response is a 404? Add the delay (or not, if you have some valid reason not to).

How much simpler can a scheme be?
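
Just to make the "how" concrete, below is a minimal sketch of what a default implementation could look like, written as a small Apache 2.x output filter module. Everything in it is illustrative and untested (the name mod_delay404, the fixed 2-second delay, the absence of any configuration directive), it glosses over details such as error-document internal redirects, and it does the naive thing of sleeping in the worker that handles the request, which is precisely the inefficiency that (*) below suggests avoiding. It is only meant to show how little logic the default itself requires.

/* mod_delay404.c - illustrative sketch only, not a patch.
 * Registers an output filter which, when the response going out is a 404,
 * sleeps for a fixed interval before letting the data through. */
#include "httpd.h"
#include "http_config.h"
#include "http_protocol.h"
#include "util_filter.h"
#include "apr_time.h"

static const char delay404_filter_name[] = "DELAY404";

static apr_status_t delay404_out_filter(ap_filter_t *f, apr_bucket_brigade *bb)
{
    /* Naive version: block this worker for 2 seconds (arbitrary figure)
     * whenever the status is 404, then pass the brigade on unchanged. */
    if (f->r->status == HTTP_NOT_FOUND) {
        apr_sleep(apr_time_from_sec(2));
    }
    ap_remove_output_filter(f);        /* run at most once per request */
    return ap_pass_brigade(f->next, bb);
}

static void delay404_insert_filter(request_rec *r)
{
    ap_add_output_filter(delay404_filter_name, NULL, r, r->connection);
}

static void delay404_register_hooks(apr_pool_t *p)
{
    ap_register_output_filter(delay404_filter_name, delay404_out_filter,
                              NULL, AP_FTYPE_CONTENT_SET);
    ap_hook_insert_filter(delay404_insert_filter, NULL, NULL, APR_HOOK_MIDDLE);
}

module AP_MODULE_DECLARE_DATA delay404_module = {
    STANDARD20_MODULE_STUFF,
    NULL, NULL, NULL, NULL,            /* no per-dir/per-server config */
    NULL,                              /* no configuration directives  */
    delay404_register_hooks
};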

------------
(*) In terms of performance for the server:
The scheme would come into play once the server has already determined that all that remains to be done to satisfy this request, is to send back a 404 response.

That is a relatively simple task, which does not require all the resources that are needed to generate a "normal" response. So it could be off-loaded to a relatively lightweight "child" or "thread" of the server, and the original "heavy" request-processing thread/child would become free earlier, to process other potentially valuable requests.

I do not know exactly what the overhead would be for passing this task (returning the 404 response) from the heavy thread/child to the lightweight thread/child, and if the benefit of freeing the original thread/child earlier would compensate for that overhead.

Intuitively however, I would not be very surprised if this kind of scheme proved profitable for the server as a whole, even for a number of other non-404 responses. There is a whole series of non-200 responses which by RFC definition do not include a body, or include a standard body which is the same for all requests with the same status code. If I have a prefork Apache server where each child embeds a whole Perl interpreter, for example, why tie all of that up just to send back a status line to some slow client?
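
To illustrate the hand-off itself (and only that), here is a rough sketch in plain C and pthreads, deliberately outside the Apache APIs: the heavy worker queues the client socket once it knows the answer is a 404, and a single lightweight thread applies the delay and writes a canned response. A real implementation would use timers or non-blocking I/O so that one thread could hold many delayed connections at once instead of sleeping on them one by one; the point here is only how little state the hand-off needs.

/* Conceptual sketch, not Apache code: hand a finished "send 404" job
 * from a heavy worker to one lightweight sender thread. */
#include <pthread.h>
#include <unistd.h>

#define QUEUE_MAX 256

static const char canned_404[] =
    "HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\nConnection: close\r\n\r\n";

static int queue[QUEUE_MAX];                 /* pending client sockets */
static int q_head, q_tail, q_len;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;

/* Called by a heavy worker once it has decided the answer is a 404.
 * It returns immediately and is free to process the next request. */
void enqueue_404(int client_fd)
{
    pthread_mutex_lock(&q_lock);
    if (q_len < QUEUE_MAX) {
        queue[q_tail] = client_fd;
        q_tail = (q_tail + 1) % QUEUE_MAX;
        q_len++;
        pthread_cond_signal(&q_cond);
    } else {
        close(client_fd);                    /* queue full: just drop it */
    }
    pthread_mutex_unlock(&q_lock);
}

/* The lightweight thread: wait for work, sleep, send, close.
 * Started once at server start-up, e.g.:
 *   pthread_t t; pthread_create(&t, NULL, slow_404_sender, NULL); */
void *slow_404_sender(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (q_len == 0)
            pthread_cond_wait(&q_cond, &q_lock);
        int fd = queue[q_head];
        q_head = (q_head + 1) % QUEUE_MAX;
        q_len--;
        pthread_mutex_unlock(&q_lock);

        sleep(2);                            /* the deterrent delay */
        (void)write(fd, canned_404, sizeof(canned_404) - 1);
        close(fd);
    }
    return NULL;
}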

(**) Professionally, for the last 15 years I have been running the technical part of a company which specialises in information management through web interfaces. In that timespan, I have designed a lot of web applications, and examined a whole lot more which I didn't design myself. I have never in that time encountered an application which relied on a 404 response as "valid". They were always considered "errors" and treated as such in the code. That is not proof that there aren't any, but a reasonable basis for my assumption, I believe.

(***)
If you like sweeping comparisons, here is one:

Up to some 40 years ago, large cities such as London, Paris, Los Angeles etc. were periodically afflicted by smog, which apart from being disagreeable was also damaging people's health. The problem was that what caused this smog was also the result of a whole lot of individual activities which, on the other hand, brought individual people prosperity, lower costs and higher standards of living. Nevertheless, at some point a wide enough consensus developed to allow laws to be passed which forced people to spend more money (e.g. on catalytic converters, smoke scrubbers etc.), but in return brought cleaner air over their cities. These laws are not perfect, and affect some people more than others, but by and large nobody today in these cities could deny the improvement in the quality of the air that they breathe. Did it stop air pollution in general, and did some of the polluting activities just move somewhere else? Yes, but slowly these other places are also passing laws, and little by little the improvement becomes global (or at least things get worse more slowly than if nothing had been done).

Without taking myself too seriously, I believe that what I propose is of the same category of things. It is a global measure meant to tackle a small fraction of what currently pollutes the Internet and is an inconvenience and a cost to everyone.
And what distinguishes it from the above laws, is that it doesn't really cost 
anything.

Let me try to provide some elements to substantiate that last sentence:

Let's say that altogether it would cost 5 days of development on the part of one of the Apache dev gurus. And let's say that, on the other hand, it would result, one way or another, over a period of 2 years, in a global decrease of only 10% in URL-scanning activity alone. What would be the real cost/benefit analysis?

Let's take a dev cost of 1000$/day. The feature development would thus cost 
5,000$.

The other side is more tricky, but let's use some of the numbers that I have 
used before.
I have 25 servers, and in total these servers receive at least 1000 such individual URL-scanning requests per day, and on average they take at least 10 ms to return such a 404 response. So let's say that in aggregate this costs me 10 ms * 1000 = 10 seconds of server time per day, over the 25 servers.

A server costs about 2000$ to purchase, and is obsolete in 3 years. To simplify, say that 3 years is 1000 days, so its basic cost is 2$/day. I also pay hosting charges, bandwidth, maintenance, support etc. which raise this cost to, say, 5$/day.
A day has 86400 seconds, so the cost of one server for 1 second is 5$/86400 ~ 
0.00005 $.
So the cost of URL-scanning for my 25 servers, per day, is 0.00005 $ x 10s = 
0.0005 $.

For me, that is a ridiculously small amount, not even worth writing about.

But, there are 600 million webservers on the Internet, and by and large, they are all being scanned in the same way.
So this is a total cost of 0.0005 $ x 600,000,000 / 25 = 12,000 $ / day.

So if the scheme reduces the amount of URL-scanning by as little as 10%, that would be a saving of 1,200 $/day. It would thus take less than a week to recoup the initial development costs, and it would be pure profit thereafter, because there is essentially nothing else to do.
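
For anyone who wants to play with these (entirely hypothetical) figures, the whole back-of-envelope calculation fits in a few lines of C:

/* cost_sketch.c - reproduces the rough figures above; every input is an
 * assumption from this message, not a measurement. */
#include <stdio.h>

int main(void)
{
    double dev_cost     = 5 * 1000.0;      /* 5 dev days at 1000 $/day          */
    double cost_per_sec = 0.00005;         /* ~5 $/day per server / 86400 s     */
    double scan_seconds = 1000 * 0.010;    /* 1000 scan requests/day x 10 ms    */
    double cost_25      = scan_seconds * cost_per_sec;   /* my 25 servers       */
    double cost_global  = cost_25 * (600e6 / 25.0);      /* all 600M webservers */
    double saving_10pct = 0.10 * cost_global;

    printf("global URL-scanning cost : %.0f $/day\n", cost_global);   /* 12000 */
    printf("saving at 10%% reduction  : %.0f $/day\n", saving_10pct); /*  1200 */
    printf("payback of dev cost      : %.1f days\n", dev_cost / saving_10pct);
    return 0;
}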

If a company could be set up to do this commercially, would you join me as an investor?


