Ben Reser wrote:
> On Tue, Apr 30, 2013 at 5:23 PM, André Warnier <a...@ice-sa.com> wrote:
>> Alternatives:
>> 1) if you were running such a site (which I would still suppose is a
>> minority of the 600 million websites which exist), you could easily disable
>> the feature.
>> 2) you could instead return a redirect response, to a page saying "that one
>> was sold, but look at these".
>> That may be even more friendly to search engines, and to customers.
>
> My point isn't that there aren't alternatives, but that 404's are
> legitimate responses that legitimate users can be expected to receive.

I agree that 404's are legitimate responses.
And I agree that legitimate clients/users can expect to receive them.
But if they receive them when appropriate, merely slower than other kinds of responses, that is not really "breaking the rules". Nothing in the HTTP RFCs promises a response within a certain timeframe for any kind of HTTP request, and being slow to respond - to any kind of request - is a perfectly normal occurrence on the WWW.

To push the reasoning a bit further: imagine that hardware and software were still as they were 10 years ago (with CPUs running at 200-300 MHz, say). Back then, no webserver would ever have returned a 404 faster than the kind of delay I am talking about, yet that did not invalidate any well-built web application. Any legitimate web application which relied on a particular response time for a 404 response would seem to me very badly designed. Wouldn't it to you?

> As such you'll find it nearly impossible in my opinion to convince
> people to degrade performance for them as a default.

Of course, I'm having some difficulties convincing people. What do you think I'm doing right now? ;-)

But here also, I would like to state things in a more nuanced way. First you'd have to define "degrading performance". For whom? Part of my argument is that this could be implemented so as not to degrade the performance of the webserver very much, or at all (see (*) below). Another part is that the client-side delay would only really impact clients which, by the nature of what they are doing, can indeed be expected to receive a lot of 404 responses. I believe that this includes most URL-scanning bots, but extremely few legitimate clients/users. I cannot prove that, but it seems to me a reasonable assumption. (**)

Then, a more general comment on "degrading performance": the current situation is that URL-scanning is degrading performance for everyone on the whole WWW. These URL-scanning bots serve no useful purpose, except to the criminals who run them. Yet today they use up a significant portion of webserver resources and of general WWW bandwidth, which results in degraded performance for everyone, every day. You may not realise it, but every day you are paying a tax because of these URL-scans: whether you run a server or a client, you are paying for CPU cycles and bandwidth which you are not using yourself, or you are paying for a filter to block them out. Is this something that you feel you just have to accept, without trying to do anything about it?


> If it isn't a
> default you're hardly any better off than you are today since it will
> not be widely deployed.

I agree entirely. Having this deployed as the default option is a vital part of the plan. Otherwise, it would suffer from the same ills as all the very good methods of webserver protection which already exist: they require special resources to deploy, and in consequence they are not widely deployed, despite being available and effective.

For a more philosophical response, see (***) below.


> If you want to see a case where server behavior has been tweaked in
> order to combat miscreants go take a look at SMTP.  SMTP is no longer
> simple, largely because of the various schemes people have undertaken
> to stop spam.  Despite all these schemes, spam still exists and the
> only effective counters have been:
> 1) Securing open-relays.
> 2) Removing the bot-nets that are sending the spam.
> 3) Ultimately improving the security of the vulnerable systems that
> are sending the spam.
>
> All the effort towards black lists, SPF, domainkeys, etc... has been
> IMHO a waste of time.  At best it has been a temporary road block.


In a way, you are providing more arguments in favor of this 404-delay scheme.
What are the main reasons why many attempts at blocking spam were not very successful?
I would argue that it was because
a) they were complicated and costly to implement
b) they did not tackle the problem close enough to the source
c) they relied on "a posteriori" information, such as black-lists (before you can black-list some IP, you first have to have received spam from it, and then you can start broadcasting this IP, and then it takes a while for the recipients to receive and implement the information, by which time it is already obsolete)

The scheme which I propose would avoid some of these pitfalls, by
- being very easy to implement (in fact, apart from the original work needed to incorporate the scheme in the webserver code once, it would not require any additional effort by anyone)
- attacking the botnets which do the URL-scanning (admittedly a very small part of their activities, but that is the target of this proposal) as close to the source as possible, by making that very activity unprofitable in a global sense
- not relying on any prior knowledge of the bots, nor requiring any information to be either maintained or accessed. The response is a 404? Add the delay (or not, if you have some valid reason not to).

How much simpler can a scheme be?
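
Just to make the "how" concrete, below is a minimal sketch of what a default implementation could look like, written as a small Apache 2.x output filter module. Everything in it is illustrative and untested (the name mod_delay404, the fixed 2-second delay, the absence of any configuration directive), it glosses over details such as error-document internal redirects, and it does the naive thing of sleeping in the worker that handles the request, which is precisely the inefficiency that (*) below suggests avoiding. It is only meant to show how little logic the default itself requires.

/* mod_delay404.c - illustrative sketch only, not a patch.
 * Registers an output filter which, when the response going out is a 404,
 * sleeps for a fixed interval before letting the data through. */
#include "httpd.h"
#include "http_config.h"
#include "http_protocol.h"
#include "util_filter.h"
#include "apr_time.h"

static const char delay404_filter_name[] = "DELAY404";

static apr_status_t delay404_out_filter(ap_filter_t *f, apr_bucket_brigade *bb)
{
    /* Naive version: block this worker for 2 seconds (arbitrary figure)
     * whenever the status is 404, then pass the brigade on unchanged. */
    if (f->r->status == HTTP_NOT_FOUND) {
        apr_sleep(apr_time_from_sec(2));
    }
    ap_remove_output_filter(f);        /* run at most once per request */
    return ap_pass_brigade(f->next, bb);
}

static void delay404_insert_filter(request_rec *r)
{
    ap_add_output_filter(delay404_filter_name, NULL, r, r->connection);
}

static void delay404_register_hooks(apr_pool_t *p)
{
    ap_register_output_filter(delay404_filter_name, delay404_out_filter,
                              NULL, AP_FTYPE_CONTENT_SET);
    ap_hook_insert_filter(delay404_insert_filter, NULL, NULL, APR_HOOK_MIDDLE);
}

module AP_MODULE_DECLARE_DATA delay404_module = {
    STANDARD20_MODULE_STUFF,
    NULL, NULL, NULL, NULL,            /* no per-dir/per-server config */
    NULL,                              /* no configuration directives  */
    delay404_register_hooks
};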

------------
(*) In terms of performance for the server:
The scheme would come into play once the server has already determined that all that remains to be done to satisfy this request, is to send back a 404 response.

That is a relatively simple task, which does not require all the resources that are needed to generate a "normal" response. So it could be off-loaded to a relatively lightweight "child" or "thread" of the server, and the original "heavy" request-processing thread/child would become free earlier, to process other potentially valuable requests.

I do not know exactly what the overhead would be for passing this task (returning the 404 response) from the heavy thread/child to the lightweight thread/child, and if the benefit of freeing the original thread/child earlier would compensate for that overhead.

Intuitively however, I would not be very surprised if this kind of scheme proved profitable for the server as a whole, even for a number of other non-404 responses. There is a whole series of non-200 responses which by RFC definition do not include a body, or include a standard body which is the same for all requests with the same status code. If I have a prefork Apache server where each child embeds a whole Perl interpreter, for example, why tie all of that up just to send back a status line to some slow client?
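
To illustrate the hand-off itself (and only that), here is a rough sketch in plain C and pthreads, deliberately outside the Apache APIs: the heavy worker queues the client socket once it knows the answer is a 404, and a single lightweight thread applies the delay and writes a canned response. A real implementation would use timers or non-blocking I/O so that one thread could hold many delayed connections at once instead of sleeping on them one by one; the point here is only how little state the hand-off needs.

/* Conceptual sketch, not Apache code: hand a finished "send 404" job
 * from a heavy worker to one lightweight sender thread. */
#include <pthread.h>
#include <unistd.h>

#define QUEUE_MAX 256

static const char canned_404[] =
    "HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\nConnection: close\r\n\r\n";

static int queue[QUEUE_MAX];                 /* pending client sockets */
static int q_head, q_tail, q_len;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;

/* Called by a heavy worker once it has decided the answer is a 404.
 * It returns immediately and is free to process the next request. */
void enqueue_404(int client_fd)
{
    pthread_mutex_lock(&q_lock);
    if (q_len < QUEUE_MAX) {
        queue[q_tail] = client_fd;
        q_tail = (q_tail + 1) % QUEUE_MAX;
        q_len++;
        pthread_cond_signal(&q_cond);
    } else {
        close(client_fd);                    /* queue full: just drop it */
    }
    pthread_mutex_unlock(&q_lock);
}

/* The lightweight thread: wait for work, sleep, send, close.
 * Started once at server start-up, e.g.:
 *   pthread_t t; pthread_create(&t, NULL, slow_404_sender, NULL); */
void *slow_404_sender(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (q_len == 0)
            pthread_cond_wait(&q_cond, &q_lock);
        int fd = queue[q_head];
        q_head = (q_head + 1) % QUEUE_MAX;
        q_len--;
        pthread_mutex_unlock(&q_lock);

        sleep(2);                            /* the deterrent delay */
        (void)write(fd, canned_404, sizeof(canned_404) - 1);
        close(fd);
    }
    return NULL;
}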

(**) Professionally, for the last 15 years I have been running the technical part of a company which specialises in information management through web interfaces. In that timespan, I have designed a lot of web applications, and examined a whole lot more which I didn't design myself. I have never in that time encountered an application which relied on a 404 response as "valid". They were always considered "errors" and treated as such in the code. That is not proof that there aren't any, but a reasonable basis for my assumption, I believe.

(***)
If you like sweeping comparisons, here is one:

Up to some 40 years ago, large cities such as London, Paris, Los Angeles etc. were periodically afflicted by smog, which apart from being disagreeable was also damaging people's health. The problem was that what caused this smog was also the result of a whole lot of individual activities which, on the other hand, brought individual people prosperity, lower costs and higher standards of living. Nevertheless, at some point a wide enough consensus developed to allow laws to be passed which forced people to spend more money (e.g. on catalytic converters, smoke scrubbers etc.), but in return brought cleaner air over their cities. These laws are not perfect, and affect some people more than others, but by and large nobody today in these cities could deny the improvement in the quality of the air that they breathe. Did it stop air pollution in general, and did some of the polluting activities just move somewhere else? Yes, but slowly these other places are also passing laws, and little by little the improvement becomes global (or at least things get worse more slowly than if nothing had been done).

Without taking myself too seriously, I believe that what I propose is of the same category of things. It is a global measure meant to tackle a small fraction of what currently pollutes the Internet and is an inconvenience and a cost to everyone.
And what distinguishes it from the above laws, is that it doesn't really cost 
anything.

Let me try to provide some elements to substantiate that last sentence:

Let's say that altogether it would cost 5 days of development on the part of one of the Apache dev gurus. And let's say that, on the other hand, it would result, one way or another, over a period of 2 years, in a global decrease of only 10% in URL-scanning activity alone. What would be the real cost/benefit analysis?

Let's take a dev cost of 1000$/day. The feature development would thus cost 
5,000$.

The other side is more tricky, but let's use some of the numbers that I have 
used before.
I have 25 servers, and in total these servers receive at least 1000 such individual URL-scanning requests per day, and on average they take at least 10 ms to return such a 404 response. So let's say that in aggregate this costs me 10 ms * 1000 = 10 seconds of server time per day, over the 25 servers.

A server costs about 2000$ to purchase, and is obsolete in 3 years. To simplify, say that 3 years is 1000 days, so its basic cost is 2$/day. I also pay hosting charges, bandwidth, maintenance, support etc. which raise this cost to, say, 5$/day.
A day has 86400 seconds, so the cost of one server for 1 second is 5$/86400 ~ 
0.00005 $.
So the cost of URL-scanning for my 25 servers, per day, is 0.00005 $ x 10s = 
0.0005 $.

For me, that is a ridiculously small amount, not even worth writing about.

But, there are 600 million webservers on the Internet, and by and large, they are all being scanned in the same way.
So this is a total cost of 0.0005 $ x 600,000,000 / 25 = 12,000 $ / day.

So if the scheme reduces the amount of URL-scanning by as little as 10%, that would be a saving of 1,200 $/day. It would thus take less than a week to recoup the initial development costs, and it would be pure profit thereafter, because there is essentially nothing else to do.
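
For anyone who wants to play with these (entirely hypothetical) figures, the whole back-of-envelope calculation fits in a few lines of C:

/* cost_sketch.c - reproduces the rough figures above; every input is an
 * assumption from this message, not a measurement. */
#include <stdio.h>

int main(void)
{
    double dev_cost     = 5 * 1000.0;      /* 5 dev days at 1000 $/day          */
    double cost_per_sec = 0.00005;         /* ~5 $/day per server / 86400 s     */
    double scan_seconds = 1000 * 0.010;    /* 1000 scan requests/day x 10 ms    */
    double cost_25      = scan_seconds * cost_per_sec;   /* my 25 servers       */
    double cost_global  = cost_25 * (600e6 / 25.0);      /* all 600M webservers */
    double saving_10pct = 0.10 * cost_global;

    printf("global URL-scanning cost : %.0f $/day\n", cost_global);   /* 12000 */
    printf("saving at 10%% reduction  : %.0f $/day\n", saving_10pct); /*  1200 */
    printf("payback of dev cost      : %.1f days\n", dev_cost / saving_10pct);
    return 0;
}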

If a company could be set up to do this commercially, would you join me as an investor?


