On 22/06/11 16:05, Kingsley Idehen wrote:
On 6/22/11 3:57 PM, Steve Harris wrote:
Yes, exactly.
I think that the problem is at least partly (and I say this as an
ex-academic) that few people in academia have the slightest idea how
much it costs to run a farm of servers in the Real World™.
From the crawler's point of view, they're trying to get as much
data as possible in as short a time as possible, but don't realise that
the poor guy at the other end just got his 95th percentile shot
through the roof, and now has a several-thousand-dollar bandwidth bill
heading his way.
You can cap bandwidth, but that then might annoy paying customers,
which is clearly not good.
Yes, so we need QoS algorithms or heuristics capable of fine-grained
partitioning re. Who can do What, When, and Where :-)
Kingsley
There are plenty of these around when it comes to web traffic in
general. For Apache, I can think of ModSecurity
(http://www.modsecurity.org/) and mod_evasive
(http://www.zdziarski.com/blog/?page_id=442).
Both of these will look at traffic patterns and dynamically blacklist as
needed.
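
For example, a minimal mod_evasive setup (a sketch only; the
thresholds below are illustrative, not recommendations) would be
something like:

    <IfModule mod_evasive20.c>
        DOSHashTableSize   3097
        DOSPageCount       10     # max requests for the same URI per page interval
        DOSPageInterval    1      # page interval, in seconds
        DOSSiteCount       100    # max requests for any URI per site interval
        DOSSiteInterval    1      # site interval, in seconds
        DOSBlockingPeriod  60     # blacklisted clients get 403s for this long
        DOSWhitelist       127.0.0.1
    </IfModule>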
ModSecurity also allows custom rules to be written based on GET/POST
content, so it should be perfectly feasible to set up rules based on
estimated/actual query cost (e.g. blacklist a client that makes > X
requests per Y minutes which return > Z triples).
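
As a very rough sketch of that sort of rule (ModSecurity 2.x syntax;
the X-Triple-Count response header is hypothetical, i.e. the SPARQL
endpoint would have to emit it itself, and the thresholds are made up):

    # Keep a persistent per-client collection keyed on IP
    SecAction "phase:1,initcol:ip=%{REMOTE_ADDR},pass,nolog"

    # Count "expensive" responses, using the hypothetical header
    # above; counts expire after a 5-minute window
    SecRule RESPONSE_HEADERS:X-Triple-Count "@gt 10000" \
        "phase:3,pass,nolog,setvar:ip.costly=+1,expirevar:ip.costly=300"

    # Deny further requests once a client racks up too many
    # expensive queries inside that window
    SecRule IP:COSTLY "@gt 20" \
        "phase:1,deny,status:503,log,msg:'SPARQL query cost limit exceeded'"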
Can't see any reason why a hybrid approach couldn't be used, e.g. apply
rules to unauthenticated traffic, and auto-whitelist clients identifying
themselves via WebID.
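
To sketch the whitelisting half (hand-waving a lot here: this assumes
mod_ssl with SSLVerifyClient optional_no_ca and SSLOptions
+StdEnvVars, assumes the SSL env vars are visible to ModSecurity by
phase 2, and only checks that some client certificate was presented;
actually dereferencing and verifying the WebID profile would need a
proper auth layer in front):

    # Skip the rate-limit rules for clients presenting a certificate
    SecRule ENV:SSL_CLIENT_VERIFY "@rx ^(SUCCESS|GENEROUS)" \
        "phase:2,pass,nolog,ctl:ruleEngine=Off"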
--
Dave Challis
[email protected]