On 22/06/11 16:05, Kingsley Idehen wrote:
On 6/22/11 3:57 PM, Steve Harris wrote:
Yes, exactly.
I think that the problem is at least partly (and I say this as an
ex-academic) that few people in academia have the slightest idea how
much it costs to run a farm of servers in the Real World™.
From the crawler's point of view, they're trying to get as much
data as possible in as short a time as possible, but don't realise that
the poor guy at the other end just got his 95th percentile shot
through the roof, and now has a several-thousand-dollar bandwidth bill
heading his way.
You can cap bandwidth, but that then might annoy paying customers,
which is clearly not good.
Yes, so we need QoS algorithms or heuristics capable of fine-grained
partitioning re. Who can do What, When, and Where :-)
Kingsley
There are plenty of these around when it comes to web traffic in
general. For Apache, I can think of ModSecurity
(http://www.modsecurity.org/) and mod_evasive
(http://www.zdziarski.com/blog/?page_id=442).
Both of these will look at traffic patterns and dynamically blacklist as
needed.
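
For example, a minimal mod_evasive setup (a sketch only; the
thresholds below are illustrative, not recommendations) would be
something like:

    <IfModule mod_evasive20.c>
        DOSHashTableSize   3097
        DOSPageCount       10     # max requests for the same URI per page interval
        DOSPageInterval    1      # page interval, in seconds
        DOSSiteCount       100    # max requests for any URI per site interval
        DOSSiteInterval    1      # site interval, in seconds
        DOSBlockingPeriod  60     # blacklisted clients get 403s for this long
        DOSWhitelist       127.0.0.1
    </IfModule>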
ModSecurity also allows custom rules to be written based on GET/POST
content, so it should be perfectly feasible to set up rules based on
estimated/actual query cost (e.g. blacklist a client that makes > X
requests per Y minutes which return > Z triples).
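
As a very rough sketch of that sort of rule (ModSecurity 2.x syntax;
the X-Triple-Count response header is hypothetical, i.e. the SPARQL
endpoint would have to emit it itself, and the thresholds are made up):

    # Keep a persistent per-client collection keyed on IP
    SecAction "phase:1,initcol:ip=%{REMOTE_ADDR},pass,nolog"

    # Count "expensive" responses, using the hypothetical header
    # above; counts expire after a 5-minute window
    SecRule RESPONSE_HEADERS:X-Triple-Count "@gt 10000" \
        "phase:3,pass,nolog,setvar:ip.costly=+1,expirevar:ip.costly=300"

    # Deny further requests once a client racks up too many
    # expensive queries inside that window
    SecRule IP:COSTLY "@gt 20" \
        "phase:1,deny,status:503,log,msg:'SPARQL query cost limit exceeded'"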
Can't see any reason why a hybrid approach couldn't be used, e.g. apply
rules to unauthenticated traffic, and auto-whitelist clients identifying
themselves via WebID.
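
To sketch the whitelisting half (hand-waving a lot here: this assumes
mod_ssl with SSLVerifyClient optional_no_ca and SSLOptions
+StdEnvVars, assumes the SSL env vars are visible to ModSecurity by
phase 2, and only checks that some client certificate was presented;
actually dereferencing and verifying the WebID profile would need a
proper auth layer in front):

    # Skip the rate-limit rules for clients presenting a certificate
    SecRule ENV:SSL_CLIENT_VERIFY "@rx ^(SUCCESS|GENEROUS)" \
        "phase:2,pass,nolog,ctl:ruleEngine=Off"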
--
Dave Challis
[email protected]