On 6/22/11 4:51 PM, Dave Challis wrote:
On 22/06/11 16:05, Kingsley Idehen wrote:
On 6/22/11 3:57 PM, Steve Harris wrote:
Yes, exactly.

I think that the problem is at least partly (and I say this as an
ex-academic) that few people in academia have the slightest idea how
much it costs to run a farm of servers in the Real World™.

From the point of view of the crawler, they're trying to get as much
data as possible in as short a time as possible, but don't realise that
the poor guy at the other end just had his 95th percentile shot
through the roof, and now has a several-thousand-dollar bandwidth bill
heading his way.

You can cap bandwidth, but that then might annoy paying customers,
which is clearly not good.

Yes, so we need QoS algorithms or heuristics capable of fine-grained
partitioning re. Who can do What, When, and Where :-)

Kingsley

There are plenty of these around when it comes to web traffic in general. For apache, I can think of ModSecurity (http://www.modsecurity.org/) and mod_evasive (http://www.zdziarski.com/blog/?page_id=442).

Both of these will look at traffic patterns and dynamically blacklist as needed.
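The dynamic-blacklisting idea behind tools like mod_evasive can be sketched roughly as a sliding-window rate check per IP (this is an illustrative sketch of the logic, not mod_evasive's actual code; all thresholds and names here are assumptions):

```python
import time
from collections import defaultdict, deque

class RateBlacklist:
    """Blacklist IPs exceeding a request-rate threshold, in the spirit
    of mod_evasive's per-IP throttling (illustrative sketch only)."""

    def __init__(self, max_requests=50, window_secs=10, ban_secs=600):
        self.max_requests = max_requests
        self.window_secs = window_secs
        self.ban_secs = ban_secs
        self.hits = defaultdict(deque)   # ip -> recent request timestamps
        self.banned = {}                 # ip -> ban expiry time

    def allow(self, ip, now=None):
        now = time.time() if now is None else now
        # Lift a ban once it has expired.
        if ip in self.banned and now >= self.banned[ip]:
            del self.banned[ip]
        if ip in self.banned:
            return False
        q = self.hits[ip]
        q.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while q and now - q[0] > self.window_secs:
            q.popleft()
        if len(q) > self.max_requests:
            self.banned[ip] = now + self.ban_secs
            return False
        return True
```

The point being made below still stands: this kind of rule only sees the IP, not the "Who" behind the request.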

How do they deal with "Who" without throwing the baby out with the bathwater re. Linked Data?

An innocent Linked Data consumer triggers a transitive crawl, and every other visitor from that IP lands on a blacklist. Nobody meant any harm. In the RDBMS realm, would it be reasonable to take any of the following actions:

1. Cut off marketing because someone triggered SELECT * FROM Customers as part of MS Query or MS Access usage?
2. Cut off sales and/or marketing because, while trying to grok SQL joins, they generated a lot of Cartesian products?

You need granularity within the data access technology itself. WebID offers that to Linked Data. Linked Data is the evolution hitting the Web and redefining crawling in the process.


ModSecurity also allows custom rules to be written against GET/POST content, so it should be perfectly feasible to set up rules based on estimated/actual query cost (e.g. blacklist if a client makes > X requests per Y mins which return > Z triples).
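That cost-based rule could look something like the following sketch. This is not ModSecurity's SecRule syntax, just the underlying logic; the thresholds, class name, and method names are all assumptions for illustration:

```python
import time
from collections import defaultdict, deque

class QueryCostThrottle:
    """Blacklist a client that issues more than `max_expensive` queries
    returning over `triple_limit` triples within `window_secs` seconds.
    Illustrative sketch of the rule described above, not real config."""

    def __init__(self, max_expensive=5, window_secs=60, triple_limit=10000):
        self.max_expensive = max_expensive
        self.window_secs = window_secs
        self.triple_limit = triple_limit
        self.expensive = defaultdict(deque)  # client -> costly-query timestamps
        self.blacklist = set()

    def record(self, client, triples_returned, now=None):
        """Record a completed query; return True if the client may continue."""
        now = time.time() if now is None else now
        if triples_returned <= self.triple_limit:
            return client not in self.blacklist
        q = self.expensive[client]
        q.append(now)
        # Keep only costly queries inside the sliding window.
        while q and now - q[0] > self.window_secs:
            q.popleft()
        if len(q) > self.max_expensive:
            self.blacklist.add(client)
        return client not in self.blacklist
```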

How does it know about http://kingsley.idehen.net/dataspace/person#this, for better or for worse re. QoS?


Can't see any reason why a hybrid approach couldn't be used, e.g. apply rules to unauthenticated traffic, and auto-whitelist clients identifying themselves via WebID.

Of course, a hybrid system is how it has to work. WebID isn't a silver bullet; nothing is. Hence the need for heuristics and algorithms. WebID is just a critical factor, ditto Trust Logic.


--

Regards,

Kingsley Idehen 
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
