On 6/22/11 4:51 PM, Dave Challis wrote:
On 22/06/11 16:05, Kingsley Idehen wrote:
On 6/22/11 3:57 PM, Steve Harris wrote:
Yes, exactly.

I think that the problem is at least partly (and I say this as an
ex-academic) that few people in academia have the slightest idea how
much it costs to run a farm of servers in the Real World™.

From the point of view of the crawler, they're trying to get as much
data as possible in as short a time as possible, but don't realise that
the poor guy at the other end just had his 95th percentile shot
through the roof, and now has a several-thousand-dollar bandwidth bill
heading his way.

You can cap bandwidth, but that then might annoy paying customers,
which is clearly not good.

Yes, so we need QoS algorithms or heuristics capable of fine-grained
partitioning re. Who can do What, When, and Where :-)

Kingsley

There are plenty of these around when it comes to web traffic in general. For apache, I can think of ModSecurity (http://www.modsecurity.org/) and mod_evasive (http://www.zdziarski.com/blog/?page_id=442).

Both of these will look at traffic patterns and dynamically blacklist as needed.
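The dynamic-blacklisting idea behind tools like mod_evasive can be sketched roughly as a sliding-window rate check per IP (this is an illustrative sketch of the logic, not mod_evasive's actual code; all thresholds and names here are assumptions):

```python
import time
from collections import defaultdict, deque

class RateBlacklist:
    """Blacklist IPs exceeding a request-rate threshold, in the spirit
    of mod_evasive's per-IP throttling (illustrative sketch only)."""

    def __init__(self, max_requests=50, window_secs=10, ban_secs=600):
        self.max_requests = max_requests
        self.window_secs = window_secs
        self.ban_secs = ban_secs
        self.hits = defaultdict(deque)   # ip -> recent request timestamps
        self.banned = {}                 # ip -> ban expiry time

    def allow(self, ip, now=None):
        now = time.time() if now is None else now
        # Lift a ban once it has expired.
        if ip in self.banned and now >= self.banned[ip]:
            del self.banned[ip]
        if ip in self.banned:
            return False
        q = self.hits[ip]
        q.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while q and now - q[0] > self.window_secs:
            q.popleft()
        if len(q) > self.max_requests:
            self.banned[ip] = now + self.ban_secs
            return False
        return True
```

The point being made below still stands: this kind of rule only sees the IP, not the "Who" behind the request.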

How do they deal with "Who" without throwing the baby out with the bathwater re. Linked Data?

An innocent Linked Data consumer triggers a transitive crawl, and every other visitor from that IP lands on a blacklist. Nobody meant any harm. In the RDBMS realm, would it be reasonable to take any of the following actions:

1. Cut off marketing because someone triggered SELECT * FROM Customers as part of MS Query or MS Access usage?
2. Cut off sales and/or marketing because, while trying to grok SQL joins, they generated a lot of Cartesian products?

You need granularity within the data access technology itself. WebID offers that to Linked Data. Linked Data is the evolution hitting the Web and redefining crawling in the process.


ModSecurity also allows custom rules to be written against GET/POST content, so it should be perfectly feasible to set up rules based on estimated/actual query cost (e.g. blacklist if a client makes > X requests per Y mins which return > Z triples).
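That cost-based rule could look something like the following sketch. This is not ModSecurity's SecRule syntax, just the underlying logic; the thresholds, class name, and method names are all assumptions for illustration:

```python
import time
from collections import defaultdict, deque

class QueryCostThrottle:
    """Blacklist a client that issues more than `max_expensive` queries
    returning over `triple_limit` triples within `window_secs` seconds.
    Illustrative sketch of the rule described above, not real config."""

    def __init__(self, max_expensive=5, window_secs=60, triple_limit=10000):
        self.max_expensive = max_expensive
        self.window_secs = window_secs
        self.triple_limit = triple_limit
        self.expensive = defaultdict(deque)  # client -> costly-query timestamps
        self.blacklist = set()

    def record(self, client, triples_returned, now=None):
        """Record a completed query; return True if the client may continue."""
        now = time.time() if now is None else now
        if triples_returned <= self.triple_limit:
            return client not in self.blacklist
        q = self.expensive[client]
        q.append(now)
        # Keep only costly queries inside the sliding window.
        while q and now - q[0] > self.window_secs:
            q.popleft()
        if len(q) > self.max_expensive:
            self.blacklist.add(client)
        return client not in self.blacklist
```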

How does it know about http://kingsley.idehen.net/dataspace/person#this, for better or for worse re. QoS?


Can't see any reason why a hybrid approach couldn't be used, e.g. apply rules to unauthenticated traffic, and auto-whitelist clients identifying themselves via WebID.

Of course, a hybrid system is how it has to work. WebID isn't a silver bullet; nothing is. Hence the need for heuristics and algorithms. WebID is just a critical factor, ditto Trust Logic.


--

Regards,

Kingsley Idehen 
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
