On 23 Jun 2011, at 10:20, Michael Brunnbauer wrote:
>
> re
>
> On Thu, Jun 23, 2011 at 10:09:25AM +0200, Martin Hepp wrote:
>> Yes, WebID is without question a good thing. I am not entirely sure, though,
>> that you can make it a mandatory requirement for access to your site,
>> because if a few major consumers do not use WebID for their crawlers,
>> site-owners cannot block anonymous crawlers.
>
> Google, Bing and Yahoo authenticate themselves via DNS: do a reverse lookup
> of the IP, check the hostname against some well-known domains, and then do a
> forward lookup of the hostname and check that it matches the IP. Much simpler
> to implement than WebID.
>
> config = {
>     'Googlebot': ['googlebot.com'],
>     'Mediapartners-Google': ['googlebot.com'],
>     'msnbot': ['live.com', 'msn.com', 'bing.com'],
>     'bingbot': ['live.com', 'msn.com', 'bing.com'],
>     'Yahoo! Slurp': ['yahoo.com', 'yahoo.net'],
> }
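For concreteness, that reverse-then-forward check can be sketched like this (a Python sketch; the resolver functions are passed in as parameters so the policy logic can be exercised without live DNS, and the IP and hostnames below are made up for illustration):

```python
def verify_bot(ip, allowed_domains, reverse_lookup, forward_lookup):
    """Reverse-then-forward DNS check for a crawler IP.

    reverse_lookup(ip) -> hostname; forward_lookup(hostname) -> list of IPs.
    Both are injected so the check is testable; in production you would wrap
    socket.gethostbyaddr and socket.gethostbyname_ex.
    """
    try:
        hostname = reverse_lookup(ip)
    except OSError:
        return False
    # The reverse name must end in one of the well-known bot domains.
    if not any(hostname == d or hostname.endswith("." + d)
               for d in allowed_domains):
        return False
    # Forward-confirm: the claimed hostname must resolve back to the same IP,
    # otherwise anyone controlling reverse DNS for their block could spoof it.
    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

The forward-confirmation step is the important part: reverse DNS alone is set by whoever owns the IP block, so a spoofer can point it at googlebot.com; the forward lookup closes that hole.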
It looks simple like that, but when things start scaling this becomes a
full-time job. At AltaVista there was a person dedicated to dealing with these
types of rules (even when the company was under 50 people): working out who was
abusing the system, what types of throttles to apply, and so on. At the time
the big issue was that a huge portion of web traffic came from a few AOL IP
addresses. If someone misbehaved there, would you throttle all of AOL? The same
is certainly true now, at much larger scale: are you going to throttle a whole
IP block because of the bad behaviour of one individual? The approach above is
still workable while the bots are limited to a few massive crawlers, but when
everyone can crawl, working out who is who through IP addresses will not be
possible.
Sure, WebID is not widely implemented yet. But the semantic web has the most to
gain from its adoption, since it ties right into linked data - it was
originally called foaf+ssl! WebID is not that difficult to implement, and since
these data sets are being placed online to test the skills and quality of the
engineers, why not put a few data sets online protected in different ways with
WebID? This will help build knowledge of how:
- to protect web services with WebID
- to build clients that use WebID
- to get feedback on how crawlers are behaving
(If you are worried about anonymity, you could have your crawler use a WebID
that cannot be traced to an institution, and later, when you have collected the
data, prove that you are in control of that WebID.)
So for crawler writers, giving their crawler a WebID is half a day's work to
get going. We have written WebID implementations for servers in a day or two.
Of course, in both cases one can always keep tuning, but you can get going
really quickly.
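To make the "day or two" concrete: the client presents a (possibly self-signed) TLS certificate carrying its WebID URI in the subjectAltName; the server dereferences that URI and checks that the profile found there publishes the same public key the client just proved possession of in the handshake. The core comparison is tiny - here is a sketch of just that step (the TLS and RDF-fetching plumbing is elided, and the function, field names, and example URI are illustrative, not from any particular library):

```python
from dataclasses import dataclass

@dataclass
class RSAKey:
    """An RSA public key as published in a WebID profile (cert:modulus,
    cert:exponent in the WebID vocabulary)."""
    modulus: int
    exponent: int

def webid_claim_verified(claimed_uri, cert_key, fetch_profile_keys):
    """Core WebID check: does the profile at the claimed URI list the public
    key from the client's certificate?

    fetch_profile_keys(uri) -> list of RSAKey from the profile document;
    a real server would dereference the URI over HTTP and parse the RDF.
    """
    profile_keys = fetch_profile_keys(claimed_uri)
    return any(k.modulus == cert_key.modulus and
               k.exponent == cert_key.exponent
               for k in profile_keys)
```

Note that no certificate authority is involved: trust comes from control of the profile document at the WebID URI, which is what makes the anonymity trick above work - you can mint a fresh URI, crawl, and only later prove you control it.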
Henry
>
> Regards,
>
> Michael Brunnbauer
>
> --
> ++ Michael Brunnbauer
> ++ netEstate GmbH
> ++ Geisenhausener Straße 11a
> ++ 81379 München
> ++ Tel +49 89 32 19 77 80
> ++ Fax +49 89 32 19 77 89
> ++ E-Mail [email protected]
> ++ http://www.netestate.de/
> ++
> ++ Sitz: München, HRB Nr.142452 (Handelsregister B München)
> ++ USt-IdNr. DE221033342
> ++ Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
> ++ Prokurist: Dipl. Kfm. (Univ.) Markus Hendel
>
Social Web Architect
http://bblfish.net/