On 2/5/09 10:35 AM, Bernhard Haslhofer wrote:
Hi all,
we are currently working on the question of how to deal with broken
links/references between resources in (distinct) LOD data sets and
would like to know your opinion on that issue. If there is any work
going on in this direction, please let me know.
I don't think I really need to explain the problem. Everybody knows
it from the "human" Web: you follow a link and get an annoying
404 response.
If we assume that the consumers of LOD data are not humans but
applications, broken links/references are not merely "annoying" but
can lead to severe processing errors if an application relies on
some kind of "referential integrity".
Assume we have an LOD data source X exposing resources that describe
images, and these images are linked to resources in DBpedia (e.g.,
http://dbpedia.org/resource/Berlin). An application built on top of X
follows links to retrieve the geo-coordinates in order to display the
images on a virtual map. If, for some reason, the URL of the linked
DBpedia resource changes - either because DBpedia is moved or
re-organized, which I guess could happen to any LOD source in a
long-term perspective - the application might crash if it doesn't
consider that referenced resources might move or become unavailable.
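One defensive pattern for the map scenario is to treat every dereference as fallible and degrade gracefully. Below is a minimal Python sketch; the JSON shape ("lat"/"long" keys) and the Accept header are illustrative assumptions, not DBpedia's actual response format:

```python
# Minimal defensive-dereference sketch for the map scenario.
# ASSUMPTION: the resource is served as JSON with "lat"/"long" keys;
# real DBpedia responses look different, so treat this as illustrative.
import json
import urllib.error
import urllib.request

def fetch_geo(uri, timeout=10):
    """Dereference `uri`; return (lat, long) or None if the link is broken.

    Returning None lets the caller skip one image instead of crashing
    when the referenced resource has moved or disappeared.
    """
    req = urllib.request.Request(uri, headers={"Accept": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            data = json.load(resp)
    except (urllib.error.URLError, ValueError):
        return None  # 404, DNS failure, malformed body: a "null" link
    lat, lon = data.get("lat"), data.get("long")
    if lat is None or lon is None:
        return None
    return (lat, lon)
```

The point is only that a 404 becomes an expected outcome handled at every dereference, not an exception that propagates up and crashes the application.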
I know that "cool URIs don't change" but I am not sure if this
assumption holds in practice, especially in a long-term perspective.
Well, Null Pointers have dogged programmers for eons.
You have to test for Null Pointers (URIs) when programming for the
Linked Data Web too.
Now if a pointer is Null, you have to think about (in an
application-specific way) how to locate the same values elsewhere,
and then decide (should you find what you seek) how to persist them
in your own space (i.e., make a new URI for these values).
You can do many things with a 404 condition courtesy of SPARQL.
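To make that SPARQL angle concrete: on a 404 you can re-ask another endpoint for everything known about the vanished URI, then decide whether to persist the recovered triples under a new URI in your own space, as described above. The mirror endpoint address below is a placeholder assumption; the query-building part is the substance:

```python
# Sketch of a SPARQL fallback for a URI that now returns 404.
# ASSUMPTION: MIRROR_ENDPOINT is a hypothetical mirror holding a copy
# of the data; substitute any endpoint you trust (e.g., your own cache).
from urllib.parse import urlencode

MIRROR_ENDPOINT = "http://example.org/sparql"  # placeholder, not a real service

def fallback_query(uri):
    """SPARQL query fetching every triple whose subject is `uri`."""
    return "SELECT ?p ?o WHERE { <%s> ?p ?o }" % uri

def fallback_url(uri):
    """GET URL that asks the mirror for the vanished resource's triples."""
    params = {"query": fallback_query(uri),
              "format": "application/sparql-results+json"}
    return MIRROR_ENDPOINT + "?" + urlencode(params)
```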
Coincidentally, I touched on this resilience matter during the Beijing
Linked Data Workshop, as per this excerpt from Orri's blog post [1],
following the workshop:
"What to do when identity expires?
Giovanni of Sindice said that a document should be removed from search
if it was no longer available. Kingsley pointed out that resilience of
reference requires some way to recover data. The data web cannot be less
resilient than the document web, and there is a point to having access
to history. He recommended hooking up with the Internet Archive, since
they make long term persistence their business. In this way, if an
application depends on data, and the URIs on which it depends are no
longer dereferenceable or provide content from a new owner of the
domain, those who need the old version can still get it and host it
themselves."
For the "human" Web, several solutions have been proposed, e.g.:
1.) PURL and DOI services for translating URNs into resolvable URLs
2.) forward references
3.) robust link implementations, i.e., with each link you keep a set
of related search terms to retrieve moved / changed resources
4.) observer / notification mechanisms
X.) ?
All nice ideas. Usage will be application and scenario specific, naturally.
I guess (1) is not really applicable to LOD resources because of
scalability and single-point-of-failure issues.
If you take a closer look at the federation that EC2 affords, and how we
are making it easy for anyone to have their own Linked Data driven
knowledge bases for personal and service-specific use [2], you might spot
a little nuance: we always link back to an original source data object
URI (a form of intrinsic Attribution by Reference). The idea is that
this kind of federation ultimately builds up URI resilience in a manner
similar to general Internet resilience (you can slow it down or
inconvenience it, but never erase it, thanks to the "scale-free"
attribute of real federation).
(2) would require LOD providers to set up HTTP redirects for their
moved resources - I have no idea whether anybody will do that in
practice, or how it would scale. (3) could help re-locate moved
resources via search engines like Sindice, but not fully
automatically. (4) could at least inform a data source that certain
references are broken, so it could remove them.
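Options (2) and (4) could be combined in a simple link-maintenance policy: follow permanent redirects to rewrite stored links, and flag dead links for notification. A sketch of just the decision logic, under the assumption that the HTTP probing happens elsewhere and only status codes arrive here:

```python
def reconcile_link(stored_uri, status, location=None):
    """Decide what to do with a stored outbound link after probing it.

    Returns (uri_to_store, note):
      301/308 + Location -> the provider set up a forward reference (2):
                            rewrite the stored link to the new URI
      404/410            -> flag as broken so the owning data source can
                            be notified and can remove the reference (4)
      anything else      -> leave the link alone
    """
    if status in (301, 308) and location:
        return location, "moved"
    if status in (404, 410):
        return stored_uri, "broken"
    return stored_uri, "ok"
```

Temporary redirects (302/307) are deliberately left alone here, since they carry no promise that the resource has permanently moved.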
Another alternative is, of course, to leave the problem entirely to
application developers, which means they must consider that a
referenced resource might or might not exist. I am not sure about the
practical consequences of that approach, especially if several data
sources are involved, but I have the feeling that things get really
complicated if one cannot rely on any kind of referential integrity.
In a nutshell, yes - but this is about data architects and developers
working in concert as part of product and service delivery.
Are there any existing mechanisms that can give us at least some basic
feedback about the "quality" of an LOD data source? I think
referential integrity could be such a quality property...
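One crude but computable version of that quality property would be to probe every outbound link of a data source and report the fraction that still resolves. A sketch, assuming the status codes have already been gathered (0 standing in for a network-level failure):

```python
def integrity_score(link_statuses):
    """Fraction of outbound links that still resolve (2xx or 3xx).

    `link_statuses` maps each linked URI to the HTTP status observed
    when probing it; 0 stands for a network-level failure (DNS, timeout).
    """
    if not link_statuses:
        return 1.0  # no outbound links, so nothing can be broken
    ok = sum(1 for s in link_statuses.values() if 200 <= s < 400)
    return ok / len(link_statuses)
```

Published periodically per data source, such a score would give consuming applications at least a rough signal of how far they can rely on referential integrity.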
In an "Open World" the notion of "Quality" is inherently "Subjective".
The "Beauty & Beholder" rules apply at all scales in our universe :-)
Links:
1.
http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1347
2. http://dbpedia2.openlinksw.com:8895/resource/Berlin - localized
de-referencing and attribution link to source via owl:sameAs (all EC2
versions of DBpedia, Bio2RDF, NeuroCommons, and MusicBrainz get this.
Ditto the imminent Virtuoso Cluster Edition hosted LOD Cloud)
3.
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtInstallationEC2
- EC2 AMI Home Page
Kingsley
Thanks for your input on that issue,
Bernhard
______________________________________________________
Research Group Multimedia Information Systems
Department of Distributed and Multimedia Systems
Faculty of Computer Science
University of Vienna
Postal Address: Liebiggasse 4/3-4, 1010 Vienna, Austria
Phone: +43 1 42 77 39635 Fax: +43 1 4277 39649
E-Mail: [email protected]
WWW: http://www.cs.univie.ac.at/bernhard.haslhofer
--
Regards,
Kingsley Idehen Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO
OpenLink Software Web: http://www.openlinksw.com