On 2/5/09 10:35 AM, Bernhard Haslhofer wrote:
Hi all,
we are currently working on the question of how to deal with broken
links/references between resources in (distinct) LOD data sets and
would like to know your opinion on that issue. If there is any work
going on in this direction, please let me know.
I don't think I really need to explain the problem. Everybody knows
it from the "human" Web: you follow a link and get an annoying
404 response.
If we assume that the consumers of LOD data are not humans but
applications, broken links/references are not merely "annoying" but
can lead to severe processing errors if an application relies on
some kind of "referential integrity".
Assume we have an LOD data source X exposing resources that describe
images, and these images are linked to resources in DBpedia (e.g.,
http://dbpedia.org/resource/Berlin). An application built on top of X
follows links to retrieve the geo-coordinates in order to display the
images on a virtual map. If, for some reason, the URL of the linked
DBpedia resource changes - either because DBpedia is moved or
re-organized, which I guess could happen to any LOD source in a
long-term perspective - the application might crash if it doesn't
consider that referenced resources might move or become unavailable.
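One defensive pattern for the map scenario is to treat every dereference as fallible and degrade gracefully. Below is a minimal Python sketch; the JSON shape ("lat"/"long" keys) and the Accept header are illustrative assumptions, not DBpedia's actual response format:

```python
# Minimal defensive-dereference sketch for the map scenario.
# ASSUMPTION: the resource is served as JSON with "lat"/"long" keys;
# real DBpedia responses look different, so treat this as illustrative.
import json
import urllib.error
import urllib.request

def fetch_geo(uri, timeout=10):
    """Dereference `uri`; return (lat, long) or None if the link is broken.

    Returning None lets the caller skip one image instead of crashing
    when the referenced resource has moved or disappeared.
    """
    req = urllib.request.Request(uri, headers={"Accept": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            data = json.load(resp)
    except (urllib.error.URLError, ValueError):
        return None  # 404, DNS failure, malformed body: a "null" link
    lat, lon = data.get("lat"), data.get("long")
    if lat is None or lon is None:
        return None
    return (lat, lon)
```

The point is only that a 404 becomes an expected outcome handled at every dereference, not an exception that propagates up and crashes the application.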
I know that "cool URIs don't change" but I am not sure if this
assumption holds in practice, especially in a long-term perspective.
Well, Null Pointers have dogged programmers for eons.
You have to test for Null Pointers (URIs) when programming for the
Linked Data Web too.
Now if a pointer is Null, you have to think about (in an
application-specific way) how to locate the same values elsewhere,
and then decide (should you find what you seek) how to persist them
in your own space (i.e., make a new URI for these values).
You can do many things with a 404 condition courtesy of SPARQL.
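To make that SPARQL angle concrete: on a 404 you can re-ask another endpoint for everything known about the vanished URI, then decide whether to persist the recovered triples under a new URI in your own space, as described above. The mirror endpoint address below is a placeholder assumption; the query-building part is the substance:

```python
# Sketch of a SPARQL fallback for a URI that now returns 404.
# ASSUMPTION: MIRROR_ENDPOINT is a hypothetical mirror holding a copy
# of the data; substitute any endpoint you trust (e.g., your own cache).
from urllib.parse import urlencode

MIRROR_ENDPOINT = "http://example.org/sparql"  # placeholder, not a real service

def fallback_query(uri):
    """SPARQL query fetching every triple whose subject is `uri`."""
    return "SELECT ?p ?o WHERE { <%s> ?p ?o }" % uri

def fallback_url(uri):
    """GET URL that asks the mirror for the vanished resource's triples."""
    params = {"query": fallback_query(uri),
              "format": "application/sparql-results+json"}
    return MIRROR_ENDPOINT + "?" + urlencode(params)
```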
Coincidentally, I touched on this resilience matter during the Beijing
Linked Data Workshop, as per this excerpt from Orri's blog post [1],
following the workshop:
"What to do when identity expires?
Giovanni of Sindice said that a document should be removed from search
if it was no longer available. Kingsley pointed out that resilience of
reference requires some way to recover data. The data web cannot be less
resilient than the document web, and there is a point to having access
to history. He recommended hooking up with the Internet Archive, since
they make long term persistence their business. In this way, if an
application depends on data, and the URIs on which it depends are no
longer dereferenceable or provide content from a new owner of the
domain, those who need the old version can still get it and host it
themselves."
For the "human" Web, several solutions have been proposed, e.g.:
1.) PURL and DOI services for translating URNs into resolvable URLs
2.) forward references
3.) robust link implementations, i.e., with each link you keep a set
of related search terms to retrieve moved / changed resources
4.) observer / notification mechanisms
X.) ?
All nice ideas. Usage will be application and scenario specific, naturally.
I guess (1) is not really applicable to LOD resources because of
scalability and single-point-of-failure issues.
If you take a closer look at the federation that EC2 affords, and how we
are making it easy for anyone to have their own Linked Data driven
knowledge bases for personal and service-specific use [2], you might spot
a little nuance: we always link back to an original source data object
URI (a form of intrinsic Attribution by Reference). The idea is that
this kind of federation ultimately builds up URI resilience in a manner
similar to general Internet resilience (you can slow it down or
inconvenience it, but never erase it, thanks to the "scale-free"
attribute of real federation).
(2) would require LOD providers to set up HTTP redirects for their
moved resources - I have no idea whether anybody will do that in
practice, or how it would scale. (3) could help re-locate moved
resources via search engines like Sindice, but not fully
automatically. (4) could at least inform a data source that certain
references are broken, so it could remove them.
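Options (2) and (4) could be combined in a simple link-maintenance policy: follow permanent redirects to rewrite stored links, and flag dead links for notification. A sketch of just the decision logic, under the assumption that the HTTP probing happens elsewhere and only status codes arrive here:

```python
def reconcile_link(stored_uri, status, location=None):
    """Decide what to do with a stored outbound link after probing it.

    Returns (uri_to_store, note):
      301/308 + Location -> the provider set up a forward reference (2):
                            rewrite the stored link to the new URI
      404/410            -> flag as broken so the owning data source can
                            be notified and can remove the reference (4)
      anything else      -> leave the link alone
    """
    if status in (301, 308) and location:
        return location, "moved"
    if status in (404, 410):
        return stored_uri, "broken"
    return stored_uri, "ok"
```

Temporary redirects (302/307) are deliberately left alone here, since they carry no promise that the resource has permanently moved.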
Another alternative is, of course, to leave the problem entirely to
application developers, which means they must consider that a
referenced resource might or might not exist. I am not sure about the
practical consequences of that approach, especially if several data
sources are involved, but I have the feeling that things get really
complicated if one cannot rely on any kind of referential integrity.
In a nutshell, yes - but this is about data architects and developers
working in concert as part of product and service delivery.
Are there any existing mechanisms that can give us at least some basic
feedback about the "quality" of an LOD data source? I think
referential integrity could be such a quality property...
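One crude but computable version of that quality property would be to probe every outbound link of a data source and report the fraction that still resolves. A sketch, assuming the status codes have already been gathered (0 standing in for a network-level failure):

```python
def integrity_score(link_statuses):
    """Fraction of outbound links that still resolve (2xx or 3xx).

    `link_statuses` maps each linked URI to the HTTP status observed
    when probing it; 0 stands for a network-level failure (DNS, timeout).
    """
    if not link_statuses:
        return 1.0  # no outbound links, so nothing can be broken
    ok = sum(1 for s in link_statuses.values() if 200 <= s < 400)
    return ok / len(link_statuses)
```

Published periodically per data source, such a score would give consuming applications at least a rough signal of how far they can rely on referential integrity.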
In an "Open World" the notion of "Quality" is inherently "Subjective".
The "Beauty & Beholder" rules apply at all scales in our universe :-)
Links:
1.
http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1347
2. http://dbpedia2.openlinksw.com:8895/resource/Berlin - localized
de-referencing and attribution link to source via owl:sameAs (all EC2
versions of DBpedia, Bio2RDF, NeuroCommons, and MusicBrainz get this.
Ditto the imminent Virtuoso Cluster Edition hosted LOD Cloud)
3.
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtInstallationEC2
- EC2 AMI Home Page
Kingsley
Thanks for your input on that issue,
Bernhard
______________________________________________________
Research Group Multimedia Information Systems
Department of Distributed and Multimedia Systems
Faculty of Computer Science
University of Vienna
Postal Address: Liebiggasse 4/3-4, 1010 Vienna, Austria
Phone: +43 1 42 77 39635 Fax: +43 1 4277 39649
E-Mail: [email protected]
WWW: http://www.cs.univie.ac.at/bernhard.haslhofer
--
Regards,
Kingsley Idehen Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO
OpenLink Software Web: http://www.openlinksw.com